* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-07-15 12:18 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-07-15 12:18 UTC (permalink / raw)
To: gentoo-commits
commit: 9f27167757173dcde5f5673d721e8dd7047df9e1
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Jul 15 12:18:08 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Jul 15 12:18:08 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=9f271677
Zero copy for infiniband psm userspace driver. ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads. Ensure that /dev/root doesn't appear in /proc/mounts when booting without an initramfs. Do not lock when UMH is waiting on current thread spawned by linuxrc (bug #481344). Bootsplash ported by Uladzimir Bely (bug #513334). Support for Pogoplug e02 (bug #460350), adjusted to be opt-in by TomWij. Add Gentoo Linux support config settings and defaults.
---
0000_README | 25 +
2400_kcopy-patch-for-infiniband-driver.patch | 731 +++++++++
2700_ThinkPad-30-brightness-control-fix.patch | 67 +
2900_dev-root-proc-mount-fix.patch | 29 +
2905_2disk-resume-image-fix.patch | 24 +
4200_fbcondecor-3.15.patch | 2119 +++++++++++++++++++++++++
4500_support-for-pogoplug-e02.patch | 172 ++
7 files changed, 3167 insertions(+)
diff --git a/0000_README b/0000_README
index 9018993..6276507 100644
--- a/0000_README
+++ b/0000_README
@@ -43,6 +43,31 @@ EXPERIMENTAL
Individual Patch Descriptions:
--------------------------------------------------------------------------
+Patch: 2400_kcopy-patch-for-infiniband-driver.patch
+From: Alexey Shvetsov <alexxy@gentoo.org>
+Desc: Zero copy for infiniband psm userspace driver
+
+Patch: 2700_ThinkPad-30-brightness-control-fix.patch
+From: Seth Forshee <seth.forshee@canonical.com>
+Desc: ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads
+
+Patch: 2900_dev-root-proc-mount-fix.patch
+From: https://bugs.gentoo.org/show_bug.cgi?id=438380
+Desc: Ensure that /dev/root doesn't appear in /proc/mounts when booting without an initramfs.
+
+Patch: 2905_2disk-resume-image-fix.patch
+From: Al Viro <viro <at> ZenIV.linux.org.uk>
+Desc: Do not lock when UMH is waiting on current thread spawned by linuxrc. (bug #481344)
+
+Patch: 4200_fbcondecor-3.15.patch
+From: http://www.mepiscommunity.org/fbcondecor
+Desc: Bootsplash ported by Uladzimir Bely (bug #513334)
+
+Patch: 4500_support-for-pogoplug-e02.patch
+From: Christoph Junghans <ottxor@gentoo.org>
+Desc: Support for Pogoplug e02 (bug #460350), adjusted to be opt-in by TomWij.
+
Patch: 4567_distro-Gentoo-Kconfig.patch
From: Tom Wijsman <TomWij@gentoo.org>
Desc: Add Gentoo Linux support config settings and defaults.
+
diff --git a/2400_kcopy-patch-for-infiniband-driver.patch b/2400_kcopy-patch-for-infiniband-driver.patch
new file mode 100644
index 0000000..759f451
--- /dev/null
+++ b/2400_kcopy-patch-for-infiniband-driver.patch
@@ -0,0 +1,731 @@
+From 1f52075d672a9bdd0069b3ea68be266ef5c229bd Mon Sep 17 00:00:00 2001
+From: Alexey Shvetsov <alexxy@gentoo.org>
+Date: Tue, 17 Jan 2012 21:08:49 +0400
+Subject: [PATCH] [kcopy] Add kcopy driver
+
+Add kcopy driver from qlogic to implement zero copy for infiniband psm
+userspace driver
+
+Signed-off-by: Alexey Shvetsov <alexxy@gentoo.org>
+---
+ drivers/char/Kconfig | 2 +
+ drivers/char/Makefile | 2 +
+ drivers/char/kcopy/Kconfig | 17 ++
+ drivers/char/kcopy/Makefile | 4 +
+ drivers/char/kcopy/kcopy.c | 646 +++++++++++++++++++++++++++++++++++++++++++
+ 5 files changed, 671 insertions(+)
+ create mode 100644 drivers/char/kcopy/Kconfig
+ create mode 100644 drivers/char/kcopy/Makefile
+ create mode 100644 drivers/char/kcopy/kcopy.c
+
+diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
+index ee94686..5b81449 100644
+--- a/drivers/char/Kconfig
++++ b/drivers/char/Kconfig
+@@ -6,6 +6,8 @@ menu "Character devices"
+
+ source "drivers/tty/Kconfig"
+
++source "drivers/char/kcopy/Kconfig"
++
+ config DEVKMEM
+ bool "/dev/kmem virtual device support"
+ default y
+diff --git a/drivers/char/Makefile b/drivers/char/Makefile
+index 0dc5d7c..be519d6 100644
+--- a/drivers/char/Makefile
++++ b/drivers/char/Makefile
+@@ -62,3 +62,5 @@
+ js-rtc-y = rtc.o
+
+ obj-$(CONFIG_TILE_SROM) += tile-srom.o
++
++obj-$(CONFIG_KCOPY) += kcopy/
+diff --git a/drivers/char/kcopy/Kconfig b/drivers/char/kcopy/Kconfig
+new file mode 100644
+index 0000000..453ae52
+--- /dev/null
++++ b/drivers/char/kcopy/Kconfig
+@@ -0,0 +1,17 @@
++#
++# KCopy character device configuration
++#
++
++menu "KCopy"
++
++config KCOPY
++ tristate "Memory-to-memory copies using kernel assist"
++ default m
++ ---help---
++ High-performance inter-process memory copies. Can often save a
++ memory copy to shared memory in the application. Useful at least
++ for MPI applications where the point-to-point nature of vmsplice
++ and pipes can be a limiting factor in performance.
++
++endmenu
++
+diff --git a/drivers/char/kcopy/Makefile b/drivers/char/kcopy/Makefile
+new file mode 100644
+index 0000000..9cb269b
+--- /dev/null
++++ b/drivers/char/kcopy/Makefile
+@@ -0,0 +1,4 @@
++#
++# Makefile for the kernel character device drivers.
++#
++obj-$(CONFIG_KCOPY) += kcopy.o
+diff --git a/drivers/char/kcopy/kcopy.c b/drivers/char/kcopy/kcopy.c
+new file mode 100644
+index 0000000..a9f915c
+--- /dev/null
++++ b/drivers/char/kcopy/kcopy.c
+@@ -0,0 +1,646 @@
++#include <linux/module.h>
++#include <linux/fs.h>
++#include <linux/cdev.h>
++#include <linux/device.h>
++#include <linux/mutex.h>
++#include <linux/mman.h>
++#include <linux/highmem.h>
++#include <linux/spinlock.h>
++#include <linux/sched.h>
++#include <linux/rbtree.h>
++#include <linux/rcupdate.h>
++#include <linux/uaccess.h>
++#include <linux/slab.h>
++
++MODULE_LICENSE("GPL");
++MODULE_AUTHOR("Arthur Jones <arthur.jones@qlogic.com>");
++MODULE_DESCRIPTION("QLogic kcopy driver");
++
++#define KCOPY_ABI 1
++#define KCOPY_MAX_MINORS 64
++
++struct kcopy_device {
++ struct cdev cdev;
++ struct class *class;
++ struct device *devp[KCOPY_MAX_MINORS];
++ dev_t dev;
++
++ struct kcopy_file *kf[KCOPY_MAX_MINORS];
++ struct mutex open_lock;
++};
++
++static struct kcopy_device kcopy_dev;
++
++/* per file data / one of these is shared per minor */
++struct kcopy_file {
++ int count;
++
++ /* pid indexed */
++ struct rb_root live_map_tree;
++
++ struct mutex map_lock;
++};
++
++struct kcopy_map_entry {
++ int count;
++ struct task_struct *task;
++ pid_t pid;
++ struct kcopy_file *file; /* file backpointer */
++
++ struct list_head list; /* free map list */
++ struct rb_node node; /* live map tree */
++};
++
++#define KCOPY_GET_SYSCALL 1
++#define KCOPY_PUT_SYSCALL 2
++#define KCOPY_ABI_SYSCALL 3
++
++struct kcopy_syscall {
++ __u32 tag;
++ pid_t pid;
++ __u64 n;
++ __u64 src;
++ __u64 dst;
++};
++
++static const void __user *kcopy_syscall_src(const struct kcopy_syscall *ks)
++{
++ return (const void __user *) (unsigned long) ks->src;
++}
++
++static void __user *kcopy_syscall_dst(const struct kcopy_syscall *ks)
++{
++ return (void __user *) (unsigned long) ks->dst;
++}
++
++static unsigned long kcopy_syscall_n(const struct kcopy_syscall *ks)
++{
++ return (unsigned long) ks->n;
++}
++
++static struct kcopy_map_entry *kcopy_create_entry(struct kcopy_file *file)
++{
++ struct kcopy_map_entry *kme =
++ kmalloc(sizeof(struct kcopy_map_entry), GFP_KERNEL);
++
++ if (!kme)
++ return NULL;
++
++ kme->count = 1;
++ kme->file = file;
++ kme->task = current;
++ kme->pid = current->tgid;
++ INIT_LIST_HEAD(&kme->list);
++
++ return kme;
++}
++
++static struct kcopy_map_entry *
++kcopy_lookup_pid(struct rb_root *root, pid_t pid)
++{
++ struct rb_node *node = root->rb_node;
++
++ while (node) {
++ struct kcopy_map_entry *kme =
++ container_of(node, struct kcopy_map_entry, node);
++
++ if (pid < kme->pid)
++ node = node->rb_left;
++ else if (pid > kme->pid)
++ node = node->rb_right;
++ else
++ return kme;
++ }
++
++ return NULL;
++}
++
++static int kcopy_insert(struct rb_root *root, struct kcopy_map_entry *kme)
++{
++ struct rb_node **new = &(root->rb_node);
++ struct rb_node *parent = NULL;
++
++ while (*new) {
++ struct kcopy_map_entry *tkme =
++ container_of(*new, struct kcopy_map_entry, node);
++
++ parent = *new;
++ if (kme->pid < tkme->pid)
++ new = &((*new)->rb_left);
++ else if (kme->pid > tkme->pid)
++ new = &((*new)->rb_right);
++ else {
++ printk(KERN_INFO "!!! debugging: bad rb tree !!!\n");
++ return -EINVAL;
++ }
++ }
++
++ rb_link_node(&kme->node, parent, new);
++ rb_insert_color(&kme->node, root);
++
++ return 0;
++}
++
++static int kcopy_open(struct inode *inode, struct file *filp)
++{
++ int ret;
++ const int minor = iminor(inode);
++ struct kcopy_file *kf = NULL;
++ struct kcopy_map_entry *kme;
++ struct kcopy_map_entry *okme;
++
++ if (minor < 0 || minor >= KCOPY_MAX_MINORS)
++ return -ENODEV;
++
++ mutex_lock(&kcopy_dev.open_lock);
++
++ if (!kcopy_dev.kf[minor]) {
++ kf = kmalloc(sizeof(struct kcopy_file), GFP_KERNEL);
++
++ if (!kf) {
++ ret = -ENOMEM;
++ goto bail;
++ }
++
++ kf->count = 1;
++ kf->live_map_tree = RB_ROOT;
++ mutex_init(&kf->map_lock);
++ kcopy_dev.kf[minor] = kf;
++ } else {
++ if (filp->f_flags & O_EXCL) {
++ ret = -EBUSY;
++ goto bail;
++ }
++ kcopy_dev.kf[minor]->count++;
++ }
++
++ kme = kcopy_create_entry(kcopy_dev.kf[minor]);
++ if (!kme) {
++ ret = -ENOMEM;
++ goto err_free_kf;
++ }
++
++ kf = kcopy_dev.kf[minor];
++
++ mutex_lock(&kf->map_lock);
++
++ okme = kcopy_lookup_pid(&kf->live_map_tree, kme->pid);
++ if (okme) {
++ /* pid already exists... */
++ okme->count++;
++ kfree(kme);
++ kme = okme;
++ } else
++ ret = kcopy_insert(&kf->live_map_tree, kme);
++
++ mutex_unlock(&kf->map_lock);
++
++ filp->private_data = kme;
++
++ ret = 0;
++ goto bail;
++
++err_free_kf:
++ if (kf) {
++ kcopy_dev.kf[minor] = NULL;
++ kfree(kf);
++ }
++bail:
++ mutex_unlock(&kcopy_dev.open_lock);
++ return ret;
++}
++
++static int kcopy_flush(struct file *filp, fl_owner_t id)
++{
++ struct kcopy_map_entry *kme = filp->private_data;
++ struct kcopy_file *kf = kme->file;
++
++ if (file_count(filp) == 1) {
++ mutex_lock(&kf->map_lock);
++ kme->count--;
++
++ if (!kme->count) {
++ rb_erase(&kme->node, &kf->live_map_tree);
++ kfree(kme);
++ }
++ mutex_unlock(&kf->map_lock);
++ }
++
++ return 0;
++}
++
++static int kcopy_release(struct inode *inode, struct file *filp)
++{
++ const int minor = iminor(inode);
++
++ mutex_lock(&kcopy_dev.open_lock);
++ kcopy_dev.kf[minor]->count--;
++ if (!kcopy_dev.kf[minor]->count) {
++ kfree(kcopy_dev.kf[minor]);
++ kcopy_dev.kf[minor] = NULL;
++ }
++ mutex_unlock(&kcopy_dev.open_lock);
++
++ return 0;
++}
++
++static void kcopy_put_pages(struct page **pages, int npages)
++{
++ int j;
++
++ for (j = 0; j < npages; j++)
++ put_page(pages[j]);
++}
++
++static int kcopy_validate_task(struct task_struct *p)
++{
++ return p && (uid_eq(current_euid(), task_euid(p)) || uid_eq(current_euid(), task_uid(p)));
++}
++
++static int kcopy_get_pages(struct kcopy_file *kf, pid_t pid,
++ struct page **pages, void __user *addr,
++ int write, size_t npages)
++{
++ int err;
++ struct mm_struct *mm;
++ struct kcopy_map_entry *rkme;
++
++ mutex_lock(&kf->map_lock);
++
++ rkme = kcopy_lookup_pid(&kf->live_map_tree, pid);
++ if (!rkme || !kcopy_validate_task(rkme->task)) {
++ err = -EINVAL;
++ goto bail_unlock;
++ }
++
++ mm = get_task_mm(rkme->task);
++ if (unlikely(!mm)) {
++ err = -ENOMEM;
++ goto bail_unlock;
++ }
++
++ down_read(&mm->mmap_sem);
++ err = get_user_pages(rkme->task, mm,
++ (unsigned long) addr, npages, write, 0,
++ pages, NULL);
++
++ if (err < npages && err > 0) {
++ kcopy_put_pages(pages, err);
++ err = -ENOMEM;
++ } else if (err == npages)
++ err = 0;
++
++ up_read(&mm->mmap_sem);
++
++ mmput(mm);
++
++bail_unlock:
++ mutex_unlock(&kf->map_lock);
++
++ return err;
++}
++
++static unsigned long kcopy_copy_pages_from_user(void __user *src,
++ struct page **dpages,
++ unsigned doff,
++ unsigned long n)
++{
++ struct page *dpage = *dpages;
++ char *daddr = kmap(dpage);
++ int ret = 0;
++
++ while (1) {
++ const unsigned long nleft = PAGE_SIZE - doff;
++ const unsigned long nc = (n < nleft) ? n : nleft;
++
++ /* if (copy_from_user(daddr + doff, src, nc)) { */
++ if (__copy_from_user_nocache(daddr + doff, src, nc)) {
++ ret = -EFAULT;
++ goto bail;
++ }
++
++ n -= nc;
++ if (n == 0)
++ break;
++
++ doff += nc;
++ doff &= ~PAGE_MASK;
++ if (doff == 0) {
++ kunmap(dpage);
++ dpages++;
++ dpage = *dpages;
++ daddr = kmap(dpage);
++ }
++
++ src += nc;
++ }
++
++bail:
++ kunmap(dpage);
++
++ return ret;
++}
++
++static unsigned long kcopy_copy_pages_to_user(void __user *dst,
++ struct page **spages,
++ unsigned soff,
++ unsigned long n)
++{
++ struct page *spage = *spages;
++ const char *saddr = kmap(spage);
++ int ret = 0;
++
++ while (1) {
++ const unsigned long nleft = PAGE_SIZE - soff;
++ const unsigned long nc = (n < nleft) ? n : nleft;
++
++ if (copy_to_user(dst, saddr + soff, nc)) {
++ ret = -EFAULT;
++ goto bail;
++ }
++
++ n -= nc;
++ if (n == 0)
++ break;
++
++ soff += nc;
++ soff &= ~PAGE_MASK;
++ if (soff == 0) {
++ kunmap(spage);
++ spages++;
++ spage = *spages;
++ saddr = kmap(spage);
++ }
++
++ dst += nc;
++ }
++
++bail:
++ kunmap(spage);
++
++ return ret;
++}
++
++static unsigned long kcopy_copy_to_user(void __user *dst,
++ struct kcopy_file *kf, pid_t pid,
++ void __user *src,
++ unsigned long n)
++{
++ struct page **pages;
++ const int pages_len = PAGE_SIZE / sizeof(struct page *);
++ int ret = 0;
++
++ pages = (struct page **) __get_free_page(GFP_KERNEL);
++ if (!pages) {
++ ret = -ENOMEM;
++ goto bail;
++ }
++
++ while (n) {
++ const unsigned long soff = (unsigned long) src & ~PAGE_MASK;
++ const unsigned long spages_left =
++ (soff + n + PAGE_SIZE - 1) >> PAGE_SHIFT;
++ const unsigned long spages_cp =
++ min_t(unsigned long, spages_left, pages_len);
++ const unsigned long sbytes =
++ PAGE_SIZE - soff + (spages_cp - 1) * PAGE_SIZE;
++ const unsigned long nbytes = min_t(unsigned long, sbytes, n);
++
++ ret = kcopy_get_pages(kf, pid, pages, src, 0, spages_cp);
++ if (unlikely(ret))
++ goto bail_free;
++
++ ret = kcopy_copy_pages_to_user(dst, pages, soff, nbytes);
++ kcopy_put_pages(pages, spages_cp);
++ if (ret)
++ goto bail_free;
++ dst = (char *) dst + nbytes;
++ src = (char *) src + nbytes;
++
++ n -= nbytes;
++ }
++
++bail_free:
++ free_page((unsigned long) pages);
++bail:
++ return ret;
++}
++
++static unsigned long kcopy_copy_from_user(const void __user *src,
++ struct kcopy_file *kf, pid_t pid,
++ void __user *dst,
++ unsigned long n)
++{
++ struct page **pages;
++ const int pages_len = PAGE_SIZE / sizeof(struct page *);
++ int ret = 0;
++
++ pages = (struct page **) __get_free_page(GFP_KERNEL);
++ if (!pages) {
++ ret = -ENOMEM;
++ goto bail;
++ }
++
++ while (n) {
++ const unsigned long doff = (unsigned long) dst & ~PAGE_MASK;
++ const unsigned long dpages_left =
++ (doff + n + PAGE_SIZE - 1) >> PAGE_SHIFT;
++ const unsigned long dpages_cp =
++ min_t(unsigned long, dpages_left, pages_len);
++ const unsigned long dbytes =
++ PAGE_SIZE - doff + (dpages_cp - 1) * PAGE_SIZE;
++ const unsigned long nbytes = min_t(unsigned long, dbytes, n);
++
++ ret = kcopy_get_pages(kf, pid, pages, dst, 1, dpages_cp);
++ if (unlikely(ret))
++ goto bail_free;
++
++ ret = kcopy_copy_pages_from_user((void __user *) src,
++ pages, doff, nbytes);
++ kcopy_put_pages(pages, dpages_cp);
++ if (ret)
++ goto bail_free;
++
++ dst = (char *) dst + nbytes;
++ src = (char *) src + nbytes;
++
++ n -= nbytes;
++ }
++
++bail_free:
++ free_page((unsigned long) pages);
++bail:
++ return ret;
++}
++
++static int kcopy_do_get(struct kcopy_map_entry *kme, pid_t pid,
++ const void __user *src, void __user *dst,
++ unsigned long n)
++{
++ struct kcopy_file *kf = kme->file;
++ int ret = 0;
++
++ if (n == 0) {
++ ret = -EINVAL;
++ goto bail;
++ }
++
++ ret = kcopy_copy_to_user(dst, kf, pid, (void __user *) src, n);
++
++bail:
++ return ret;
++}
++
++static int kcopy_do_put(struct kcopy_map_entry *kme, const void __user *src,
++ pid_t pid, void __user *dst,
++ unsigned long n)
++{
++ struct kcopy_file *kf = kme->file;
++ int ret = 0;
++
++ if (n == 0) {
++ ret = -EINVAL;
++ goto bail;
++ }
++
++ ret = kcopy_copy_from_user(src, kf, pid, (void __user *) dst, n);
++
++bail:
++ return ret;
++}
++
++static int kcopy_do_abi(u32 __user *dst)
++{
++ u32 val = KCOPY_ABI;
++ int err;
++
++ err = put_user(val, dst);
++ if (err)
++ return -EFAULT;
++
++ return 0;
++}
++
++ssize_t kcopy_write(struct file *filp, const char __user *data, size_t cnt,
++ loff_t *o)
++{
++ struct kcopy_map_entry *kme = filp->private_data;
++ struct kcopy_syscall ks;
++ int err = 0;
++ const void __user *src;
++ void __user *dst;
++ unsigned long n;
++
++ if (cnt != sizeof(struct kcopy_syscall)) {
++ err = -EINVAL;
++ goto bail;
++ }
++
++ err = copy_from_user(&ks, data, cnt);
++ if (unlikely(err))
++ goto bail;
++
++ src = kcopy_syscall_src(&ks);
++ dst = kcopy_syscall_dst(&ks);
++ n = kcopy_syscall_n(&ks);
++ if (ks.tag == KCOPY_GET_SYSCALL)
++ err = kcopy_do_get(kme, ks.pid, src, dst, n);
++ else if (ks.tag == KCOPY_PUT_SYSCALL)
++ err = kcopy_do_put(kme, src, ks.pid, dst, n);
++ else if (ks.tag == KCOPY_ABI_SYSCALL)
++ err = kcopy_do_abi(dst);
++ else
++ err = -EINVAL;
++
++bail:
++ return err ? err : cnt;
++}
++
++static const struct file_operations kcopy_fops = {
++ .owner = THIS_MODULE,
++ .open = kcopy_open,
++ .release = kcopy_release,
++ .flush = kcopy_flush,
++ .write = kcopy_write,
++};
++
++static int __init kcopy_init(void)
++{
++ int ret;
++ const char *name = "kcopy";
++ int i;
++ int ninit = 0;
++
++ mutex_init(&kcopy_dev.open_lock);
++
++ ret = alloc_chrdev_region(&kcopy_dev.dev, 0, KCOPY_MAX_MINORS, name);
++ if (ret)
++ goto bail;
++
++ kcopy_dev.class = class_create(THIS_MODULE, (char *) name);
++
++ if (IS_ERR(kcopy_dev.class)) {
++ ret = PTR_ERR(kcopy_dev.class);
++ printk(KERN_ERR "kcopy: Could not create "
++ "device class (err %d)\n", -ret);
++ goto bail_chrdev;
++ }
++
++ cdev_init(&kcopy_dev.cdev, &kcopy_fops);
++ ret = cdev_add(&kcopy_dev.cdev, kcopy_dev.dev, KCOPY_MAX_MINORS);
++ if (ret < 0) {
++ printk(KERN_ERR "kcopy: Could not add cdev (err %d)\n",
++ -ret);
++ goto bail_class;
++ }
++
++ for (i = 0; i < KCOPY_MAX_MINORS; i++) {
++ char devname[8];
++ const int minor = MINOR(kcopy_dev.dev) + i;
++ const dev_t dev = MKDEV(MAJOR(kcopy_dev.dev), minor);
++
++ snprintf(devname, sizeof(devname), "kcopy%02d", i);
++ kcopy_dev.devp[i] =
++ device_create(kcopy_dev.class, NULL,
++ dev, NULL, devname);
++
++ if (IS_ERR(kcopy_dev.devp[i])) {
++ ret = PTR_ERR(kcopy_dev.devp[i]);
++ printk(KERN_ERR "kcopy: Could not create "
++ "devp %d (err %d)\n", i, -ret);
++ goto bail_cdev_add;
++ }
++
++ ninit++;
++ }
++
++ ret = 0;
++ goto bail;
++
++bail_cdev_add:
++ for (i = 0; i < ninit; i++)
++ device_unregister(kcopy_dev.devp[i]);
++
++ cdev_del(&kcopy_dev.cdev);
++bail_class:
++ class_destroy(kcopy_dev.class);
++bail_chrdev:
++ unregister_chrdev_region(kcopy_dev.dev, KCOPY_MAX_MINORS);
++bail:
++ return ret;
++}
++
++static void __exit kcopy_fini(void)
++{
++ int i;
++
++ for (i = 0; i < KCOPY_MAX_MINORS; i++)
++ device_unregister(kcopy_dev.devp[i]);
++
++ cdev_del(&kcopy_dev.cdev);
++ class_destroy(kcopy_dev.class);
++ unregister_chrdev_region(kcopy_dev.dev, KCOPY_MAX_MINORS);
++}
++
++module_init(kcopy_init);
++module_exit(kcopy_fini);
+--
+1.7.10
+
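[Editorial note, not part of the patch: the kcopy driver above is driven entirely through write() on its character device — kcopy_write() rejects any write whose length differs from sizeof(struct kcopy_syscall). The sketch below shows how a userspace client might build such a request. The struct layout mirrors struct kcopy_syscall from kcopy.c; the fixed-width types and any device path (e.g. /dev/kcopy00) are illustrative assumptions, not part of the driver's ABI guarantees.]

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Tags from kcopy.c */
#define KCOPY_GET_SYSCALL 1
#define KCOPY_PUT_SYSCALL 2
#define KCOPY_ABI_SYSCALL 3

/* Userspace mirror of the driver's struct kcopy_syscall; the kernel side
 * uses __u32/__u64, which these fixed-width types are assumed to match. */
struct kcopy_syscall {
	uint32_t tag;
	pid_t    pid;
	uint64_t n;
	uint64_t src;
	uint64_t dst;
};

/* Build a KCOPY_GET_SYSCALL request: ask the driver to copy n bytes
 * from address peer_src in the peer process into our local_dst buffer.
 * A real client would then write() exactly sizeof(ks) bytes of this
 * struct to the kcopy device it opened. */
static struct kcopy_syscall kcopy_get_request(pid_t peer, const void *peer_src,
					      void *local_dst, uint64_t n)
{
	struct kcopy_syscall ks;

	memset(&ks, 0, sizeof(ks));
	ks.tag = KCOPY_GET_SYSCALL;
	ks.pid = peer;
	ks.n   = n;
	ks.src = (uint64_t)(uintptr_t)peer_src;
	ks.dst = (uint64_t)(uintptr_t)local_dst;
	return ks;
}
```

Note that both processes must have opened the same minor (they share one kcopy_file and its pid-indexed rb-tree), and kcopy_validate_task() restricts copies to tasks with a matching effective or real uid.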
diff --git a/2700_ThinkPad-30-brightness-control-fix.patch b/2700_ThinkPad-30-brightness-control-fix.patch
new file mode 100644
index 0000000..b548c6d
--- /dev/null
+++ b/2700_ThinkPad-30-brightness-control-fix.patch
@@ -0,0 +1,67 @@
+diff --git a/drivers/acpi/blacklist.c b/drivers/acpi/blacklist.c
+index cb96296..6c242ed 100644
+--- a/drivers/acpi/blacklist.c
++++ b/drivers/acpi/blacklist.c
+@@ -269,6 +276,61 @@ static struct dmi_system_id acpi_osi_dmi_table[] __initdata = {
+ },
+
+ /*
++ * The following Lenovo models have a broken workaround in the
++ * acpi_video backlight implementation to meet the Windows 8
++ * requirement of 101 backlight levels. Reverting to pre-Win8
++ * behavior fixes the problem.
++ */
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad L430",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad L430"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad T430s",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T430s"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad T530",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T530"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad W530",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad W530"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad X1 Carbon",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X1 Carbon"),
++ },
++ },
++ {
++ .callback = dmi_disable_osi_win8,
++ .ident = "Lenovo ThinkPad X230",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X230"),
++ },
++ },
++
++ /*
+ * BIOS invocation of _OSI(Linux) is almost always a BIOS bug.
+ * Linux ignores it, except for the machines enumerated below.
+ */
+
diff --git a/2900_dev-root-proc-mount-fix.patch b/2900_dev-root-proc-mount-fix.patch
new file mode 100644
index 0000000..4c89adf
--- /dev/null
+++ b/2900_dev-root-proc-mount-fix.patch
@@ -0,0 +1,29 @@
+--- a/init/do_mounts.c 2013-01-25 19:11:11.609802424 -0500
++++ b/init/do_mounts.c 2013-01-25 19:14:20.606053568 -0500
+@@ -461,7 +461,10 @@ void __init change_floppy(char *fmt, ...
+ va_start(args, fmt);
+ vsprintf(buf, fmt, args);
+ va_end(args);
+- fd = sys_open("/dev/root", O_RDWR | O_NDELAY, 0);
++ if (saved_root_name[0])
++ fd = sys_open(saved_root_name, O_RDWR | O_NDELAY, 0);
++ else
++ fd = sys_open("/dev/root", O_RDWR | O_NDELAY, 0);
+ if (fd >= 0) {
+ sys_ioctl(fd, FDEJECT, 0);
+ sys_close(fd);
+@@ -505,7 +508,13 @@ void __init mount_root(void)
+ #endif
+ #ifdef CONFIG_BLOCK
+ create_dev("/dev/root", ROOT_DEV);
+- mount_block_root("/dev/root", root_mountflags);
++ if (saved_root_name[0]) {
++ create_dev(saved_root_name, ROOT_DEV);
++ mount_block_root(saved_root_name, root_mountflags);
++ } else {
++ create_dev("/dev/root", ROOT_DEV);
++ mount_block_root("/dev/root", root_mountflags);
++ }
+ #endif
+ }
+
diff --git a/2905_2disk-resume-image-fix.patch b/2905_2disk-resume-image-fix.patch
new file mode 100644
index 0000000..7e95d29
--- /dev/null
+++ b/2905_2disk-resume-image-fix.patch
@@ -0,0 +1,24 @@
+diff --git a/kernel/kmod.c b/kernel/kmod.c
+index fb32636..d968882 100644
+--- a/kernel/kmod.c
++++ b/kernel/kmod.c
+@@ -575,7 +575,8 @@
+ call_usermodehelper_freeinfo(sub_info);
+ return -EINVAL;
+ }
+- helper_lock();
++ if (!(current->flags & PF_FREEZER_SKIP))
++ helper_lock();
+ if (!khelper_wq || usermodehelper_disabled) {
+ retval = -EBUSY;
+ goto out;
+@@ -611,7 +612,8 @@ wait_done:
+ out:
+ call_usermodehelper_freeinfo(sub_info);
+ unlock:
+- helper_unlock();
++ if (!(current->flags & PF_FREEZER_SKIP))
++ helper_unlock();
+ return retval;
+ }
+ EXPORT_SYMBOL(call_usermodehelper_exec);
diff --git a/4200_fbcondecor-3.15.patch b/4200_fbcondecor-3.15.patch
new file mode 100644
index 0000000..c96e5dc
--- /dev/null
+++ b/4200_fbcondecor-3.15.patch
@@ -0,0 +1,2119 @@
+diff --git a/Documentation/fb/00-INDEX b/Documentation/fb/00-INDEX
+index fe85e7c..2230930 100644
+--- a/Documentation/fb/00-INDEX
++++ b/Documentation/fb/00-INDEX
+@@ -23,6 +23,8 @@ ep93xx-fb.txt
+ - info on the driver for EP93xx LCD controller.
+ fbcon.txt
+ - intro to and usage guide for the framebuffer console (fbcon).
++fbcondecor.txt
++ - info on the Framebuffer Console Decoration
+ framebuffer.txt
+ - introduction to frame buffer devices.
+ gxfb.txt
+diff --git a/Documentation/fb/fbcondecor.txt b/Documentation/fb/fbcondecor.txt
+new file mode 100644
+index 0000000..3388c61
+--- /dev/null
++++ b/Documentation/fb/fbcondecor.txt
+@@ -0,0 +1,207 @@
++What is it?
++-----------
++
++The framebuffer decorations are a kernel feature which allows displaying a
++background picture on selected consoles.
++
++What do I need to get it to work?
++---------------------------------
++
++To get fbcondecor up-and-running you will have to:
++ 1) get a copy of splashutils [1] or a similar program
++ 2) get some fbcondecor themes
++ 3) build the kernel helper program
++ 4) build your kernel with the FB_CON_DECOR option enabled.
++
++To get fbcondecor operational right after fbcon initialization is finished, you
++will have to include a theme and the kernel helper into your initramfs image.
++Please refer to splashutils documentation for instructions on how to do that.
++
++[1] The splashutils package can be downloaded from:
++ http://github.com/alanhaggai/fbsplash
++
++The userspace helper
++--------------------
++
++The userspace fbcondecor helper (by default: /sbin/fbcondecor_helper) is called by the
++kernel whenever an important event occurs and the kernel needs some kind of
++job to be carried out. Important events include console switches and video
++mode switches (the kernel requests background images and configuration
++parameters for the current console). The fbcondecor helper must be accessible at
++all times. If it's not, fbcondecor will be switched off automatically.
++
++It's possible to set the path to the fbcondecor helper by writing it to
++/proc/sys/kernel/fbcondecor.
++
++*****************************************************************************
++
++The information below is mostly technical stuff. There's probably no need to
++read it unless you plan to develop a userspace helper.
++
++The fbcondecor protocol
++-----------------------
++
++The fbcondecor protocol defines a communication interface between the kernel and
++the userspace fbcondecor helper.
++
++The kernel side is responsible for:
++
++ * rendering console text, using an image as a background (instead of a
++ standard solid color fbcon uses),
++ * accepting commands from the user via ioctls on the fbcondecor device,
++ * calling the userspace helper to set things up as soon as the fb subsystem
++ is initialized.
++
++The userspace helper is responsible for everything else, including parsing
++configuration files, decompressing the image files whenever the kernel needs
++it, and communicating with the kernel if necessary.
++
++The fbcondecor protocol specifies how communication is done in both ways:
++kernel->userspace and userspace->kernel.
++
++Kernel -> Userspace
++-------------------
++
++The kernel communicates with the userspace helper by calling it and specifying
++the task to be done in a series of arguments.
++
++The arguments follow the pattern:
++<fbcondecor protocol version> <command> <parameters>
++
++All commands defined in fbcondecor protocol v2 have the following parameters:
++ virtual console
++ framebuffer number
++ theme
++
++Fbcondecor protocol v1 specified an additional 'fbcondecor mode' after the
++framebuffer number. Fbcondecor protocol v1 is deprecated and should not be used.
++
++Fbcondecor protocol v2 specifies the following commands:
++
++getpic
++------
++ The kernel issues this command to request image data. It's up to the
++ userspace helper to find a background image appropriate for the specified
++ theme and the current resolution. The userspace helper should respond by
++ issuing the FBIOCONDECOR_SETPIC ioctl.
++
++init
++----
++ The kernel issues this command after the fbcondecor device is created and
++ the fbcondecor interface is initialized. Upon receiving 'init', the userspace
++ helper should parse the kernel command line (/proc/cmdline) or otherwise
++ decide whether fbcondecor is to be activated.
++
++ To activate fbcondecor on the first console the helper should issue the
++ FBIOCONDECOR_SETCFG, FBIOCONDECOR_SETPIC and FBIOCONDECOR_SETSTATE commands,
++ in the above-mentioned order.
++
++ When the userspace helper is called in an early phase of the boot process
++ (right after the initialization of fbcon), no filesystems will be mounted.
++ The helper program should mount sysfs and then create the appropriate
++ framebuffer, fbcondecor and tty0 devices (if they don't already exist) to get
++ current display settings and to be able to communicate with the kernel side.
++ It should probably also mount the procfs to be able to parse the kernel
++ command line parameters.
++
++ Note that the console sem is not held when the kernel calls fbcondecor_helper
++ with the 'init' command. The fbcondecor helper should perform all ioctls with
++ origin set to FBCON_DECOR_IO_ORIG_USER.
++
++modechange
++----------
++ The kernel issues this command on a mode change. The helper's response should
++ be similar to the response to the 'init' command. Note that this time the
++ console sem is held and all ioctls must be performed with origin set to
++ FBCON_DECOR_IO_ORIG_KERNEL.
++
++
++Userspace -> Kernel
++-------------------
++
++Userspace programs can communicate with fbcondecor via ioctls on the
++fbcondecor device. These ioctls are to be used by both the userspace helper
++(called only by the kernel) and userspace configuration tools (run by the users).
++
++The fbcondecor helper should set the origin field to FBCON_DECOR_IO_ORIG_KERNEL
++when doing the appropriate ioctls. All userspace configuration tools should
++use FBCON_DECOR_IO_ORIG_USER. Failure to set the appropriate value in the origin
++field when performing ioctls from the kernel helper will most likely result
++in a console deadlock.
++
++FBCON_DECOR_IO_ORIG_KERNEL instructs fbcondecor not to try to acquire the console
++semaphore. Not surprisingly, FBCON_DECOR_IO_ORIG_USER instructs it to acquire
++the console sem.
++
++The framebuffer console decoration provides the following ioctls (all defined in
++linux/fb.h):
++
++FBIOCONDECOR_SETPIC
++description: loads a background picture for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: struct fb_image*
++notes:
++If called for consoles other than the current foreground one, the picture data
++will be ignored.
++
++If the current virtual console is running in an 8-bpp mode, the cmap substruct
++of fb_image has to be filled appropriately: start should be set to 16 (first
++16 colors are reserved for fbcon), len to a value <= 240 and red, green and
++blue should point to valid cmap data. The transp field is ignored. The fields
++dx, dy, bg_color, fg_color in fb_image are ignored as well.
++
++FBIOCONDECOR_SETCFG
++description: sets the fbcondecor config for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: struct vc_decor*
++notes: The structure has to be filled with valid data.
++
++FBIOCONDECOR_GETCFG
++description: gets the fbcondecor config for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: struct vc_decor*
++
++FBIOCONDECOR_SETSTATE
++description: sets the fbcondecor state for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: unsigned int*
++ values: 0 = disabled, 1 = enabled.
++
++FBIOCONDECOR_GETSTATE
++description: gets the fbcondecor state for a virtual console
++argument: struct fbcon_decor_iowrapper*; data: unsigned int*
++ values: as in FBIOCONDECOR_SETSTATE
++
++Info on used structures:
++
++Definition of struct vc_decor can be found in linux/console_decor.h. It's
++heavily commented. Note that the 'theme' field should point to a string
++no longer than FBCON_DECOR_THEME_LEN. When FBIOCONDECOR_GETCFG call is
++performed, the theme field should point to a char buffer of length
++FBCON_DECOR_THEME_LEN.
++
++Definition of struct fbcon_decor_iowrapper can be found in linux/fb.h.
++The fields in this struct have the following meaning:
++
++vc:
++Virtual console number.
++
++origin:
++Specifies whether the ioctl is performed in response to a kernel request. The
++fbcondecor helper should set this field to FBCON_DECOR_IO_ORIG_KERNEL; userspace
++programs should set it to FBCON_DECOR_IO_ORIG_USER. This field is necessary to
++avoid console semaphore deadlocks.
++
++data:
++Pointer to a data structure appropriate for the performed ioctl. The type of
++the data struct is specified in each ioctl's description.
++
++*****************************************************************************
++
++Credit
++------
++
++Original 'bootsplash' project & implementation by:
++ Volker Poplawski <volker@poplawski.de>, Stefan Reinauer <stepan@suse.de>,
++ Steffen Winterfeldt <snwint@suse.de>, Michael Schroeder <mls@suse.de>,
++ Ken Wimer <wimer@suse.de>.
++
++Fbcondecor, fbcondecor protocol design, current implementation & docs by:
++ Michal Januszewski <michalj+fbcondecor@gmail.com>
++
+diff --git a/drivers/Makefile b/drivers/Makefile
+index 7183b6a..d576148 100644
+--- a/drivers/Makefile
++++ b/drivers/Makefile
+@@ -17,6 +17,10 @@ obj-y += pwm/
+ obj-$(CONFIG_PCI) += pci/
+ obj-$(CONFIG_PARISC) += parisc/
+ obj-$(CONFIG_RAPIDIO) += rapidio/
++# tty/ comes before char/ so that the VT console is the boot-time
++# default.
++obj-y += tty/
++obj-y += char/
+ obj-y += video/
+ obj-y += idle/
+
+@@ -42,11 +46,6 @@ obj-$(CONFIG_REGULATOR) += regulator/
+ # reset controllers early, since gpu drivers might rely on them to initialize
+ obj-$(CONFIG_RESET_CONTROLLER) += reset/
+
+-# tty/ comes before char/ so that the VT console is the boot-time
+-# default.
+-obj-y += tty/
+-obj-y += char/
+-
+ # gpu/ comes after char for AGP vs DRM startup
+ obj-y += gpu/
+
+diff --git a/drivers/video/console/Kconfig b/drivers/video/console/Kconfig
+index fe1cd01..6d2e87a 100644
+--- a/drivers/video/console/Kconfig
++++ b/drivers/video/console/Kconfig
+@@ -126,6 +126,19 @@ config FRAMEBUFFER_CONSOLE_ROTATION
+ such that other users of the framebuffer will remain normally
+ oriented.
+
++config FB_CON_DECOR
++ bool "Support for the Framebuffer Console Decorations"
++ depends on FRAMEBUFFER_CONSOLE=y && !FB_TILEBLITTING
++ default n
++ ---help---
++ This option enables support for framebuffer console decorations which
++ makes it possible to display images in the background of the system
++ consoles. Note that userspace utilities are necessary in order to take
++ advantage of these features. Refer to Documentation/fb/fbcondecor.txt
++ for more information.
++
++ If unsure, say N.
++
+ config STI_CONSOLE
+ bool "STI text console"
+ depends on PARISC
+diff --git a/drivers/video/console/Makefile b/drivers/video/console/Makefile
+index 43bfa48..cc104b6 100644
+--- a/drivers/video/console/Makefile
++++ b/drivers/video/console/Makefile
+@@ -16,4 +16,5 @@ obj-$(CONFIG_FRAMEBUFFER_CONSOLE) += fbcon_rotate.o fbcon_cw.o fbcon_ud.o \
+ fbcon_ccw.o
+ endif
+
++obj-$(CONFIG_FB_CON_DECOR) += fbcondecor.o cfbcondecor.o
+ obj-$(CONFIG_FB_STI) += sticore.o
+diff --git a/drivers/video/console/bitblit.c b/drivers/video/console/bitblit.c
+index 61b182b..984384b 100644
+--- a/drivers/video/console/bitblit.c
++++ b/drivers/video/console/bitblit.c
+@@ -18,6 +18,7 @@
+ #include <linux/console.h>
+ #include <asm/types.h>
+ #include "fbcon.h"
++#include "fbcondecor.h"
+
+ /*
+ * Accelerated handlers.
+@@ -55,6 +56,13 @@ static void bit_bmove(struct vc_data *vc, struct fb_info *info, int sy,
+ area.height = height * vc->vc_font.height;
+ area.width = width * vc->vc_font.width;
+
++ if (fbcon_decor_active(info, vc)) {
++ area.sx += vc->vc_decor.tx;
++ area.sy += vc->vc_decor.ty;
++ area.dx += vc->vc_decor.tx;
++ area.dy += vc->vc_decor.ty;
++ }
++
+ info->fbops->fb_copyarea(info, &area);
+ }
+
+@@ -380,11 +388,15 @@ static void bit_cursor(struct vc_data *vc, struct fb_info *info, int mode,
+ cursor.image.depth = 1;
+ cursor.rop = ROP_XOR;
+
+- if (info->fbops->fb_cursor)
+- err = info->fbops->fb_cursor(info, &cursor);
++ if (fbcon_decor_active(info, vc)) {
++ fbcon_decor_cursor(info, &cursor);
++ } else {
++ if (info->fbops->fb_cursor)
++ err = info->fbops->fb_cursor(info, &cursor);
+
+- if (err)
+- soft_cursor(info, &cursor);
++ if (err)
++ soft_cursor(info, &cursor);
++ }
+
+ ops->cursor_reset = 0;
+ }
+diff --git a/drivers/video/console/cfbcondecor.c b/drivers/video/console/cfbcondecor.c
+new file mode 100644
+index 0000000..a2b4497
+--- /dev/null
++++ b/drivers/video/console/cfbcondecor.c
+@@ -0,0 +1,471 @@
++/*
++ * linux/drivers/video/cfbcon_decor.c -- Framebuffer decor render functions
++ *
++ * Copyright (C) 2004 Michal Januszewski <michalj+fbcondecor@gmail.com>
++ *
++ * Code based upon "Bootsplash" (C) 2001-2003
++ * Volker Poplawski <volker@poplawski.de>,
++ * Stefan Reinauer <stepan@suse.de>,
++ * Steffen Winterfeldt <snwint@suse.de>,
++ * Michael Schroeder <mls@suse.de>,
++ * Ken Wimer <wimer@suse.de>.
++ *
++ * This file is subject to the terms and conditions of the GNU General Public
++ * License. See the file COPYING in the main directory of this archive for
++ * more details.
++ */
++#include <linux/module.h>
++#include <linux/types.h>
++#include <linux/fb.h>
++#include <linux/selection.h>
++#include <linux/slab.h>
++#include <linux/vt_kern.h>
++#include <asm/irq.h>
++
++#include "fbcon.h"
++#include "fbcondecor.h"
++
++#define parse_pixel(shift,bpp,type) \
++ do { \
++ if (d & (0x80 >> (shift))) \
++ dd2[(shift)] = fgx; \
++ else \
++ dd2[(shift)] = transparent ? *(type *)decor_src : bgx; \
++ decor_src += (bpp); \
++ } while (0) \
++
++extern int get_color(struct vc_data *vc, struct fb_info *info,
++ u16 c, int is_fg);
++
++void fbcon_decor_fix_pseudo_pal(struct fb_info *info, struct vc_data *vc)
++{
++ int i, j, k;
++ int minlen = min(min(info->var.red.length, info->var.green.length),
++ info->var.blue.length);
++ u32 col;
++
++ for (j = i = 0; i < 16; i++) {
++ k = color_table[i];
++
++ col = ((vc->vc_palette[j++] >> (8-minlen))
++ << info->var.red.offset);
++ col |= ((vc->vc_palette[j++] >> (8-minlen))
++ << info->var.green.offset);
++ col |= ((vc->vc_palette[j++] >> (8-minlen))
++ << info->var.blue.offset);
++ ((u32 *)info->pseudo_palette)[k] = col;
++ }
++}
++
++void fbcon_decor_renderc(struct fb_info *info, int ypos, int xpos, int height,
++ int width, u8* src, u32 fgx, u32 bgx, u8 transparent)
++{
++ unsigned int x, y;
++ u32 dd;
++ int bytespp = ((info->var.bits_per_pixel + 7) >> 3);
++ unsigned int d = ypos * info->fix.line_length + xpos * bytespp;
++ unsigned int ds = (ypos * info->var.xres + xpos) * bytespp;
++ u16 dd2[4];
++
++ u8* decor_src = (u8 *)(info->bgdecor.data + ds);
++ u8* dst = (u8 *)(info->screen_base + d);
++
++ if ((ypos + height) > info->var.yres || (xpos + width) > info->var.xres)
++ return;
++
++ for (y = 0; y < height; y++) {
++ switch (info->var.bits_per_pixel) {
++
++ case 32:
++ for (x = 0; x < width; x++) {
++
++ if ((x & 7) == 0)
++ d = *src++;
++ if (d & 0x80)
++ dd = fgx;
++ else
++ dd = transparent ?
++ *(u32 *)decor_src : bgx;
++
++ d <<= 1;
++ decor_src += 4;
++ fb_writel(dd, dst);
++ dst += 4;
++ }
++ break;
++ case 24:
++ for (x = 0; x < width; x++) {
++
++ if ((x & 7) == 0)
++ d = *src++;
++ if (d & 0x80)
++ dd = fgx;
++ else
++ dd = transparent ?
++ (*(u32 *)decor_src & 0xffffff) : bgx;
++
++ d <<= 1;
++ decor_src += 3;
++#ifdef __LITTLE_ENDIAN
++ fb_writew(dd & 0xffff, dst);
++ dst += 2;
++ fb_writeb((dd >> 16), dst);
++#else
++ fb_writew(dd >> 8, dst);
++ dst += 2;
++ fb_writeb(dd & 0xff, dst);
++#endif
++ dst++;
++ }
++ break;
++ case 16:
++ for (x = 0; x < width; x += 2) {
++ if ((x & 7) == 0)
++ d = *src++;
++
++ parse_pixel(0, 2, u16);
++ parse_pixel(1, 2, u16);
++#ifdef __LITTLE_ENDIAN
++ dd = dd2[0] | (dd2[1] << 16);
++#else
++ dd = dd2[1] | (dd2[0] << 16);
++#endif
++ d <<= 2;
++ fb_writel(dd, dst);
++ dst += 4;
++ }
++ break;
++
++ case 8:
++ for (x = 0; x < width; x += 4) {
++ if ((x & 7) == 0)
++ d = *src++;
++
++ parse_pixel(0, 1, u8);
++ parse_pixel(1, 1, u8);
++ parse_pixel(2, 1, u8);
++ parse_pixel(3, 1, u8);
++
++#ifdef __LITTLE_ENDIAN
++ dd = dd2[0] | (dd2[1] << 8) | (dd2[2] << 16) | (dd2[3] << 24);
++#else
++ dd = dd2[3] | (dd2[2] << 8) | (dd2[1] << 16) | (dd2[0] << 24);
++#endif
++ d <<= 4;
++ fb_writel(dd, dst);
++ dst += 4;
++ }
++ }
++
++ dst += info->fix.line_length - width * bytespp;
++ decor_src += (info->var.xres - width) * bytespp;
++ }
++}
++
++#define cc2cx(a) \
++ ((info->fix.visual == FB_VISUAL_TRUECOLOR || \
++ info->fix.visual == FB_VISUAL_DIRECTCOLOR) ? \
++ ((u32*)info->pseudo_palette)[a] : a)
++
++void fbcon_decor_putcs(struct vc_data *vc, struct fb_info *info,
++ const unsigned short *s, int count, int yy, int xx)
++{
++ unsigned short charmask = vc->vc_hi_font_mask ? 0x1ff : 0xff;
++ struct fbcon_ops *ops = info->fbcon_par;
++ int fg_color, bg_color, transparent;
++ u8 *src;
++ u32 bgx, fgx;
++ u16 c = scr_readw(s);
++
++ fg_color = get_color(vc, info, c, 1);
++ bg_color = get_color(vc, info, c, 0);
++
++ /* Don't paint the background image if console is blanked */
++ transparent = ops->blank_state ? 0 :
++ (vc->vc_decor.bg_color == bg_color);
++
++ xx = xx * vc->vc_font.width + vc->vc_decor.tx;
++ yy = yy * vc->vc_font.height + vc->vc_decor.ty;
++
++ fgx = cc2cx(fg_color);
++ bgx = cc2cx(bg_color);
++
++ while (count--) {
++ c = scr_readw(s++);
++ src = vc->vc_font.data + (c & charmask) * vc->vc_font.height *
++ ((vc->vc_font.width + 7) >> 3);
++
++ fbcon_decor_renderc(info, yy, xx, vc->vc_font.height,
++ vc->vc_font.width, src, fgx, bgx, transparent);
++ xx += vc->vc_font.width;
++ }
++}
++
++void fbcon_decor_cursor(struct fb_info *info, struct fb_cursor *cursor)
++{
++ int i;
++ unsigned int dsize, s_pitch;
++ struct fbcon_ops *ops = info->fbcon_par;
++ struct vc_data* vc;
++ u8 *src;
++
++ /* we really don't need any cursors while the console is blanked */
++ if (info->state != FBINFO_STATE_RUNNING || ops->blank_state)
++ return;
++
++ vc = vc_cons[ops->currcon].d;
++
++ src = kmalloc(64 + sizeof(struct fb_image), GFP_ATOMIC);
++ if (!src)
++ return;
++
++ s_pitch = (cursor->image.width + 7) >> 3;
++ dsize = s_pitch * cursor->image.height;
++ if (cursor->enable) {
++ switch (cursor->rop) {
++ case ROP_XOR:
++ for (i = 0; i < dsize; i++)
++ src[i] = cursor->image.data[i] ^ cursor->mask[i];
++ break;
++ case ROP_COPY:
++ default:
++ for (i = 0; i < dsize; i++)
++ src[i] = cursor->image.data[i] & cursor->mask[i];
++ break;
++ }
++ } else
++ memcpy(src, cursor->image.data, dsize);
++
++ fbcon_decor_renderc(info,
++ cursor->image.dy + vc->vc_decor.ty,
++ cursor->image.dx + vc->vc_decor.tx,
++ cursor->image.height,
++ cursor->image.width,
++ (u8*)src,
++ cc2cx(cursor->image.fg_color),
++ cc2cx(cursor->image.bg_color),
++ cursor->image.bg_color == vc->vc_decor.bg_color);
++
++ kfree(src);
++}
++
++static void decorset(u8 *dst, int height, int width, int dstbytes,
++ u32 bgx, int bpp)
++{
++ int i;
++
++ if (bpp == 8)
++ bgx |= bgx << 8;
++ if (bpp == 16 || bpp == 8)
++ bgx |= bgx << 16;
++
++ while (height-- > 0) {
++ u8 *p = dst;
++
++ switch (bpp) {
++
++ case 32:
++ for (i=0; i < width; i++) {
++ fb_writel(bgx, p); p += 4;
++ }
++ break;
++ case 24:
++ for (i=0; i < width; i++) {
++#ifdef __LITTLE_ENDIAN
++ fb_writew((bgx & 0xffff),(u16*)p); p += 2;
++ fb_writeb((bgx >> 16),p++);
++#else
++ fb_writew((bgx >> 8),(u16*)p); p += 2;
++ fb_writeb((bgx & 0xff),p++);
++#endif
++ }
++ case 16:
++ for (i=0; i < width/4; i++) {
++ fb_writel(bgx,p); p += 4;
++ fb_writel(bgx,p); p += 4;
++ }
++ if (width & 2) {
++ fb_writel(bgx,p); p += 4;
++ }
++ if (width & 1)
++ fb_writew(bgx,(u16*)p);
++ break;
++ case 8:
++ for (i=0; i < width/4; i++) {
++ fb_writel(bgx,p); p += 4;
++ }
++
++ if (width & 2) {
++ fb_writew(bgx,p); p += 2;
++ }
++ if (width & 1)
++ fb_writeb(bgx,(u8*)p);
++ break;
++
++ }
++ dst += dstbytes;
++ }
++}
++
++void fbcon_decor_copy(u8 *dst, u8 *src, int height, int width, int linebytes,
++ int srclinebytes, int bpp)
++{
++ int i;
++
++ while (height-- > 0) {
++ u32 *p = (u32 *)dst;
++ u32 *q = (u32 *)src;
++
++ switch (bpp) {
++
++ case 32:
++ for (i=0; i < width; i++)
++ fb_writel(*q++, p++);
++ break;
++ case 24:
++ for (i=0; i < (width*3/4); i++)
++ fb_writel(*q++, p++);
++ if ((width*3) % 4) {
++ if (width & 2) {
++ fb_writeb(*(u8*)q, (u8*)p);
++ } else if (width & 1) {
++ fb_writew(*(u16*)q, (u16*)p);
++ fb_writeb(*(u8*)((u16*)q+1),(u8*)((u16*)p+2));
++ }
++ }
++ break;
++ case 16:
++ for (i=0; i < width/4; i++) {
++ fb_writel(*q++, p++);
++ fb_writel(*q++, p++);
++ }
++ if (width & 2)
++ fb_writel(*q++, p++);
++ if (width & 1)
++ fb_writew(*(u16*)q, (u16*)p);
++ break;
++ case 8:
++ for (i=0; i < width/4; i++)
++ fb_writel(*q++, p++);
++
++ if (width & 2) {
++ fb_writew(*(u16*)q, (u16*)p);
++ q = (u32*) ((u16*)q + 1);
++ p = (u32*) ((u16*)p + 1);
++ }
++ if (width & 1)
++ fb_writeb(*(u8*)q, (u8*)p);
++ break;
++ }
++
++ dst += linebytes;
++ src += srclinebytes;
++ }
++}
++
++static void decorfill(struct fb_info *info, int sy, int sx, int height,
++ int width)
++{
++ int bytespp = ((info->var.bits_per_pixel + 7) >> 3);
++ int d = sy * info->fix.line_length + sx * bytespp;
++ int ds = (sy * info->var.xres + sx) * bytespp;
++
++ fbcon_decor_copy((u8 *)(info->screen_base + d), (u8 *)(info->bgdecor.data + ds),
++ height, width, info->fix.line_length, info->var.xres * bytespp,
++ info->var.bits_per_pixel);
++}
++
++void fbcon_decor_clear(struct vc_data *vc, struct fb_info *info, int sy, int sx,
++ int height, int width)
++{
++ int bgshift = (vc->vc_hi_font_mask) ? 13 : 12;
++ struct fbcon_ops *ops = info->fbcon_par;
++ u8 *dst;
++ int transparent, bg_color = attr_bgcol_ec(bgshift, vc, info);
++
++ transparent = (vc->vc_decor.bg_color == bg_color);
++ sy = sy * vc->vc_font.height + vc->vc_decor.ty;
++ sx = sx * vc->vc_font.width + vc->vc_decor.tx;
++ height *= vc->vc_font.height;
++ width *= vc->vc_font.width;
++
++ /* Don't paint the background image if console is blanked */
++ if (transparent && !ops->blank_state) {
++ decorfill(info, sy, sx, height, width);
++ } else {
++ dst = (u8 *)(info->screen_base + sy * info->fix.line_length +
++ sx * ((info->var.bits_per_pixel + 7) >> 3));
++ decorset(dst, height, width, info->fix.line_length, cc2cx(bg_color),
++ info->var.bits_per_pixel);
++ }
++}
++
++void fbcon_decor_clear_margins(struct vc_data *vc, struct fb_info *info,
++ int bottom_only)
++{
++ unsigned int tw = vc->vc_cols*vc->vc_font.width;
++ unsigned int th = vc->vc_rows*vc->vc_font.height;
++
++ if (!bottom_only) {
++ /* top margin */
++ decorfill(info, 0, 0, vc->vc_decor.ty, info->var.xres);
++ /* left margin */
++ decorfill(info, vc->vc_decor.ty, 0, th, vc->vc_decor.tx);
++ /* right margin */
++ decorfill(info, vc->vc_decor.ty, vc->vc_decor.tx + tw, th,
++ info->var.xres - vc->vc_decor.tx - tw);
++ }
++ decorfill(info, vc->vc_decor.ty + th, 0,
++ info->var.yres - vc->vc_decor.ty - th, info->var.xres);
++}
++
++void fbcon_decor_bmove_redraw(struct vc_data *vc, struct fb_info *info, int y,
++ int sx, int dx, int width)
++{
++ u16 *d = (u16 *) (vc->vc_origin + vc->vc_size_row * y + dx * 2);
++ u16 *s = d + (dx - sx);
++ u16 *start = d;
++ u16 *ls = d;
++ u16 *le = d + width;
++ u16 c;
++ int x = dx;
++ u16 attr = 1;
++
++ do {
++ c = scr_readw(d);
++ if (attr != (c & 0xff00)) {
++ attr = c & 0xff00;
++ if (d > start) {
++ fbcon_decor_putcs(vc, info, start, d - start, y, x);
++ x += d - start;
++ start = d;
++ }
++ }
++ if (s >= ls && s < le && c == scr_readw(s)) {
++ if (d > start) {
++ fbcon_decor_putcs(vc, info, start, d - start, y, x);
++ x += d - start + 1;
++ start = d + 1;
++ } else {
++ x++;
++ start++;
++ }
++ }
++ s++;
++ d++;
++ } while (d < le);
++ if (d > start)
++ fbcon_decor_putcs(vc, info, start, d - start, y, x);
++}
++
++void fbcon_decor_blank(struct vc_data *vc, struct fb_info *info, int blank)
++{
++ if (blank) {
++ decorset((u8 *)info->screen_base, info->var.yres, info->var.xres,
++ info->fix.line_length, 0, info->var.bits_per_pixel);
++ } else {
++ update_screen(vc);
++ fbcon_decor_clear_margins(vc, info, 0);
++ }
++}
++
+diff --git a/drivers/video/console/fbcon.c b/drivers/video/console/fbcon.c
+index f447734..1a840c2 100644
+--- a/drivers/video/console/fbcon.c
++++ b/drivers/video/console/fbcon.c
+@@ -79,6 +79,7 @@
+ #include <asm/irq.h>
+
+ #include "fbcon.h"
++#include "fbcondecor.h"
+
+ #ifdef FBCONDEBUG
+ # define DPRINTK(fmt, args...) printk(KERN_DEBUG "%s: " fmt, __func__ , ## args)
+@@ -94,7 +95,7 @@ enum {
+
+ static struct display fb_display[MAX_NR_CONSOLES];
+
+-static signed char con2fb_map[MAX_NR_CONSOLES];
++signed char con2fb_map[MAX_NR_CONSOLES];
+ static signed char con2fb_map_boot[MAX_NR_CONSOLES];
+
+ static int logo_lines;
+@@ -286,7 +287,7 @@ static inline int fbcon_is_inactive(struct vc_data *vc, struct fb_info *info)
+ !vt_force_oops_output(vc);
+ }
+
+-static int get_color(struct vc_data *vc, struct fb_info *info,
++int get_color(struct vc_data *vc, struct fb_info *info,
+ u16 c, int is_fg)
+ {
+ int depth = fb_get_color_depth(&info->var, &info->fix);
+@@ -551,6 +552,9 @@ static int do_fbcon_takeover(int show_logo)
+ info_idx = -1;
+ } else {
+ fbcon_has_console_bind = 1;
++#ifdef CONFIG_FB_CON_DECOR
++ fbcon_decor_init();
++#endif
+ }
+
+ return err;
+@@ -1007,6 +1011,12 @@ static const char *fbcon_startup(void)
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
+ cols /= vc->vc_font.width;
+ rows /= vc->vc_font.height;
++
++ if (fbcon_decor_active(info, vc)) {
++ cols = vc->vc_decor.twidth / vc->vc_font.width;
++ rows = vc->vc_decor.theight / vc->vc_font.height;
++ }
++
+ vc_resize(vc, cols, rows);
+
+ DPRINTK("mode: %s\n", info->fix.id);
+@@ -1036,7 +1046,7 @@ static void fbcon_init(struct vc_data *vc, int init)
+ cap = info->flags;
+
+ if (vc != svc || logo_shown == FBCON_LOGO_DONTSHOW ||
+- (info->fix.type == FB_TYPE_TEXT))
++ (info->fix.type == FB_TYPE_TEXT) || fbcon_decor_active(info, vc))
+ logo = 0;
+
+ if (var_to_display(p, &info->var, info))
+@@ -1260,6 +1270,11 @@ static void fbcon_clear(struct vc_data *vc, int sy, int sx, int height,
+ fbcon_clear_margins(vc, 0);
+ }
+
++ if (fbcon_decor_active(info, vc)) {
++ fbcon_decor_clear(vc, info, sy, sx, height, width);
++ return;
++ }
++
+ /* Split blits that cross physical y_wrap boundary */
+
+ y_break = p->vrows - p->yscroll;
+@@ -1279,10 +1294,15 @@ static void fbcon_putcs(struct vc_data *vc, const unsigned short *s,
+ struct display *p = &fb_display[vc->vc_num];
+ struct fbcon_ops *ops = info->fbcon_par;
+
+- if (!fbcon_is_inactive(vc, info))
+- ops->putcs(vc, info, s, count, real_y(p, ypos), xpos,
+- get_color(vc, info, scr_readw(s), 1),
+- get_color(vc, info, scr_readw(s), 0));
++ if (!fbcon_is_inactive(vc, info)) {
++
++ if (fbcon_decor_active(info, vc))
++ fbcon_decor_putcs(vc, info, s, count, ypos, xpos);
++ else
++ ops->putcs(vc, info, s, count, real_y(p, ypos), xpos,
++ get_color(vc, info, scr_readw(s), 1),
++ get_color(vc, info, scr_readw(s), 0));
++ }
+ }
+
+ static void fbcon_putc(struct vc_data *vc, int c, int ypos, int xpos)
+@@ -1298,8 +1318,13 @@ static void fbcon_clear_margins(struct vc_data *vc, int bottom_only)
+ struct fb_info *info = registered_fb[con2fb_map[vc->vc_num]];
+ struct fbcon_ops *ops = info->fbcon_par;
+
+- if (!fbcon_is_inactive(vc, info))
+- ops->clear_margins(vc, info, bottom_only);
++ if (!fbcon_is_inactive(vc, info)) {
++ if (fbcon_decor_active(info, vc)) {
++ fbcon_decor_clear_margins(vc, info, bottom_only);
++ } else {
++ ops->clear_margins(vc, info, bottom_only);
++ }
++ }
+ }
+
+ static void fbcon_cursor(struct vc_data *vc, int mode)
+@@ -1819,7 +1844,7 @@ static int fbcon_scroll(struct vc_data *vc, int t, int b, int dir,
+ count = vc->vc_rows;
+ if (softback_top)
+ fbcon_softback_note(vc, t, count);
+- if (logo_shown >= 0)
++ if (logo_shown >= 0 || fbcon_decor_active(info, vc))
+ goto redraw_up;
+ switch (p->scrollmode) {
+ case SCROLL_MOVE:
+@@ -1912,6 +1937,8 @@ static int fbcon_scroll(struct vc_data *vc, int t, int b, int dir,
+ count = vc->vc_rows;
+ if (logo_shown >= 0)
+ goto redraw_down;
++ if (fbcon_decor_active(info, vc))
++ goto redraw_down;
+ switch (p->scrollmode) {
+ case SCROLL_MOVE:
+ fbcon_redraw_blit(vc, info, p, b - 1, b - t - count,
+@@ -2060,6 +2087,13 @@ static void fbcon_bmove_rec(struct vc_data *vc, struct display *p, int sy, int s
+ }
+ return;
+ }
++
++ if (fbcon_decor_active(info, vc) && sy == dy && height == 1) {
++ /* must use slower redraw bmove to keep background pic intact */
++ fbcon_decor_bmove_redraw(vc, info, sy, sx, dx, width);
++ return;
++ }
++
+ ops->bmove(vc, info, real_y(p, sy), sx, real_y(p, dy), dx,
+ height, width);
+ }
+@@ -2130,8 +2164,8 @@ static int fbcon_resize(struct vc_data *vc, unsigned int width,
+ var.yres = virt_h * virt_fh;
+ x_diff = info->var.xres - var.xres;
+ y_diff = info->var.yres - var.yres;
+- if (x_diff < 0 || x_diff > virt_fw ||
+- y_diff < 0 || y_diff > virt_fh) {
++ if ((x_diff < 0 || x_diff > virt_fw ||
++ y_diff < 0 || y_diff > virt_fh) && !vc->vc_decor.state) {
+ const struct fb_videomode *mode;
+
+ DPRINTK("attempting resize %ix%i\n", var.xres, var.yres);
+@@ -2167,6 +2201,21 @@ static int fbcon_switch(struct vc_data *vc)
+
+ info = registered_fb[con2fb_map[vc->vc_num]];
+ ops = info->fbcon_par;
++ prev_console = ops->currcon;
++ if (prev_console != -1)
++ old_info = registered_fb[con2fb_map[prev_console]];
++
++#ifdef CONFIG_FB_CON_DECOR
++ if (!fbcon_decor_active_vc(vc) && info->fix.visual == FB_VISUAL_DIRECTCOLOR) {
++ struct vc_data *vc_curr = vc_cons[prev_console].d;
++ if (vc_curr && fbcon_decor_active_vc(vc_curr)) {
++ /* Clear the screen to avoid displaying funky colors during
++ * palette updates. */
++ memset((u8*)info->screen_base + info->fix.line_length * info->var.yoffset,
++ 0, info->var.yres * info->fix.line_length);
++ }
++ }
++#endif
+
+ if (softback_top) {
+ if (softback_lines)
+@@ -2185,9 +2234,6 @@ static int fbcon_switch(struct vc_data *vc)
+ logo_shown = FBCON_LOGO_CANSHOW;
+ }
+
+- prev_console = ops->currcon;
+- if (prev_console != -1)
+- old_info = registered_fb[con2fb_map[prev_console]];
+ /*
+ * FIXME: If we have multiple fbdev's loaded, we need to
+ * update all info->currcon. Perhaps, we can place this
+@@ -2231,6 +2277,18 @@ static int fbcon_switch(struct vc_data *vc)
+ fbcon_del_cursor_timer(old_info);
+ }
+
++ if (fbcon_decor_active_vc(vc)) {
++ struct vc_data *vc_curr = vc_cons[prev_console].d;
++
++ if (!vc_curr->vc_decor.theme ||
++ strcmp(vc->vc_decor.theme, vc_curr->vc_decor.theme) ||
++ (fbcon_decor_active_nores(info, vc_curr) &&
++ !fbcon_decor_active(info, vc_curr))) {
++ fbcon_decor_disable(vc, 0);
++ fbcon_decor_call_helper("modechange", vc->vc_num);
++ }
++ }
++
+ if (fbcon_is_inactive(vc, info) ||
+ ops->blank_state != FB_BLANK_UNBLANK)
+ fbcon_del_cursor_timer(info);
+@@ -2339,15 +2397,20 @@ static int fbcon_blank(struct vc_data *vc, int blank, int mode_switch)
+ }
+ }
+
+- if (!fbcon_is_inactive(vc, info)) {
++ if (!fbcon_is_inactive(vc, info)) {
+ if (ops->blank_state != blank) {
+ ops->blank_state = blank;
+ fbcon_cursor(vc, blank ? CM_ERASE : CM_DRAW);
+ ops->cursor_flash = (!blank);
+
+- if (!(info->flags & FBINFO_MISC_USEREVENT))
+- if (fb_blank(info, blank))
+- fbcon_generic_blank(vc, info, blank);
++ if (!(info->flags & FBINFO_MISC_USEREVENT)) {
++ if (fb_blank(info, blank)) {
++ if (fbcon_decor_active(info, vc))
++ fbcon_decor_blank(vc, info, blank);
++ else
++ fbcon_generic_blank(vc, info, blank);
++ }
++ }
+ }
+
+ if (!blank)
+@@ -2522,13 +2585,22 @@ static int fbcon_do_set_font(struct vc_data *vc, int w, int h,
+ }
+
+ if (resize) {
++ /* reset wrap/pan */
+ int cols, rows;
+
+ cols = FBCON_SWAP(ops->rotate, info->var.xres, info->var.yres);
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
++
++ if (fbcon_decor_active(info, vc)) {
++ info->var.xoffset = info->var.yoffset = p->yscroll = 0;
++ cols = vc->vc_decor.twidth;
++ rows = vc->vc_decor.theight;
++ }
+ cols /= w;
+ rows /= h;
++
+ vc_resize(vc, cols, rows);
++
+ if (CON_IS_VISIBLE(vc) && softback_buf)
+ fbcon_update_softback(vc);
+ } else if (CON_IS_VISIBLE(vc)
+@@ -2657,7 +2729,11 @@ static int fbcon_set_palette(struct vc_data *vc, unsigned char *table)
+ int i, j, k, depth;
+ u8 val;
+
+- if (fbcon_is_inactive(vc, info))
++ if (fbcon_is_inactive(vc, info)
++#ifdef CONFIG_FB_CON_DECOR
++ || vc->vc_num != fg_console
++#endif
++ )
+ return -EINVAL;
+
+ if (!CON_IS_VISIBLE(vc))
+@@ -2683,14 +2759,56 @@ static int fbcon_set_palette(struct vc_data *vc, unsigned char *table)
+ } else
+ fb_copy_cmap(fb_default_cmap(1 << depth), &palette_cmap);
+
+- return fb_set_cmap(&palette_cmap, info);
++ if (fbcon_decor_active(info, vc_cons[fg_console].d) &&
++ info->fix.visual == FB_VISUAL_DIRECTCOLOR) {
++
++ u16 *red, *green, *blue;
++ int minlen = min(min(info->var.red.length, info->var.green.length),
++ info->var.blue.length);
++ int h;
++
++ struct fb_cmap cmap = {
++ .start = 0,
++ .len = (1 << minlen),
++ .red = NULL,
++ .green = NULL,
++ .blue = NULL,
++ .transp = NULL
++ };
++
++ red = kmalloc(256 * sizeof(u16) * 3, GFP_KERNEL);
++
++ if (!red)
++ goto out;
++
++ green = red + 256;
++ blue = green + 256;
++ cmap.red = red;
++ cmap.green = green;
++ cmap.blue = blue;
++
++ for (i = 0; i < cmap.len; i++) {
++ red[i] = green[i] = blue[i] = (0xffff * i)/(cmap.len-1);
++ }
++
++ h = fb_set_cmap(&cmap, info);
++ fbcon_decor_fix_pseudo_pal(info, vc_cons[fg_console].d);
++ kfree(red);
++
++ return h;
++
++ } else if (fbcon_decor_active(info, vc_cons[fg_console].d) &&
++ info->var.bits_per_pixel == 8 && info->bgdecor.cmap.red != NULL)
++ fb_set_cmap(&info->bgdecor.cmap, info);
++
++out: return fb_set_cmap(&palette_cmap, info);
+ }
+
+ static u16 *fbcon_screen_pos(struct vc_data *vc, int offset)
+ {
+ unsigned long p;
+ int line;
+-
++
+ if (vc->vc_num != fg_console || !softback_lines)
+ return (u16 *) (vc->vc_origin + offset);
+ line = offset / vc->vc_size_row;
+@@ -2909,7 +3027,14 @@ static void fbcon_modechanged(struct fb_info *info)
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
+ cols /= vc->vc_font.width;
+ rows /= vc->vc_font.height;
+- vc_resize(vc, cols, rows);
++
++ if (!fbcon_decor_active_nores(info, vc)) {
++ vc_resize(vc, cols, rows);
++ } else {
++ fbcon_decor_disable(vc, 0);
++ fbcon_decor_call_helper("modechange", vc->vc_num);
++ }
++
+ updatescrollmode(p, info, vc);
+ scrollback_max = 0;
+ scrollback_current = 0;
+@@ -2954,7 +3079,9 @@ static void fbcon_set_all_vcs(struct fb_info *info)
+ rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres);
+ cols /= vc->vc_font.width;
+ rows /= vc->vc_font.height;
+- vc_resize(vc, cols, rows);
++ if (!fbcon_decor_active_nores(info, vc)) {
++ vc_resize(vc, cols, rows);
++ }
+ }
+
+ if (fg != -1)
+@@ -3596,6 +3723,7 @@ static void fbcon_exit(void)
+ }
+ }
+
++ fbcon_decor_exit();
+ fbcon_has_exited = 1;
+ }
+
+diff --git a/drivers/video/console/fbcondecor.c b/drivers/video/console/fbcondecor.c
+new file mode 100644
+index 0000000..babc8c5
+--- /dev/null
++++ b/drivers/video/console/fbcondecor.c
+@@ -0,0 +1,555 @@
++/*
++ * linux/drivers/video/console/fbcondecor.c -- Framebuffer console decorations
++ *
++ * Copyright (C) 2004-2009 Michal Januszewski <michalj+fbcondecor@gmail.com>
++ *
++ * Code based upon "Bootsplash" (C) 2001-2003
++ * Volker Poplawski <volker@poplawski.de>,
++ * Stefan Reinauer <stepan@suse.de>,
++ * Steffen Winterfeldt <snwint@suse.de>,
++ * Michael Schroeder <mls@suse.de>,
++ * Ken Wimer <wimer@suse.de>.
++ *
++ * Compat ioctl support by Thorsten Klein <TK@Thorsten-Klein.de>.
++ *
++ * This file is subject to the terms and conditions of the GNU General Public
++ * License. See the file COPYING in the main directory of this archive for
++ * more details.
++ *
++ */
++#include <linux/module.h>
++#include <linux/kernel.h>
++#include <linux/string.h>
++#include <linux/types.h>
++#include <linux/fb.h>
++#include <linux/vt_kern.h>
++#include <linux/vmalloc.h>
++#include <linux/unistd.h>
++#include <linux/syscalls.h>
++#include <linux/init.h>
++#include <linux/proc_fs.h>
++#include <linux/workqueue.h>
++#include <linux/kmod.h>
++#include <linux/miscdevice.h>
++#include <linux/device.h>
++#include <linux/fs.h>
++#include <linux/compat.h>
++#include <linux/console.h>
++
++#include <asm/uaccess.h>
++#include <asm/irq.h>
++
++#include "fbcon.h"
++#include "fbcondecor.h"
++
++extern signed char con2fb_map[];
++static int fbcon_decor_enable(struct vc_data *vc);
++char fbcon_decor_path[KMOD_PATH_LEN] = "/sbin/fbcondecor_helper";
++static int initialized = 0;
++
++int fbcon_decor_call_helper(char* cmd, unsigned short vc)
++{
++ char *envp[] = {
++ "HOME=/",
++ "PATH=/sbin:/bin",
++ NULL
++ };
++
++ char tfb[5];
++ char tcons[5];
++ unsigned char fb = (int) con2fb_map[vc];
++
++ char *argv[] = {
++ fbcon_decor_path,
++ "2",
++ cmd,
++ tcons,
++ tfb,
++ vc_cons[vc].d->vc_decor.theme,
++ NULL
++ };
++
++ snprintf(tfb,5,"%d",fb);
++ snprintf(tcons,5,"%d",vc);
++
++ return call_usermodehelper(fbcon_decor_path, argv, envp, UMH_WAIT_EXEC);
++}
++
++/* Disables fbcondecor on a virtual console; called with console sem held. */
++int fbcon_decor_disable(struct vc_data *vc, unsigned char redraw)
++{
++ struct fb_info* info;
++
++ if (!vc->vc_decor.state)
++ return -EINVAL;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (info == NULL)
++ return -EINVAL;
++
++ vc->vc_decor.state = 0;
++ vc_resize(vc, info->var.xres / vc->vc_font.width,
++ info->var.yres / vc->vc_font.height);
++
++ if (fg_console == vc->vc_num && redraw) {
++ redraw_screen(vc, 0);
++ update_region(vc, vc->vc_origin +
++ vc->vc_size_row * vc->vc_top,
++ vc->vc_size_row * (vc->vc_bottom - vc->vc_top) / 2);
++ }
++
++ printk(KERN_INFO "fbcondecor: switched decor state to 'off' on console %d\n",
++ vc->vc_num);
++
++ return 0;
++}
++
++/* Enables fbcondecor on a virtual console; called with console sem held. */
++static int fbcon_decor_enable(struct vc_data *vc)
++{
++ struct fb_info* info;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (vc->vc_decor.twidth == 0 || vc->vc_decor.theight == 0 ||
++ info == NULL || vc->vc_decor.state || (!info->bgdecor.data &&
++ vc->vc_num == fg_console))
++ return -EINVAL;
++
++ vc->vc_decor.state = 1;
++ vc_resize(vc, vc->vc_decor.twidth / vc->vc_font.width,
++ vc->vc_decor.theight / vc->vc_font.height);
++
++ if (fg_console == vc->vc_num) {
++ redraw_screen(vc, 0);
++ update_region(vc, vc->vc_origin +
++ vc->vc_size_row * vc->vc_top,
++ vc->vc_size_row * (vc->vc_bottom - vc->vc_top) / 2);
++ fbcon_decor_clear_margins(vc, info, 0);
++ }
++
++ printk(KERN_INFO "fbcondecor: switched decor state to 'on' on console %d\n",
++ vc->vc_num);
++
++ return 0;
++}
++
++static inline int fbcon_decor_ioctl_dosetstate(struct vc_data *vc, unsigned int state, unsigned char origin)
++{
++ int ret;
++
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_lock();
++ if (!state)
++ ret = fbcon_decor_disable(vc, 1);
++ else
++ ret = fbcon_decor_enable(vc);
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_unlock();
++
++ return ret;
++}
++
++static inline void fbcon_decor_ioctl_dogetstate(struct vc_data *vc, unsigned int *state)
++{
++ *state = vc->vc_decor.state;
++}
++
++static int fbcon_decor_ioctl_dosetcfg(struct vc_data *vc, struct vc_decor *cfg, unsigned char origin)
++{
++ struct fb_info *info;
++ int len;
++ char *tmp;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (info == NULL || !cfg->twidth || !cfg->theight ||
++ cfg->tx + cfg->twidth > info->var.xres ||
++ cfg->ty + cfg->theight > info->var.yres)
++ return -EINVAL;
++
++ len = strlen_user(cfg->theme);
++ if (!len || len > FBCON_DECOR_THEME_LEN)
++ return -EINVAL;
++ tmp = kmalloc(len, GFP_KERNEL);
++ if (!tmp)
++ return -ENOMEM;
++ if (copy_from_user(tmp, (void __user *)cfg->theme, len))
++ return -EFAULT;
++ cfg->theme = tmp;
++ cfg->state = 0;
++
++ /* If this ioctl is a response to a request from the kernel, the console
++ * sem is already held. We also don't need to disable the decor: either the
++ * new config and background picture load successfully and the decor stays
++ * on, or, on failure, it is turned off in fbcon. */
++// if (origin == FBCON_DECOR_IO_ORIG_USER) {
++ console_lock();
++ if (vc->vc_decor.state)
++ fbcon_decor_disable(vc, 1);
++// }
++
++ if (vc->vc_decor.theme)
++ kfree(vc->vc_decor.theme);
++
++ vc->vc_decor = *cfg;
++
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_unlock();
++
++ printk(KERN_INFO "fbcondecor: console %d using theme '%s'\n",
++ vc->vc_num, vc->vc_decor.theme);
++ return 0;
++}
++
++static int fbcon_decor_ioctl_dogetcfg(struct vc_data *vc, struct vc_decor *decor)
++{
++ char __user *tmp;
++
++ tmp = decor->theme;
++ *decor = vc->vc_decor;
++ decor->theme = tmp;
++
++ if (vc->vc_decor.theme) {
++ if (copy_to_user(tmp, vc->vc_decor.theme, strlen(vc->vc_decor.theme) + 1))
++ return -EFAULT;
++ } else
++ if (put_user(0, tmp))
++ return -EFAULT;
++
++ return 0;
++}
++
++static int fbcon_decor_ioctl_dosetpic(struct vc_data *vc, struct fb_image *img, unsigned char origin)
++{
++ struct fb_info *info;
++ int len;
++ u8 *tmp;
++
++ if (vc->vc_num != fg_console)
++ return -EINVAL;
++
++ info = registered_fb[(int) con2fb_map[vc->vc_num]];
++
++ if (info == NULL)
++ return -EINVAL;
++
++ if (img->width != info->var.xres || img->height != info->var.yres) {
++ printk(KERN_ERR "fbcondecor: picture dimensions mismatch\n");
++ printk(KERN_ERR "%dx%d vs %dx%d\n", img->width, img->height, info->var.xres, info->var.yres);
++ return -EINVAL;
++ }
++
++ if (img->depth != info->var.bits_per_pixel) {
++ printk(KERN_ERR "fbcondecor: picture depth mismatch\n");
++ return -EINVAL;
++ }
++
++ if (img->depth == 8) {
++ if (!img->cmap.len || !img->cmap.red || !img->cmap.green ||
++ !img->cmap.blue)
++ return -EINVAL;
++
++ tmp = vmalloc(img->cmap.len * 3 * 2);
++ if (!tmp)
++ return -ENOMEM;
++
++ if (copy_from_user(tmp,
++ (void __user*)img->cmap.red, (img->cmap.len << 1)) ||
++ copy_from_user(tmp + (img->cmap.len << 1),
++ (void __user*)img->cmap.green, (img->cmap.len << 1)) ||
++ copy_from_user(tmp + (img->cmap.len << 2),
++ (void __user*)img->cmap.blue, (img->cmap.len << 1))) {
++ vfree(tmp);
++ return -EFAULT;
++ }
++
++ img->cmap.transp = NULL;
++ img->cmap.red = (u16*)tmp;
++ img->cmap.green = img->cmap.red + img->cmap.len;
++ img->cmap.blue = img->cmap.green + img->cmap.len;
++ } else {
++ img->cmap.red = NULL;
++ }
++
++ len = ((img->depth + 7) >> 3) * img->width * img->height;
++
++ /*
++ * Allocate an additional byte so that we never go outside of the
++ * buffer boundaries in the rendering functions in a 24 bpp mode.
++ */
++ tmp = vmalloc(len + 1);
++
++ if (!tmp)
++ goto out;
++
++ if (copy_from_user(tmp, (void __user*)img->data, len))
++ goto out;
++
++ img->data = tmp;
++
++ /* If this ioctl is a response to a request from kernel, the console sem
++ * is already held. */
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_lock();
++
++ if (info->bgdecor.data)
++ vfree((u8*)info->bgdecor.data);
++ if (info->bgdecor.cmap.red)
++ vfree(info->bgdecor.cmap.red);
++
++ info->bgdecor = *img;
++
++ if (fbcon_decor_active_vc(vc) && fg_console == vc->vc_num) {
++ redraw_screen(vc, 0);
++ update_region(vc, vc->vc_origin +
++ vc->vc_size_row * vc->vc_top,
++ vc->vc_size_row * (vc->vc_bottom - vc->vc_top) / 2);
++ fbcon_decor_clear_margins(vc, info, 0);
++ }
++
++// if (origin == FBCON_DECOR_IO_ORIG_USER)
++ console_unlock();
++
++ return 0;
++
++out: if (img->cmap.red)
++ vfree(img->cmap.red);
++
++ if (tmp)
++ vfree(tmp);
++ return -ENOMEM;
++}
++
++static long fbcon_decor_ioctl(struct file *filp, u_int cmd, u_long arg)
++{
++ struct fbcon_decor_iowrapper __user *wrapper = (void __user*) arg;
++ struct vc_data *vc = NULL;
++ unsigned short vc_num = 0;
++ unsigned char origin = 0;
++ void __user *data = NULL;
++
++ if (!access_ok(VERIFY_READ, wrapper,
++ sizeof(struct fbcon_decor_iowrapper)))
++ return -EFAULT;
++
++ __get_user(vc_num, &wrapper->vc);
++ __get_user(origin, &wrapper->origin);
++ __get_user(data, &wrapper->data);
++
++ if (!vc_cons_allocated(vc_num))
++ return -EINVAL;
++
++ vc = vc_cons[vc_num].d;
++
++ switch (cmd) {
++ case FBIOCONDECOR_SETPIC:
++ {
++ struct fb_image img;
++ if (copy_from_user(&img, (struct fb_image __user *)data, sizeof(struct fb_image)))
++ return -EFAULT;
++
++ return fbcon_decor_ioctl_dosetpic(vc, &img, origin);
++ }
++ case FBIOCONDECOR_SETCFG:
++ {
++ struct vc_decor cfg;
++ if (copy_from_user(&cfg, (struct vc_decor __user *)data, sizeof(struct vc_decor)))
++ return -EFAULT;
++
++ return fbcon_decor_ioctl_dosetcfg(vc, &cfg, origin);
++ }
++ case FBIOCONDECOR_GETCFG:
++ {
++ int rval;
++ struct vc_decor cfg;
++
++ if (copy_from_user(&cfg, (struct vc_decor __user *)data, sizeof(struct vc_decor)))
++ return -EFAULT;
++
++ rval = fbcon_decor_ioctl_dogetcfg(vc, &cfg);
++
++ if (copy_to_user(data, &cfg, sizeof(struct vc_decor)))
++ return -EFAULT;
++ return rval;
++ }
++ case FBIOCONDECOR_SETSTATE:
++ {
++ unsigned int state = 0;
++ if (get_user(state, (unsigned int __user *)data))
++ return -EFAULT;
++ return fbcon_decor_ioctl_dosetstate(vc, state, origin);
++ }
++ case FBIOCONDECOR_GETSTATE:
++ {
++ unsigned int state = 0;
++ fbcon_decor_ioctl_dogetstate(vc, &state);
++ return put_user(state, (unsigned int __user *)data);
++ }
++
++ default:
++ return -ENOIOCTLCMD;
++ }
++}
++
++#ifdef CONFIG_COMPAT
++
++static long fbcon_decor_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) {
++
++ struct fbcon_decor_iowrapper32 __user *wrapper = (void __user *)arg;
++ struct vc_data *vc = NULL;
++ unsigned short vc_num = 0;
++ unsigned char origin = 0;
++ compat_uptr_t data_compat = 0;
++ void __user *data = NULL;
++
++ if (!access_ok(VERIFY_READ, wrapper,
++ sizeof(struct fbcon_decor_iowrapper32)))
++ return -EFAULT;
++
++ __get_user(vc_num, &wrapper->vc);
++ __get_user(origin, &wrapper->origin);
++ __get_user(data_compat, &wrapper->data);
++ data = compat_ptr(data_compat);
++
++ if (!vc_cons_allocated(vc_num))
++ return -EINVAL;
++
++ vc = vc_cons[vc_num].d;
++
++ switch (cmd) {
++ case FBIOCONDECOR_SETPIC32:
++ {
++ struct fb_image32 img_compat;
++ struct fb_image img;
++
++ if (copy_from_user(&img_compat, (struct fb_image32 __user *)data, sizeof(struct fb_image32)))
++ return -EFAULT;
++
++ fb_image_from_compat(img, img_compat);
++
++ return fbcon_decor_ioctl_dosetpic(vc, &img, origin);
++ }
++
++ case FBIOCONDECOR_SETCFG32:
++ {
++ struct vc_decor32 cfg_compat;
++ struct vc_decor cfg;
++
++ if (copy_from_user(&cfg_compat, (struct vc_decor32 __user *)data, sizeof(struct vc_decor32)))
++ return -EFAULT;
++
++ vc_decor_from_compat(cfg, cfg_compat);
++
++ return fbcon_decor_ioctl_dosetcfg(vc, &cfg, origin);
++ }
++
++ case FBIOCONDECOR_GETCFG32:
++ {
++ int rval;
++ struct vc_decor32 cfg_compat;
++ struct vc_decor cfg;
++
++ if (copy_from_user(&cfg_compat, (struct vc_decor32 __user *)data, sizeof(struct vc_decor32)))
++ return -EFAULT;
++ cfg.theme = compat_ptr(cfg_compat.theme);
++
++ rval = fbcon_decor_ioctl_dogetcfg(vc, &cfg);
++
++ vc_decor_to_compat(cfg_compat, cfg);
++
++ if (copy_to_user((struct vc_decor32 __user *)data, &cfg_compat, sizeof(struct vc_decor32)))
++ return -EFAULT;
++ return rval;
++ }
++
++ case FBIOCONDECOR_SETSTATE32:
++ {
++ compat_uint_t state_compat = 0;
++ unsigned int state = 0;
++
++ if (get_user(state_compat, (compat_uint_t __user *)data))
++ return -EFAULT;
++
++ state = (unsigned int)state_compat;
++
++ return fbcon_decor_ioctl_dosetstate(vc, state, origin);
++ }
++
++ case FBIOCONDECOR_GETSTATE32:
++ {
++ compat_uint_t state_compat = 0;
++ unsigned int state = 0;
++
++ fbcon_decor_ioctl_dogetstate(vc, &state);
++ state_compat = (compat_uint_t)state;
++
++ return put_user(state_compat, (compat_uint_t __user *)data);
++ }
++
++ default:
++ return -ENOIOCTLCMD;
++ }
++}
++#else
++ #define fbcon_decor_compat_ioctl NULL
++#endif
++
++static struct file_operations fbcon_decor_ops = {
++ .owner = THIS_MODULE,
++ .unlocked_ioctl = fbcon_decor_ioctl,
++ .compat_ioctl = fbcon_decor_compat_ioctl
++};
++
++static struct miscdevice fbcon_decor_dev = {
++ .minor = MISC_DYNAMIC_MINOR,
++ .name = "fbcondecor",
++ .fops = &fbcon_decor_ops
++};
++
++void fbcon_decor_reset(void)
++{
++ int i;
++
++ for (i = 0; i < num_registered_fb; i++) {
++ registered_fb[i]->bgdecor.data = NULL;
++ registered_fb[i]->bgdecor.cmap.red = NULL;
++ }
++
++ for (i = 0; i < MAX_NR_CONSOLES && vc_cons[i].d; i++) {
++ vc_cons[i].d->vc_decor.state = vc_cons[i].d->vc_decor.twidth =
++ vc_cons[i].d->vc_decor.theight = 0;
++ vc_cons[i].d->vc_decor.theme = NULL;
++ }
++
++ return;
++}
++
++int fbcon_decor_init(void)
++{
++ int i;
++
++ fbcon_decor_reset();
++
++ if (initialized)
++ return 0;
++
++ i = misc_register(&fbcon_decor_dev);
++ if (i) {
++ printk(KERN_ERR "fbcondecor: failed to register device\n");
++ return i;
++ }
++
++ fbcon_decor_call_helper("init", 0);
++ initialized = 1;
++ return 0;
++}
++
++int fbcon_decor_exit(void)
++{
++ fbcon_decor_reset();
++ return 0;
++}
++
++EXPORT_SYMBOL(fbcon_decor_path);
+diff --git a/drivers/video/console/fbcondecor.h b/drivers/video/console/fbcondecor.h
+new file mode 100644
+index 0000000..3b3724b
+--- /dev/null
++++ b/drivers/video/console/fbcondecor.h
+@@ -0,0 +1,78 @@
++/*
++ * linux/drivers/video/console/fbcondecor.h -- Framebuffer Console Decoration headers
++ *
++ * Copyright (C) 2004 Michal Januszewski <michalj+fbcondecor@gmail.com>
++ *
++ */
++
++#ifndef __FBCON_DECOR_H
++#define __FBCON_DECOR_H
++
++#ifndef _LINUX_FB_H
++#include <linux/fb.h>
++#endif
++
++/* This is needed for vc_cons in fbcmap.c */
++#include <linux/vt_kern.h>
++
++struct fb_cursor;
++struct fb_info;
++struct vc_data;
++
++#ifdef CONFIG_FB_CON_DECOR
++/* fbcondecor.c */
++int fbcon_decor_init(void);
++int fbcon_decor_exit(void);
++int fbcon_decor_call_helper(char* cmd, unsigned short cons);
++int fbcon_decor_disable(struct vc_data *vc, unsigned char redraw);
++
++/* cfbcondecor.c */
++void fbcon_decor_putcs(struct vc_data *vc, struct fb_info *info, const unsigned short *s, int count, int yy, int xx);
++void fbcon_decor_cursor(struct fb_info *info, struct fb_cursor *cursor);
++void fbcon_decor_clear(struct vc_data *vc, struct fb_info *info, int sy, int sx, int height, int width);
++void fbcon_decor_clear_margins(struct vc_data *vc, struct fb_info *info, int bottom_only);
++void fbcon_decor_blank(struct vc_data *vc, struct fb_info *info, int blank);
++void fbcon_decor_bmove_redraw(struct vc_data *vc, struct fb_info *info, int y, int sx, int dx, int width);
++void fbcon_decor_copy(u8 *dst, u8 *src, int height, int width, int linebytes, int srclinesbytes, int bpp);
++void fbcon_decor_fix_pseudo_pal(struct fb_info *info, struct vc_data *vc);
++
++/* vt.c */
++void acquire_console_sem(void);
++void release_console_sem(void);
++void do_unblank_screen(int entering_gfx);
++
++/* struct vc_data *y */
++#define fbcon_decor_active_vc(y) (y->vc_decor.state && y->vc_decor.theme)
++
++/* struct fb_info *x, struct vc_data *y */
++#define fbcon_decor_active_nores(x,y) (x->bgdecor.data && fbcon_decor_active_vc(y))
++
++/* struct fb_info *x, struct vc_data *y */
++#define fbcon_decor_active(x,y) (fbcon_decor_active_nores(x,y) && \
++ x->bgdecor.width == x->var.xres && \
++ x->bgdecor.height == x->var.yres && \
++ x->bgdecor.depth == x->var.bits_per_pixel)
++
++
++#else /* CONFIG_FB_CON_DECOR */
++
++static inline void fbcon_decor_putcs(struct vc_data *vc, struct fb_info *info, const unsigned short *s, int count, int yy, int xx) {}
++static inline void fbcon_decor_putc(struct vc_data *vc, struct fb_info *info, int c, int ypos, int xpos) {}
++static inline void fbcon_decor_cursor(struct fb_info *info, struct fb_cursor *cursor) {}
++static inline void fbcon_decor_clear(struct vc_data *vc, struct fb_info *info, int sy, int sx, int height, int width) {}
++static inline void fbcon_decor_clear_margins(struct vc_data *vc, struct fb_info *info, int bottom_only) {}
++static inline void fbcon_decor_blank(struct vc_data *vc, struct fb_info *info, int blank) {}
++static inline void fbcon_decor_bmove_redraw(struct vc_data *vc, struct fb_info *info, int y, int sx, int dx, int width) {}
++static inline void fbcon_decor_fix_pseudo_pal(struct fb_info *info, struct vc_data *vc) {}
++static inline int fbcon_decor_call_helper(char* cmd, unsigned short cons) { return 0; }
++static inline int fbcon_decor_init(void) { return 0; }
++static inline int fbcon_decor_exit(void) { return 0; }
++static inline int fbcon_decor_disable(struct vc_data *vc, unsigned char redraw) { return 0; }
++
++#define fbcon_decor_active_vc(y) (0)
++#define fbcon_decor_active_nores(x,y) (0)
++#define fbcon_decor_active(x,y) (0)
++
++#endif /* CONFIG_FB_CON_DECOR */
++
++#endif /* __FBCON_DECOR_H */
+diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
+index e1f4727..2952e33 100644
+--- a/drivers/video/fbdev/Kconfig
++++ b/drivers/video/fbdev/Kconfig
+@@ -1204,7 +1204,6 @@ config FB_MATROX
+ select FB_CFB_FILLRECT
+ select FB_CFB_COPYAREA
+ select FB_CFB_IMAGEBLIT
+- select FB_TILEBLITTING
+ select FB_MACMODES if PPC_PMAC
+ ---help---
+ Say Y here if you have a Matrox Millennium, Matrox Millennium II,
+diff --git a/drivers/video/fbdev/core/fbcmap.c b/drivers/video/fbdev/core/fbcmap.c
+index f89245b..05e036c 100644
+--- a/drivers/video/fbdev/core/fbcmap.c
++++ b/drivers/video/fbdev/core/fbcmap.c
+@@ -17,6 +17,8 @@
+ #include <linux/slab.h>
+ #include <linux/uaccess.h>
+
++#include "../../console/fbcondecor.h"
++
+ static u16 red2[] __read_mostly = {
+ 0x0000, 0xaaaa
+ };
+@@ -249,14 +251,17 @@ int fb_set_cmap(struct fb_cmap *cmap, struct fb_info *info)
+ if (transp)
+ htransp = *transp++;
+ if (info->fbops->fb_setcolreg(start++,
+- hred, hgreen, hblue,
++ hred, hgreen, hblue,
+ htransp, info))
+ break;
+ }
+ }
+- if (rc == 0)
++ if (rc == 0) {
+ fb_copy_cmap(cmap, &info->cmap);
+-
++ if (fbcon_decor_active(info, vc_cons[fg_console].d) &&
++ info->fix.visual == FB_VISUAL_DIRECTCOLOR)
++ fbcon_decor_fix_pseudo_pal(info, vc_cons[fg_console].d);
++ }
+ return rc;
+ }
+
+diff --git a/drivers/video/fbdev/core/fbmem.c b/drivers/video/fbdev/core/fbmem.c
+index b6d5008..d6703f2 100644
+--- a/drivers/video/fbdev/core/fbmem.c
++++ b/drivers/video/fbdev/core/fbmem.c
+@@ -1250,15 +1250,6 @@ struct fb_fix_screeninfo32 {
+ u16 reserved[3];
+ };
+
+-struct fb_cmap32 {
+- u32 start;
+- u32 len;
+- compat_caddr_t red;
+- compat_caddr_t green;
+- compat_caddr_t blue;
+- compat_caddr_t transp;
+-};
+-
+ static int fb_getput_cmap(struct fb_info *info, unsigned int cmd,
+ unsigned long arg)
+ {
+diff --git a/include/linux/console_decor.h b/include/linux/console_decor.h
+new file mode 100644
+index 0000000..04b8d80
+--- /dev/null
++++ b/include/linux/console_decor.h
+@@ -0,0 +1,46 @@
++#ifndef _LINUX_CONSOLE_DECOR_H_
++#define _LINUX_CONSOLE_DECOR_H_ 1
++
++/* A structure used by the framebuffer console decorations (drivers/video/console/fbcondecor.c) */
++struct vc_decor {
++ __u8 bg_color; /* The color that is to be treated as transparent */
++ __u8 state; /* Current decor state: 0 = off, 1 = on */
++ __u16 tx, ty; /* Top left corner coordinates of the text field */
++ __u16 twidth, theight; /* Width and height of the text field */
++ char* theme;
++};
++
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++#include <linux/compat.h>
++
++struct vc_decor32 {
++ __u8 bg_color; /* The color that is to be treated as transparent */
++ __u8 state; /* Current decor state: 0 = off, 1 = on */
++ __u16 tx, ty; /* Top left corner coordinates of the text field */
++ __u16 twidth, theight; /* Width and height of the text field */
++ compat_uptr_t theme;
++};
++
++#define vc_decor_from_compat(to, from) \
++ (to).bg_color = (from).bg_color; \
++ (to).state = (from).state; \
++ (to).tx = (from).tx; \
++ (to).ty = (from).ty; \
++ (to).twidth = (from).twidth; \
++ (to).theight = (from).theight; \
++ (to).theme = compat_ptr((from).theme)
++
++#define vc_decor_to_compat(to, from) \
++ (to).bg_color = (from).bg_color; \
++ (to).state = (from).state; \
++ (to).tx = (from).tx; \
++ (to).ty = (from).ty; \
++ (to).twidth = (from).twidth; \
++ (to).theight = (from).theight; \
++ (to).theme = ptr_to_compat((from).theme)
++
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
++#endif
+diff --git a/include/linux/console_struct.h b/include/linux/console_struct.h
+index 7f0c329..98f5d60 100644
+--- a/include/linux/console_struct.h
++++ b/include/linux/console_struct.h
+@@ -19,6 +19,7 @@
+ struct vt_struct;
+
+ #define NPAR 16
++#include <linux/console_decor.h>
+
+ struct vc_data {
+ struct tty_port port; /* Upper level data */
+@@ -107,6 +108,8 @@ struct vc_data {
+ unsigned long vc_uni_pagedir;
+ unsigned long *vc_uni_pagedir_loc; /* [!] Location of uni_pagedir variable for this console */
+ bool vc_panic_force_write; /* when oops/panic this VC can accept forced output/blanking */
++
++ struct vc_decor vc_decor;
+ /* additional information is in vt_kern.h */
+ };
+
+diff --git a/include/linux/fb.h b/include/linux/fb.h
+index fe6ac95..1e36b03 100644
+--- a/include/linux/fb.h
++++ b/include/linux/fb.h
+@@ -219,6 +219,34 @@ struct fb_deferred_io {
+ };
+ #endif
+
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++struct fb_image32 {
++ __u32 dx; /* Where to place image */
++ __u32 dy;
++ __u32 width; /* Size of image */
++ __u32 height;
++ __u32 fg_color; /* Only used when a mono bitmap */
++ __u32 bg_color;
++ __u8 depth; /* Depth of the image */
++ const compat_uptr_t data; /* Pointer to image data */
++ struct fb_cmap32 cmap; /* color map info */
++};
++
++#define fb_image_from_compat(to, from) \
++ (to).dx = (from).dx; \
++ (to).dy = (from).dy; \
++ (to).width = (from).width; \
++ (to).height = (from).height; \
++ (to).fg_color = (from).fg_color; \
++ (to).bg_color = (from).bg_color; \
++ (to).depth = (from).depth; \
++ (to).data = compat_ptr((from).data); \
++ fb_cmap_from_compat((to).cmap, (from).cmap)
++
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
+ /*
+ * Frame buffer operations
+ *
+@@ -489,6 +517,9 @@ struct fb_info {
+ #define FBINFO_STATE_SUSPENDED 1
+ u32 state; /* Hardware state i.e suspend */
+ void *fbcon_par; /* fbcon use-only private area */
++
++ struct fb_image bgdecor;
++
+ /* From here on everything is device dependent */
+ void *par;
+ /* we need the PCI or similar aperture base/size not
+diff --git a/include/uapi/linux/fb.h b/include/uapi/linux/fb.h
+index fb795c3..dc77a03 100644
+--- a/include/uapi/linux/fb.h
++++ b/include/uapi/linux/fb.h
+@@ -8,6 +8,25 @@
+
+ #define FB_MAX 32 /* sufficient for now */
+
++struct fbcon_decor_iowrapper
++{
++ unsigned short vc; /* Virtual console */
++ unsigned char origin; /* Point of origin of the request */
++ void *data;
++};
++
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++#include <linux/compat.h>
++struct fbcon_decor_iowrapper32
++{
++ unsigned short vc; /* Virtual console */
++ unsigned char origin; /* Point of origin of the request */
++ compat_uptr_t data;
++};
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
+ /* ioctls
+ 0x46 is 'F' */
+ #define FBIOGET_VSCREENINFO 0x4600
+@@ -35,6 +54,25 @@
+ #define FBIOGET_DISPINFO 0x4618
+ #define FBIO_WAITFORVSYNC _IOW('F', 0x20, __u32)
+
++#define FBIOCONDECOR_SETCFG _IOWR('F', 0x19, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_GETCFG _IOR('F', 0x1A, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_SETSTATE _IOWR('F', 0x1B, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_GETSTATE _IOR('F', 0x1C, struct fbcon_decor_iowrapper)
++#define FBIOCONDECOR_SETPIC _IOWR('F', 0x1D, struct fbcon_decor_iowrapper)
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++#define FBIOCONDECOR_SETCFG32 _IOWR('F', 0x19, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_GETCFG32 _IOR('F', 0x1A, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_SETSTATE32 _IOWR('F', 0x1B, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_GETSTATE32 _IOR('F', 0x1C, struct fbcon_decor_iowrapper32)
++#define FBIOCONDECOR_SETPIC32 _IOWR('F', 0x1D, struct fbcon_decor_iowrapper32)
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
++#define FBCON_DECOR_THEME_LEN 128 /* Maximum length of a theme name */
++#define FBCON_DECOR_IO_ORIG_KERNEL 0 /* Kernel ioctl origin */
++#define FBCON_DECOR_IO_ORIG_USER 1 /* User ioctl origin */
++
+ #define FB_TYPE_PACKED_PIXELS 0 /* Packed Pixels */
+ #define FB_TYPE_PLANES 1 /* Non interleaved planes */
+ #define FB_TYPE_INTERLEAVED_PLANES 2 /* Interleaved planes */
+@@ -277,6 +315,29 @@ struct fb_var_screeninfo {
+ __u32 reserved[4]; /* Reserved for future compatibility */
+ };
+
++#ifdef __KERNEL__
++#ifdef CONFIG_COMPAT
++struct fb_cmap32 {
++ __u32 start;
++ __u32 len; /* Number of entries */
++ compat_uptr_t red; /* Red values */
++ compat_uptr_t green;
++ compat_uptr_t blue;
++ compat_uptr_t transp; /* transparency, can be NULL */
++};
++
++#define fb_cmap_from_compat(to, from) \
++ (to).start = (from).start; \
++ (to).len = (from).len; \
++ (to).red = compat_ptr((from).red); \
++ (to).green = compat_ptr((from).green); \
++ (to).blue = compat_ptr((from).blue); \
++ (to).transp = compat_ptr((from).transp)
++
++#endif /* CONFIG_COMPAT */
++#endif /* __KERNEL__ */
++
++
+ struct fb_cmap {
+ __u32 start; /* First entry */
+ __u32 len; /* Number of entries */
+diff --git a/kernel/sysctl.c b/kernel/sysctl.c
+index 74f5b58..6386ab0 100644
+--- a/kernel/sysctl.c
++++ b/kernel/sysctl.c
+@@ -146,6 +146,10 @@ static const int cap_last_cap = CAP_LAST_CAP;
+ static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
+ #endif
+
++#ifdef CONFIG_FB_CON_DECOR
++extern char fbcon_decor_path[];
++#endif
++
+ #ifdef CONFIG_INOTIFY_USER
+ #include <linux/inotify.h>
+ #endif
+@@ -255,6 +259,15 @@ static struct ctl_table sysctl_base_table[] = {
+ .mode = 0555,
+ .child = dev_table,
+ },
++#ifdef CONFIG_FB_CON_DECOR
++ {
++ .procname = "fbcondecor",
++ .data = &fbcon_decor_path,
++ .maxlen = KMOD_PATH_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_dostring,
++ },
++#endif
+ { }
+ };
+
diff --git a/4500_support-for-pogoplug-e02.patch b/4500_support-for-pogoplug-e02.patch
new file mode 100644
index 0000000..9f0becd
--- /dev/null
+++ b/4500_support-for-pogoplug-e02.patch
@@ -0,0 +1,172 @@
+diff --git a/arch/arm/configs/kirkwood_defconfig b/arch/arm/configs/kirkwood_defconfig
+index 0f2aa61..8c3146b 100644
+--- a/arch/arm/configs/kirkwood_defconfig
++++ b/arch/arm/configs/kirkwood_defconfig
+@@ -20,6 +20,7 @@ CONFIG_MACH_NET2BIG_V2=y
+ CONFIG_MACH_D2NET_V2=y
+ CONFIG_MACH_NET2BIG_V2=y
+ CONFIG_MACH_NET5BIG_V2=y
++CONFIG_MACH_POGO_E02=n
+ CONFIG_MACH_OPENRD_BASE=y
+ CONFIG_MACH_OPENRD_CLIENT=y
+ CONFIG_MACH_OPENRD_ULTIMATE=y
+diff --git a/arch/arm/mach-kirkwood/Kconfig b/arch/arm/mach-kirkwood/Kconfig
+index b634f96..cd7f289 100644
+--- a/arch/arm/mach-kirkwood/Kconfig
++++ b/arch/arm/mach-kirkwood/Kconfig
+@@ -62,6 +62,15 @@ config MACH_NETSPACE_V2
+ Say 'Y' here if you want your kernel to support the
+ LaCie Network Space v2 NAS.
+
++config MACH_POGO_E02
++ bool "CE Pogoplug E02"
++ default n
++ help
++ Say 'Y' here if you want your kernel to support the
++ CloudEngines Pogoplug e02. It differs from Marvell's
++ SheevaPlug Reference Board by a few details, but
++ especially in the LED assignments.
++
+ config MACH_OPENRD
+ bool
+
+diff --git a/arch/arm/mach-kirkwood/Makefile b/arch/arm/mach-kirkwood/Makefile
+index ac4cd75..dddbb40 100644
+--- a/arch/arm/mach-kirkwood/Makefile
++++ b/arch/arm/mach-kirkwood/Makefile
+@@ -2,6 +2,7 @@ obj-y += common.o irq.o pcie.o mpp.o
+ obj-$(CONFIG_MACH_D2NET_V2) += d2net_v2-setup.o lacie_v2-common.o
+ obj-$(CONFIG_MACH_NET2BIG_V2) += netxbig_v2-setup.o lacie_v2-common.o
+ obj-$(CONFIG_MACH_NET5BIG_V2) += netxbig_v2-setup.o lacie_v2-common.o
++obj-$(CONFIG_MACH_POGO_E02) += pogo_e02-setup.o
+ obj-$(CONFIG_MACH_OPENRD) += openrd-setup.o
+ obj-$(CONFIG_MACH_RD88F6192_NAS) += rd88f6192-nas-setup.o
+ obj-$(CONFIG_MACH_RD88F6281) += rd88f6281-setup.o
+diff --git a/arch/arm/mach-kirkwood/pogo_e02-setup.c b/arch/arm/mach-kirkwood/pogo_e02-setup.c
+new file mode 100644
+index 0000000..f57e8f7
+--- /dev/null
++++ b/arch/arm/mach-kirkwood/pogo_e02-setup.c
+@@ -0,0 +1,122 @@
++/*
++ * arch/arm/mach-kirkwood/pogo_e02-setup.c
++ *
++ * CloudEngines Pogoplug E02 support
++ *
++ * Copyright (C) 2013 Christoph Junghans <ottxor@gentoo.org>
++ * Based on a patch in Arch Linux for Arm by:
++ * Copyright (C) 2012 Kevin Mihelich <kevin@miheli.ch>
++ * and <pazos@lavabit.com>
++ *
++ * Based on the board file sheevaplug-setup.c
++ *
++ * This file is licensed under the terms of the GNU General Public
++ * License version 2. This program is licensed "as is" without any
++ * warranty of any kind, whether express or implied.
++ */
++
++#include <linux/kernel.h>
++#include <linux/init.h>
++#include <linux/platform_device.h>
++#include <linux/ata_platform.h>
++#include <linux/mtd/partitions.h>
++#include <linux/mv643xx_eth.h>
++#include <linux/gpio.h>
++#include <linux/leds.h>
++#include <asm/mach-types.h>
++#include <asm/mach/arch.h>
++#include <mach/kirkwood.h>
++#include "common.h"
++#include "mpp.h"
++
++static struct mtd_partition pogo_e02_nand_parts[] = {
++ {
++ .name = "u-boot",
++ .offset = 0,
++ .size = SZ_1M
++ }, {
++ .name = "uImage",
++ .offset = MTDPART_OFS_NXTBLK,
++ .size = SZ_4M
++ }, {
++ .name = "pogoplug",
++ .offset = MTDPART_OFS_NXTBLK,
++ .size = SZ_32M
++ }, {
++ .name = "root",
++ .offset = MTDPART_OFS_NXTBLK,
++ .size = MTDPART_SIZ_FULL
++ },
++};
++
++static struct mv643xx_eth_platform_data pogo_e02_ge00_data = {
++ .phy_addr = MV643XX_ETH_PHY_ADDR(0),
++};
++
++static struct gpio_led pogo_e02_led_pins[] = {
++ {
++ .name = "status:green:health",
++ .default_trigger = "default-on",
++ .gpio = 48,
++ .active_low = 1,
++ },
++ {
++ .name = "status:orange:fault",
++ .default_trigger = "none",
++ .gpio = 49,
++ .active_low = 1,
++ }
++};
++
++static struct gpio_led_platform_data pogo_e02_led_data = {
++ .leds = pogo_e02_led_pins,
++ .num_leds = ARRAY_SIZE(pogo_e02_led_pins),
++};
++
++static struct platform_device pogo_e02_leds = {
++ .name = "leds-gpio",
++ .id = -1,
++ .dev = {
++ .platform_data = &pogo_e02_led_data,
++ }
++};
++
++static unsigned int pogo_e02_mpp_config[] __initdata = {
++ MPP29_GPIO, /* USB Power Enable */
++ MPP48_GPIO, /* LED Green */
++ MPP49_GPIO, /* LED Orange */
++ 0
++};
++
++static void __init pogo_e02_init(void)
++{
++ /*
++ * Basic setup. Needs to be called early.
++ */
++ kirkwood_init();
++
++ /* setup gpio pin select */
++ kirkwood_mpp_conf(pogo_e02_mpp_config);
++
++ kirkwood_uart0_init();
++ kirkwood_nand_init(ARRAY_AND_SIZE(pogo_e02_nand_parts), 25);
++
++ if (gpio_request(29, "USB Power Enable") != 0 ||
++ gpio_direction_output(29, 1) != 0)
++ pr_err("can't set up GPIO 29 (USB Power Enable)\n");
++ kirkwood_ehci_init();
++
++ kirkwood_ge00_init(&pogo_e02_ge00_data);
++
++ platform_device_register(&pogo_e02_leds);
++}
++
++MACHINE_START(POGO_E02, "Pogoplug E02")
++ .atag_offset = 0x100,
++ .init_machine = pogo_e02_init,
++ .map_io = kirkwood_map_io,
++ .init_early = kirkwood_init_early,
++ .init_irq = kirkwood_init_irq,
++ .timer = &kirkwood_timer,
++ .restart = kirkwood_restart,
++MACHINE_END
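As a quick userspace sanity check (not part of the patches above), the FBIOCONDECOR request numbers that 4200_fbcondecor-3.15.patch adds to include/uapi/linux/fb.h can be decoded with the standard Linux `_IOC` helper macros. This is a sketch: the struct below mirrors the `fbcon_decor_iowrapper` definition from the patch rather than including the patched header, and it only inspects the request encoding — it does not talk to the /dev/fbcondecor device.

```c
/* Sketch: decode the fbcondecor ioctl request numbers in userspace.
 * The struct duplicates struct fbcon_decor_iowrapper as added to
 * include/uapi/linux/fb.h by the fbcondecor patch (an assumption:
 * we re-declare it here instead of including the patched header). */
#include <sys/ioctl.h>

struct fbcon_decor_iowrapper {
	unsigned short vc;    /* Virtual console */
	unsigned char origin; /* Point of origin of the request */
	void *data;
};

/* Same encoding as the patch: 'F' group, command 0x1C. */
#define FBIOCONDECOR_GETSTATE _IOR('F', 0x1C, struct fbcon_decor_iowrapper)

/* Extract the ioctl group byte ('F' = frame buffer ioctls). */
static inline unsigned int fbcondecor_ioc_type(unsigned long req)
{
	return _IOC_TYPE(req);
}

/* Extract the command number within the group. */
static inline unsigned int fbcondecor_ioc_nr(unsigned long req)
{
	return _IOC_NR(req);
}
```

Decoding the request this way confirms the patch keeps all decor ioctls in the existing frame-buffer 'F' group, directly after FBIO_WAITFORVSYNC's neighbors, with the argument size baked into the number as usual for `_IOR`/`_IOWR`.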
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-07-15 12:23 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-07-15 12:23 UTC (permalink / raw
To: gentoo-commits
commit: 3fe9f8aab7f5e1262afd9d1f45be1e3d0afe8ce9
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Jul 15 12:22:59 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Jul 15 12:22:59 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=3fe9f8aa
Kernel patch enables gcc optimizations for additional CPUs.
---
0000_README | 4 +
...able-additional-cpu-optimizations-for-gcc.patch | 327 +++++++++++++++++++++
2 files changed, 331 insertions(+)
diff --git a/0000_README b/0000_README
index 6276507..da7da0d 100644
--- a/0000_README
+++ b/0000_README
@@ -71,3 +71,7 @@ Patch: 4567_distro-Gentoo-Kconfig.patch
From: Tom Wijsman <TomWij@gentoo.org>
Desc: Add Gentoo Linux support config settings and defaults.
+Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
+From: https://github.com/graysky2/kernel_gcc_patch/
+Desc: Kernel patch enables gcc optimizations for additional CPUs.
+
diff --git a/5000_enable-additional-cpu-optimizations-for-gcc.patch b/5000_enable-additional-cpu-optimizations-for-gcc.patch
new file mode 100644
index 0000000..f7ab6f0
--- /dev/null
+++ b/5000_enable-additional-cpu-optimizations-for-gcc.patch
@@ -0,0 +1,327 @@
+This patch has been tested on and known to work with kernel versions from 3.2
+up to the latest git version (pulled on 12/14/2013).
+
+This patch will expand the number of microarchitectures to include new
+processors including: AMD K10-family, AMD Family 10h (Barcelona), AMD Family
+14h (Bobcat), AMD Family 15h (Bulldozer), AMD Family 15h (Piledriver), AMD
+Family 16h (Jaguar), Intel 1st Gen Core i3/i5/i7 (Nehalem), Intel 2nd Gen Core
+i3/i5/i7 (Sandybridge), Intel 3rd Gen Core i3/i5/i7 (Ivybridge), and Intel 4th
+Gen Core i3/i5/i7 (Haswell). It also offers the compiler the 'native' flag.
+
+Small but real speed increases are measurable using a make endpoint comparing
+a generic kernel to one built with one of the respective microarchs.
+
+See the following experimental evidence supporting this statement:
+https://github.com/graysky2/kernel_gcc_patch
+
+REQUIREMENTS
+linux version >=3.15
+gcc version <4.9
+
+---
+diff -uprN a/arch/x86/include/asm/module.h b/arch/x86/include/asm/module.h
+--- a/arch/x86/include/asm/module.h 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/include/asm/module.h 2013-12-15 06:21:24.351122516 -0500
+@@ -15,6 +15,16 @@
+ #define MODULE_PROC_FAMILY "586MMX "
+ #elif defined CONFIG_MCORE2
+ #define MODULE_PROC_FAMILY "CORE2 "
++#elif defined CONFIG_MNATIVE
++#define MODULE_PROC_FAMILY "NATIVE "
++#elif defined CONFIG_MCOREI7
++#define MODULE_PROC_FAMILY "COREI7 "
++#elif defined CONFIG_MCOREI7AVX
++#define MODULE_PROC_FAMILY "COREI7AVX "
++#elif defined CONFIG_MCOREAVXI
++#define MODULE_PROC_FAMILY "COREAVXI "
++#elif defined CONFIG_MCOREAVX2
++#define MODULE_PROC_FAMILY "COREAVX2 "
+ #elif defined CONFIG_MATOM
+ #define MODULE_PROC_FAMILY "ATOM "
+ #elif defined CONFIG_M686
+@@ -33,6 +43,18 @@
+ #define MODULE_PROC_FAMILY "K7 "
+ #elif defined CONFIG_MK8
+ #define MODULE_PROC_FAMILY "K8 "
++#elif defined CONFIG_MK10
++#define MODULE_PROC_FAMILY "K10 "
++#elif defined CONFIG_MBARCELONA
++#define MODULE_PROC_FAMILY "BARCELONA "
++#elif defined CONFIG_MBOBCAT
++#define MODULE_PROC_FAMILY "BOBCAT "
++#elif defined CONFIG_MBULLDOZER
++#define MODULE_PROC_FAMILY "BULLDOZER "
++#elif defined CONFIG_MPILEDRIVER
++#define MODULE_PROC_FAMILY "PILEDRIVER "
++#elif defined CONFIG_MJAGUAR
++#define MODULE_PROC_FAMILY "JAGUAR "
+ #elif defined CONFIG_MELAN
+ #define MODULE_PROC_FAMILY "ELAN "
+ #elif defined CONFIG_MCRUSOE
+diff -uprN a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
+--- a/arch/x86/Kconfig.cpu 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/Kconfig.cpu 2013-12-15 06:21:24.351122516 -0500
+@@ -139,7 +139,7 @@ config MPENTIUM4
+
+
+ config MK6
+- bool "K6/K6-II/K6-III"
++ bool "AMD K6/K6-II/K6-III"
+ depends on X86_32
+ ---help---
+ Select this for an AMD K6-family processor. Enables use of
+@@ -147,7 +147,7 @@ config MK6
+ flags to GCC.
+
+ config MK7
+- bool "Athlon/Duron/K7"
++ bool "AMD Athlon/Duron/K7"
+ depends on X86_32
+ ---help---
+ Select this for an AMD Athlon K7-family processor. Enables use of
+@@ -155,12 +155,55 @@ config MK7
+ flags to GCC.
+
+ config MK8
+- bool "Opteron/Athlon64/Hammer/K8"
++ bool "AMD Opteron/Athlon64/Hammer/K8"
+ ---help---
+ Select this for an AMD Opteron or Athlon64 Hammer-family processor.
+ Enables use of some extended instructions, and passes appropriate
+ optimization flags to GCC.
+
++config MK10
++ bool "AMD 61xx/7x50/PhenomX3/X4/II/K10"
++ ---help---
++ Select this for an AMD 61xx Eight-Core Magny-Cours, Athlon X2 7x50,
++ Phenom X3/X4/II, Athlon II X2/X3/X4, or Turion II-family processor.
++ Enables use of some extended instructions, and passes appropriate
++ optimization flags to GCC.
++
++config MBARCELONA
++ bool "AMD Barcelona"
++ ---help---
++ Select this for AMD Barcelona and newer processors.
++
++ Enables -march=barcelona
++
++config MBOBCAT
++ bool "AMD Bobcat"
++ ---help---
++ Select this for AMD Bobcat processors.
++
++ Enables -march=btver1
++
++config MBULLDOZER
++ bool "AMD Bulldozer"
++ ---help---
++ Select this for AMD Bulldozer processors.
++
++ Enables -march=bdver1
++
++config MPILEDRIVER
++ bool "AMD Piledriver"
++ ---help---
++ Select this for AMD Piledriver processors.
++
++ Enables -march=bdver2
++
++config MJAGUAR
++ bool "AMD Jaguar"
++ ---help---
++ Select this for AMD Jaguar processors.
++
++ Enables -march=btver2
++
+ config MCRUSOE
+ bool "Crusoe"
+ depends on X86_32
+@@ -251,8 +294,17 @@ config MPSC
+ using the cpu family field
+ in /proc/cpuinfo. Family 15 is an older Xeon, Family 6 a newer one.
+
++config MATOM
++ bool "Intel Atom"
++ ---help---
++
++ Select this for the Intel Atom platform. Intel Atom CPUs have an
++ in-order pipelining architecture and thus can benefit from
++ accordingly optimized code. Use a recent GCC with specific Atom
++ support in order to fully benefit from selecting this option.
++
+ config MCORE2
+- bool "Core 2/newer Xeon"
++ bool "Intel Core 2"
+ ---help---
+
+ Select this for Intel Core 2 and newer Core 2 Xeons (Xeon 51xx and
+@@ -260,14 +312,40 @@ config MCORE2
+ family in /proc/cpuinfo. Newer ones have 6 and older ones 15
+ (not a typo)
+
+-config MATOM
+- bool "Intel Atom"
++ Enables -march=core2
++
++config MCOREI7
++ bool "Intel Core i7"
+ ---help---
+
+- Select this for the Intel Atom platform. Intel Atom CPUs have an
+- in-order pipelining architecture and thus can benefit from
+- accordingly optimized code. Use a recent GCC with specific Atom
+- support in order to fully benefit from selecting this option.
++ Select this for the Intel Nehalem platform. Intel Nehalem processors
++ include Core i3, i5, i7, Xeon: 34xx, 35xx, 55xx, 56xx, 75xx processors.
++
++ Enables -march=corei7
++
++config MCOREI7AVX
++ bool "Intel Core 2nd Gen AVX"
++ ---help---
++
++ Select this for 2nd Gen Core processors including Sandy Bridge.
++
++ Enables -march=corei7-avx
++
++config MCOREAVXI
++ bool "Intel Core 3rd Gen AVX"
++ ---help---
++
++ Select this for 3rd Gen Core processors including Ivy Bridge.
++
++ Enables -march=core-avx-i
++
++config MCOREAVX2
++ bool "Intel Core AVX2"
++ ---help---
++
++ Select this for AVX2 enabled processors including Haswell.
++
++ Enables -march=core-avx2
+
+ config GENERIC_CPU
+ bool "Generic-x86-64"
+@@ -276,6 +354,19 @@ config GENERIC_CPU
+ Generic x86-64 CPU.
+ Run equally well on all x86-64 CPUs.
+
++config MNATIVE
++ bool "Native optimizations autodetected by GCC"
++ ---help---
++
++ GCC 4.2 and above support -march=native, which automatically detects
++ the optimum settings to use based on your processor. -march=native
++ also detects and applies additional settings beyond -march specific
++ to your CPU (e.g. -msse4). Unless you have a specific reason not to
++ (e.g. distcc cross-compiling), you should probably be using
++ -march=native rather than anything listed below.
++
++ Enables -march=native
++
+ endchoice
+
+ config X86_GENERIC
+@@ -300,7 +391,7 @@ config X86_INTERNODE_CACHE_SHIFT
+ config X86_L1_CACHE_SHIFT
+ int
+ default "7" if MPENTIUM4 || MPSC
+- default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
++ default "6" if MK7 || MK8 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MJAGUAR || MPENTIUMM || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MATOM || MVIAC7 || X86_GENERIC || MNATIVE || GENERIC_CPU
+ default "4" if MELAN || M486 || MGEODEGX1
+ default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
+
+@@ -331,11 +422,11 @@ config X86_ALIGNMENT_16
+
+ config X86_INTEL_USERCOPY
+ def_bool y
+- depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || X86_GENERIC || MK8 || MK7 || MEFFICEON || MCORE2
++ depends on MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M586MMX || MNATIVE || X86_GENERIC || MK8 || MK7 || MK10 || MBARCELONA || MEFFICEON || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2
+
+ config X86_USE_PPRO_CHECKSUM
+ def_bool y
+- depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MATOM
++ depends on MWINCHIP3D || MWINCHIPC6 || MCYRIXIII || MK7 || MK6 || MK10 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MK8 || MVIAC3_2 || MVIAC7 || MEFFICEON || MGEODE_LX || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MATOM || MNATIVE
+
+ config X86_USE_3DNOW
+ def_bool y
+@@ -363,17 +454,17 @@ config X86_P6_NOP
+
+ config X86_TSC
+ def_bool y
+- depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MATOM) || X86_64
++ depends on (MWINCHIP3D || MCRUSOE || MEFFICEON || MCYRIXIII || MK7 || MK6 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || MK8 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MJAGUAR || MVIAC3_2 || MVIAC7 || MGEODEGX1 || MGEODE_LX || MCORE2 || MCOREI7 || MCOREI7AVX || MATOM) || X86_64 || MNATIVE
+
+ config X86_CMPXCHG64
+ def_bool y
+- depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM
++ depends on X86_PAE || X86_64 || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM || MNATIVE
+
+ # this should be set for all -march=.. options where the compiler
+ # generates cmov.
+ config X86_CMOV
+ def_bool y
+- depends on (MK8 || MK7 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MATOM || MGEODE_LX)
++ depends on (MK8 || MK10 || MBARCELONA || MBOBCAT || MBULLDOZER || MPILEDRIVER || MJAGUAR || MK7 || MCORE2 || MCOREI7 || MCOREI7AVX || MCOREAVXI || MCOREAVX2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MVIAC3_2 || MVIAC7 || MCRUSOE || MEFFICEON || X86_64 || MNATIVE || MATOM || MGEODE_LX)
+
+ config X86_MINIMUM_CPU_FAMILY
+ int
+diff -uprN a/arch/x86/Makefile b/arch/x86/Makefile
+--- a/arch/x86/Makefile 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/Makefile 2013-12-15 06:21:24.354455723 -0500
+@@ -61,11 +61,26 @@ else
+ KBUILD_CFLAGS += $(call cc-option,-mno-sse -mpreferred-stack-boundary=3)
+
+ # FIXME - should be integrated in Makefile.cpu (Makefile_32.cpu)
++ cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native)
+ cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8)
++ cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10)
++ cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona)
++ cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1)
++ cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1)
++ cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2)
++ cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2)
+ cflags-$(CONFIG_MPSC) += $(call cc-option,-march=nocona)
+
+ cflags-$(CONFIG_MCORE2) += \
+- $(call cc-option,-march=core2,$(call cc-option,-mtune=generic))
++ $(call cc-option,-march=core2,$(call cc-option,-mtune=core2))
++ cflags-$(CONFIG_MCOREI7) += \
++ $(call cc-option,-march=corei7,$(call cc-option,-mtune=corei7))
++ cflags-$(CONFIG_MCOREI7AVX) += \
++ $(call cc-option,-march=corei7-avx,$(call cc-option,-mtune=corei7-avx))
++ cflags-$(CONFIG_MCOREAVXI) += \
++ $(call cc-option,-march=core-avx-i,$(call cc-option,-mtune=core-avx-i))
++ cflags-$(CONFIG_MCOREAVX2) += \
++ $(call cc-option,-march=core-avx2,$(call cc-option,-mtune=core-avx2))
+ cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom) \
+ $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
+ cflags-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=generic)
+diff -uprN a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
+--- a/arch/x86/Makefile_32.cpu 2013-11-03 18:41:51.000000000 -0500
++++ b/arch/x86/Makefile_32.cpu 2013-12-15 06:21:24.354455723 -0500
+@@ -23,7 +23,14 @@ cflags-$(CONFIG_MK6) += -march=k6
+ # Please note, that patches that add -march=athlon-xp and friends are pointless.
+ # They make zero difference whatsosever to performance at this time.
+ cflags-$(CONFIG_MK7) += -march=athlon
++cflags-$(CONFIG_MNATIVE) += $(call cc-option,-march=native)
+ cflags-$(CONFIG_MK8) += $(call cc-option,-march=k8,-march=athlon)
++cflags-$(CONFIG_MK10) += $(call cc-option,-march=amdfam10,-march=athlon)
++cflags-$(CONFIG_MBARCELONA) += $(call cc-option,-march=barcelona,-march=athlon)
++cflags-$(CONFIG_MBOBCAT) += $(call cc-option,-march=btver1,-march=athlon)
++cflags-$(CONFIG_MBULLDOZER) += $(call cc-option,-march=bdver1,-march=athlon)
++cflags-$(CONFIG_MPILEDRIVER) += $(call cc-option,-march=bdver2,-march=athlon)
++cflags-$(CONFIG_MJAGUAR) += $(call cc-option,-march=btver2,-march=athlon)
+ cflags-$(CONFIG_MCRUSOE) += -march=i686 $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0
+ cflags-$(CONFIG_MEFFICEON) += -march=i686 $(call tune,pentium3) $(align)-functions=0 $(align)-jumps=0 $(align)-loops=0
+ cflags-$(CONFIG_MWINCHIPC6) += $(call cc-option,-march=winchip-c6,-march=i586)
+@@ -32,6 +39,10 @@ cflags-$(CONFIG_MCYRIXIII) += $(call cc-
+ cflags-$(CONFIG_MVIAC3_2) += $(call cc-option,-march=c3-2,-march=i686)
+ cflags-$(CONFIG_MVIAC7) += -march=i686
+ cflags-$(CONFIG_MCORE2) += -march=i686 $(call tune,core2)
++cflags-$(CONFIG_MCOREI7) += -march=i686 $(call tune,corei7)
++cflags-$(CONFIG_MCOREI7AVX) += -march=i686 $(call tune,corei7-avx)
++cflags-$(CONFIG_MCOREAVXI) += -march=i686 $(call tune,core-avx-i)
++cflags-$(CONFIG_MCOREAVX2) += -march=i686 $(call tune,core-avx2)
+ cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom,$(call cc-option,-march=core2,-march=i686)) \
+ $(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))
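The `$(call cc-option,<flag>,<fallback>)` pattern used throughout the hunks above lets Kbuild fall back gracefully when the installed compiler rejects a given `-march` value. A minimal stand-alone sketch of that probe (the `cc_option` function name and the `$CC` handling here are illustrative, not Kbuild's actual implementation):

```shell
# Probe whether the compiler accepts a flag; print it if so, else print
# the fallback (which may be empty). Mirrors the Kbuild cc-option idiom.
cc_option() {
    # $1 = flag to try, $2 = fallback
    if ${CC:-cc} "$1" -E -x c /dev/null >/dev/null 2>&1; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
}

# e.g. prefer -march=btver2 (Jaguar), fall back to -march=athlon
# on compilers too old to know btver2:
MARCH=$(cc_option -march=btver2 -march=athlon)
```

Probing with a preprocessor-only run (`-E`) keeps the test cheap: no object file is produced, only the flag parsing is exercised.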
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-08-08 19:48 Mike Pagano
2014-08-19 11:44 ` Mike Pagano
0 siblings, 1 reply; 26+ messages in thread
From: Mike Pagano @ 2014-08-08 19:48 UTC (permalink / raw
To: gentoo-commits
commit: 9df8c18cd85acf5655794c6de5da3a0690675965
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Fri Aug 8 19:48:09 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Fri Aug 8 19:48:09 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=9df8c18c
BFQ patch for 3.16
---
0000_README | 11 +
...-cgroups-kconfig-build-bits-for-v7r5-3.16.patch | 104 +
...ck-introduce-the-v7r5-I-O-sched-for-3.16.patch1 | 6635 ++++++++++++++++++++
...add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch | 1188 ++++
4 files changed, 7938 insertions(+)
diff --git a/0000_README b/0000_README
index da7da0d..a6ec2e6 100644
--- a/0000_README
+++ b/0000_README
@@ -75,3 +75,14 @@ Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
From: https://github.com/graysky2/kernel_gcc_patch/
Desc: Kernel patch enables gcc optimizations for additional CPUs.
+Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 1 for 3.16: Build, cgroups and kconfig bits
+
+Patch: 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
+
+Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
diff --git a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
new file mode 100644
index 0000000..088bd05
--- /dev/null
+++ b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
@@ -0,0 +1,104 @@
+From 6519e5beef1063a86d3fc917cff2592cb599e824 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 22 May 2014 11:59:35 +0200
+Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r5-3.16
+
+Update Kconfig.iosched and do the related Makefile changes to include
+kernel configuration options for BFQ. Also add the bfqio controller
+to the cgroups subsystem.
+
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/Kconfig.iosched | 32 ++++++++++++++++++++++++++++++++
+ block/Makefile | 1 +
+ include/linux/cgroup_subsys.h | 4 ++++
+ 3 files changed, 37 insertions(+)
+
+diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
+index 421bef9..0ee5f0f 100644
+--- a/block/Kconfig.iosched
++++ b/block/Kconfig.iosched
+@@ -39,6 +39,27 @@ config CFQ_GROUP_IOSCHED
+ ---help---
+ Enable group IO scheduling in CFQ.
+
++config IOSCHED_BFQ
++ tristate "BFQ I/O scheduler"
++ default n
++ ---help---
++ The BFQ I/O scheduler tries to distribute bandwidth among
++ all processes according to their weights.
++ It aims at distributing the bandwidth as desired, independently of
++ the disk parameters and with any workload. It also tries to
++ guarantee low latency to interactive and soft real-time
++ applications. If compiled built-in (saying Y here), BFQ can
++ be configured to support hierarchical scheduling.
++
++config CGROUP_BFQIO
++ bool "BFQ hierarchical scheduling support"
++ depends on CGROUPS && IOSCHED_BFQ=y
++ default n
++ ---help---
++ Enable hierarchical scheduling in BFQ, using the cgroups
++ filesystem interface. The name of the subsystem will be
++ bfqio.
++
+ choice
+ prompt "Default I/O scheduler"
+ default DEFAULT_CFQ
+@@ -52,6 +73,16 @@ choice
+ config DEFAULT_CFQ
+ bool "CFQ" if IOSCHED_CFQ=y
+
++ config DEFAULT_BFQ
++ bool "BFQ" if IOSCHED_BFQ=y
++ help
++ Selects BFQ as the default I/O scheduler, used for all
++ block devices.
++ The BFQ I/O scheduler aims at distributing the bandwidth
++ as desired, independently of the disk parameters and with
++ any workload. It also tries to guarantee low latency to
++ interactive and soft real-time applications.
++
+ config DEFAULT_NOOP
+ bool "No-op"
+
+@@ -61,6 +92,7 @@ config DEFAULT_IOSCHED
+ string
+ default "deadline" if DEFAULT_DEADLINE
+ default "cfq" if DEFAULT_CFQ
++ default "bfq" if DEFAULT_BFQ
+ default "noop" if DEFAULT_NOOP
+
+ endmenu
+diff --git a/block/Makefile b/block/Makefile
+index a2ce6ac..a0fc06a 100644
+--- a/block/Makefile
++++ b/block/Makefile
+@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
+ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
+ obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
+ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
++obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o
+
+ obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+ obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
+index 98c4f9b..13b010d 100644
+--- a/include/linux/cgroup_subsys.h
++++ b/include/linux/cgroup_subsys.h
+@@ -35,6 +35,10 @@ SUBSYS(net_cls)
+ SUBSYS(blkio)
+ #endif
+
++#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
++SUBSYS(bfqio)
++#endif
++
+ #if IS_ENABLED(CONFIG_CGROUP_PERF)
+ SUBSYS(perf_event)
+ #endif
+--
+2.0.3
+
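With the Kconfig and Makefile bits above in place, the resulting scheduler can also be inspected and switched per device at runtime through sysfs. A hedged sketch (the `sda` device name is illustrative, and `$SYSFS` is overridable here purely so the functions can be exercised outside a real `/sys`; the `/sys/block/<dev>/queue/scheduler` path itself is the standard interface):

```shell
# Per-device I/O scheduler selection via the standard sysfs interface.
# SYSFS defaults to /sys; overriding it is only for illustration/testing.
SYSFS=${SYSFS:-/sys}

get_iosched() {
    # $1 = block device name; prints e.g. "noop deadline [cfq] bfq"
    cat "$SYSFS/block/$1/queue/scheduler"
}

set_iosched() {
    # $1 = block device name, $2 = scheduler (e.g. bfq, cfq, deadline, noop)
    echo "$2" > "$SYSFS/block/$1/queue/scheduler"
}

# e.g. (as root):  set_iosched sda bfq
```

Reading the file back shows the active scheduler in square brackets; writing a name only succeeds if the corresponding `IOSCHED_*` option was built in or loaded as a module.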
diff --git a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1 b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
new file mode 100644
index 0000000..6f630ba
--- /dev/null
+++ b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
@@ -0,0 +1,6635 @@
+From c56e6c5db41f7137d3e0b38063ef0c944eec1898 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 9 May 2013 19:10:02 +0200
+Subject: [PATCH 2/3] block: introduce the BFQ-v7r5 I/O sched for 3.16
+
+Add the BFQ-v7r5 I/O scheduler to 3.16.
+The general structure is borrowed from CFQ, as is much of the code for
+handling I/O contexts. Over time, several useful features have been
+ported from CFQ as well (details in the changelog in README.BFQ). A
+(bfq_)queue is associated to each task doing I/O on a device, and each
+time a scheduling decision has to be made a queue is selected and served
+until it expires.
+
+ - Slices are given in the service domain: tasks are assigned
+ budgets, measured in number of sectors. Once it is granted the disk,
+ a task must however consume its assigned budget within a configurable
+ maximum time (by default, the maximum possible budget value is
+ computed automatically to comply with this timeout).
+ This allows the desired latency vs "throughput boosting" tradeoff
+ to be set.
+
+ - Budgets are scheduled according to a variant of WF2Q+, implemented
+ using an augmented rb-tree to take eligibility into account while
+ preserving an O(log N) overall complexity.
+
+ - A low-latency tunable is provided; if enabled, both interactive
+ and soft real-time applications are guaranteed a very low latency.
+
+ - Latency guarantees are preserved also in the presence of NCQ.
+
+ - Also with flash-based devices, a high throughput is achieved
+ while still preserving latency guarantees.
+
+ - BFQ features Early Queue Merge (EQM), a sort of fusion of the
+ cooperating-queue-merging and the preemption mechanisms present
+ in CFQ. EQM is in fact a unified mechanism that tries to get a
+ sequential read pattern, and hence a high throughput, with any
+ set of processes performing interleaved I/O over a contiguous
+ sequence of sectors.
+
+ - BFQ supports full hierarchical scheduling, exporting a cgroups
+ interface. Since each node has a full scheduler, each group can
+ be assigned its own weight.
+
+ - If the cgroups interface is not used, only I/O priorities can be
+ assigned to processes, with ioprio values mapped to weights
+ with the relation weight = IOPRIO_BE_NR - ioprio.
+
+ - ioprio classes are served in strict priority order, i.e., lower
+ priority queues are not served as long as there are higher
+ priority queues. Among queues in the same class the bandwidth is
+ distributed in proportion to the weight of each queue. A very
+ thin extra bandwidth is however guaranteed to the Idle class, to
+ prevent it from starving.
+
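The ioprio-to-weight relation stated above (weight = IOPRIO_BE_NR - ioprio, where IOPRIO_BE_NR is 8 in the kernel's ioprio definitions) can be sketched as a toy illustration; this is not BFQ's in-kernel helper, just the arithmetic from the commit message:

```shell
# Toy illustration of the mapping described in the commit message.
# IOPRIO_BE_NR is 8; a lower ioprio value means higher priority,
# which maps to a larger BFQ weight.
IOPRIO_BE_NR=8

ioprio_to_weight() {
    # $1 = best-effort ioprio (0..7)
    echo $((IOPRIO_BE_NR - $1))
}

ioprio_to_weight 0   # highest best-effort priority -> weight 8
ioprio_to_weight 7   # lowest  best-effort priority -> weight 1
```

So when the cgroups interface is not used, the eight best-effort ioprio levels collapse onto weights 8 down to 1, and bandwidth within a class is shared in proportion to those weights.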
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/bfq-cgroup.c | 930 +++++++++++++
+ block/bfq-ioc.c | 36 +
+ block/bfq-iosched.c | 3617 +++++++++++++++++++++++++++++++++++++++++++++++++++
+ block/bfq-sched.c | 1207 +++++++++++++++++
+ block/bfq.h | 742 +++++++++++
+ 5 files changed, 6532 insertions(+)
+ create mode 100644 block/bfq-cgroup.c
+ create mode 100644 block/bfq-ioc.c
+ create mode 100644 block/bfq-iosched.c
+ create mode 100644 block/bfq-sched.c
+ create mode 100644 block/bfq.h
+
+diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
+new file mode 100644
+index 0000000..f742806
+--- /dev/null
++++ b/block/bfq-cgroup.c
+@@ -0,0 +1,930 @@
++/*
++ * BFQ: CGROUPS support.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++
++static DEFINE_MUTEX(bfqio_mutex);
++
++static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
++{
++ return bgrp ? !bgrp->online : false;
++}
++
++static struct bfqio_cgroup bfqio_root_cgroup = {
++ .weight = BFQ_DEFAULT_GRP_WEIGHT,
++ .ioprio = BFQ_DEFAULT_GRP_IOPRIO,
++ .ioprio_class = BFQ_DEFAULT_GRP_CLASS,
++};
++
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
++{
++ return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
++}
++
++/*
++ * Search bgrp's hash table (for now just a list) for the bfq_group
++ * associated with bfqd. Must be called under rcu_read_lock().
++ */
++static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
++ struct bfq_data *bfqd)
++{
++ struct bfq_group *bfqg;
++ void *key;
++
++ hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
++ key = rcu_dereference(bfqg->bfqd);
++ if (key == bfqd)
++ return bfqg;
++ }
++
++ return NULL;
++}
++
++static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
++ struct bfq_group *bfqg)
++{
++ struct bfq_entity *entity = &bfqg->entity;
++
++ /*
++ * If the weight of the entity has never been set via the sysfs
++ * interface, then bgrp->weight == 0. In this case we initialize
++ * the weight from the current ioprio value. Otherwise, the group
++ * weight, if set, has priority over the ioprio value.
++ */
++ if (bgrp->weight == 0) {
++ entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
++ entity->new_ioprio = bgrp->ioprio;
++ } else {
++ entity->new_weight = bgrp->weight;
++ entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
++ }
++ entity->orig_weight = entity->weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
++ entity->my_sched_data = &bfqg->sched_data;
++ bfqg->active_entities = 0;
++}
++
++static inline void bfq_group_set_parent(struct bfq_group *bfqg,
++ struct bfq_group *parent)
++{
++ struct bfq_entity *entity;
++
++ BUG_ON(parent == NULL);
++ BUG_ON(bfqg == NULL);
++
++ entity = &bfqg->entity;
++ entity->parent = parent->my_entity;
++ entity->sched_data = &parent->sched_data;
++}
++
++/**
++ * bfq_group_chain_alloc - allocate a chain of groups.
++ * @bfqd: queue descriptor.
++ * @css: the leaf cgroup_subsys_state this chain starts from.
++ *
++ * Allocate a chain of groups starting from the one belonging to
++ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
++ * to the root already has an allocated group on @bfqd.
++ */
++static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
++
++ for (; css != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL) {
++ /*
++ * All the cgroups in the path from there to the
++ * root must have a bfq_group for bfqd, so we don't
++ * need any more allocations.
++ */
++ break;
++ }
++
++ bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
++ if (bfqg == NULL)
++ goto cleanup;
++
++ bfq_group_init_entity(bgrp, bfqg);
++ bfqg->my_entity = &bfqg->entity;
++
++ if (leaf == NULL) {
++ leaf = bfqg;
++ prev = leaf;
++ } else {
++ bfq_group_set_parent(prev, bfqg);
++ /*
++ * Build a list of allocated nodes using the bfqd
++ * field, which is still unused and will be
++ * initialized only after the node is
++ * connected.
++ */
++ prev->bfqd = bfqg;
++ prev = bfqg;
++ }
++ }
++
++ return leaf;
++
++cleanup:
++ while (leaf != NULL) {
++ prev = leaf;
++ leaf = leaf->bfqd;
++ kfree(prev);
++ }
++
++ return NULL;
++}
++
++/**
++ * bfq_group_chain_link - link an allocated group chain to a cgroup
++ * hierarchy.
++ * @bfqd: the queue descriptor.
++ * @css: the leaf cgroup_subsys_state to start from.
++ * @leaf: the leaf group (to be associated to @cgroup).
++ *
++ * Try to link a chain of groups to a cgroup hierarchy, connecting the
++ * nodes bottom-up, so we can be sure that when we find a cgroup in the
++ * hierarchy that already has a group associated to @bfqd, all the nodes
++ * in the path to the root cgroup have one too.
++ *
++ * On locking: the queue lock protects the hierarchy (there is a hierarchy
++ * per device) while the bfqio_cgroup lock protects the list of groups
++ * belonging to the same cgroup.
++ */
++static void bfq_group_chain_link(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css,
++ struct bfq_group *leaf)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *next, *prev = NULL;
++ unsigned long flags;
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++
++ for (; css != NULL && leaf != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++ next = leaf->bfqd;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ BUG_ON(bfqg != NULL);
++
++ spin_lock_irqsave(&bgrp->lock, flags);
++
++ rcu_assign_pointer(leaf->bfqd, bfqd);
++ hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
++ hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
++
++ spin_unlock_irqrestore(&bgrp->lock, flags);
++
++ prev = leaf;
++ leaf = next;
++ }
++
++ BUG_ON(css == NULL && leaf != NULL);
++ if (css != NULL && prev != NULL) {
++ bgrp = css_to_bfqio(css);
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ bfq_group_set_parent(prev, bfqg);
++ }
++}
++
++/**
++ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
++ * @bfqd: queue descriptor.
++ * @cgroup: cgroup being searched for.
++ *
++ * Return a group associated to @bfqd in @cgroup, allocating one if
++ * necessary. When a group is returned all the cgroups in the path
++ * to the root have a group associated to @bfqd.
++ *
++ * If the allocation fails, return the root group: this breaks guarantees
++ * but is a safe fallback. If this loss becomes a problem it can be
++ * mitigated using the equivalent weight (given by the product of the
++ * weights of the groups in the path from @group to the root) in the
++ * root scheduler.
++ *
++ * We allocate all the missing nodes in the path from the leaf cgroup
++ * to the root and we connect the nodes only after all the allocations
++ * have been successful.
++ */
++static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct bfq_group *bfqg;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL)
++ return bfqg;
++
++ bfqg = bfq_group_chain_alloc(bfqd, css);
++ if (bfqg != NULL)
++ bfq_group_chain_link(bfqd, css, bfqg);
++ else
++ bfqg = bfqd->root_group;
++
++ return bfqg;
++}
++
++/**
++ * bfq_bfqq_move - migrate @bfqq to @bfqg.
++ * @bfqd: queue descriptor.
++ * @bfqq: the queue to move.
++ * @entity: @bfqq's entity.
++ * @bfqg: the group to move to.
++ *
++ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
++ * it on the new one. Avoid putting the entity on the old group idle tree.
++ *
++ * Must be called under the queue lock; the cgroup owning @bfqg must
++ * not disappear (by now this just means that we are called under
++ * rcu_read_lock()).
++ */
++static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct bfq_entity *entity, struct bfq_group *bfqg)
++{
++ int busy, resume;
++
++ busy = bfq_bfqq_busy(bfqq);
++ resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
++
++ BUG_ON(resume && !entity->on_st);
++ BUG_ON(busy && !resume && entity->on_st &&
++ bfqq != bfqd->in_service_queue);
++
++ if (busy) {
++ BUG_ON(atomic_read(&bfqq->ref) < 2);
++
++ if (!resume)
++ bfq_del_bfqq_busy(bfqd, bfqq, 0);
++ else
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++ } else if (entity->on_st)
++ bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
++
++ /*
++ * Here we use a reference to bfqg. We don't need a refcounter
++ * as the cgroup reference will not be dropped, so that its
++ * destroy() callback will not be invoked.
++ */
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++
++ if (busy && resume)
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++}
++
++/**
++ * __bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bfqd: the queue descriptor.
++ * @bic: the bic to move.
++ * @cgroup: the cgroup to move to.
++ *
++ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
++ * has to make sure that the reference to cgroup is valid across the call.
++ *
++ * NOTE: an alternative approach might have been to store the current
++ * cgroup in bfqq and getting a reference to it, reducing the lookup
++ * time here, at the price of slightly more complex code.
++ */
++static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
++ struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
++ struct bfq_entity *entity;
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfq_find_alloc_group(bfqd, css);
++ if (async_bfqq != NULL) {
++ entity = &async_bfqq->entity;
++
++ if (entity->sched_data != &bfqg->sched_data) {
++ bic_set_bfqq(bic, NULL, 0);
++ bfq_log_bfqq(bfqd, async_bfqq,
++ "bic_change_group: %p %d",
++ async_bfqq, atomic_read(&async_bfqq->ref));
++ bfq_put_queue(async_bfqq);
++ }
++ }
++
++ if (sync_bfqq != NULL) {
++ entity = &sync_bfqq->entity;
++ if (entity->sched_data != &bfqg->sched_data)
++ bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
++ }
++
++ return bfqg;
++}
++
++/**
++ * bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bic: the bic being migrated.
++ * @cgroup: the destination cgroup.
++ *
++ * When the task owning @bic is moved to @cgroup, @bic is immediately
++ * moved into its new parent group.
++ */
++static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_data *bfqd;
++ unsigned long uninitialized_var(flags);
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ if (bfqd != NULL) {
++ __bfq_bic_change_cgroup(bfqd, bic, css);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++}
++
++/**
++ * bfq_bic_update_cgroup - update the cgroup of @bic.
++ * @bic: the @bic to update.
++ *
++ * Make sure that @bic is enqueued in the cgroup of the current task.
++ * We need this in addition to moving bics during the cgroup attach
++ * phase because the task owning @bic could be at its first disk
++ * access or we may end up in the root cgroup as the result of a
++ * memory allocation failure and here we try to move to the right
++ * group.
++ *
++ * Must be called under the queue lock. It is safe to use the returned
++ * value even after the rcu_read_unlock() as the migration/destruction
++ * paths act under the queue lock too. IOW it is impossible to race with
++ * group migration/destruction and end up with an invalid group as:
++ * a) here cgroup has not yet been destroyed, nor its destroy callback
++ * has started execution, as current holds a reference to it,
++ * b) if it is destroyed after rcu_read_unlock() [after current is
++ * migrated to a different cgroup] its attach() callback will have
++ * taken care of removing all the references to the old cgroup data.
++ */
++static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ struct bfq_group *bfqg;
++ struct cgroup_subsys_state *css;
++
++ BUG_ON(bfqd == NULL);
++
++ rcu_read_lock();
++ css = task_css(current, bfqio_cgrp_id);
++ bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
++ rcu_read_unlock();
++
++ return bfqg;
++}
++
++/**
++ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
++ * @st: the service tree being flushed.
++ */
++static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entity = st->first_idle;
++
++ for (; entity != NULL; entity = st->first_idle)
++ __bfq_deactivate_entity(entity, 0);
++}
++
++/**
++ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
++ * @bfqd: the device data structure with the root group.
++ * @entity: the entity to move.
++ */
++static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(bfqq == NULL);
++ bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
++ return;
++}
++
++/**
++ * bfq_reparent_active_entities - move all active entities to the
++ * root group.
++ * @bfqd: the device data structure with the root group.
++ * @bfqg: the group to move from.
++ * @st: the service tree with the entities.
++ *
++ * Needs queue_lock to be taken and reference to be valid over the call.
++ */
++static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ struct bfq_service_tree *st)
++{
++ struct rb_root *active = &st->active;
++ struct bfq_entity *entity = NULL;
++
++ if (!RB_EMPTY_ROOT(&st->active))
++ entity = bfq_entity_of(rb_first(active));
++
++ for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
++ bfq_reparent_leaf_entity(bfqd, entity);
++
++ if (bfqg->sched_data.in_service_entity != NULL)
++ bfq_reparent_leaf_entity(bfqd,
++ bfqg->sched_data.in_service_entity);
++
++ return;
++}
++
++/**
++ * bfq_destroy_group - destroy @bfqg.
++ * @bgrp: the bfqio_cgroup containing @bfqg.
++ * @bfqg: the group being destroyed.
++ *
++ * Destroy @bfqg, making sure that it is not referenced from its parent.
++ */
++static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
++{
++ struct bfq_data *bfqd;
++ struct bfq_service_tree *st;
++ struct bfq_entity *entity = bfqg->my_entity;
++ unsigned long uninitialized_var(flags);
++ int i;
++
++ hlist_del(&bfqg->group_node);
++
++ /*
++ * Empty all service_trees belonging to this group before
++ * deactivating the group itself.
++ */
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
++ st = bfqg->sched_data.service_tree + i;
++
++ /*
++ * The idle tree may still contain bfq_queues belonging
++ * to exited tasks because they never migrated to a different
++ * cgroup from the one being destroyed now. No one else
++ * can access them so it's safe to act without any lock.
++ */
++ bfq_flush_idle_tree(st);
++
++ /*
++ * It may happen that some queues are still active
++ * (busy) upon group destruction (if the corresponding
++ * processes have been forced to terminate). We move
++ * all the leaf entities corresponding to these queues
++ * to the root_group.
++ * Also, it may happen that the group has an entity
++ * in service, which is disconnected from the active
++ * tree: it must be moved, too.
++ * There is no need to put the sync queues, as the
++ * scheduler has taken no reference.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ bfq_reparent_active_entities(bfqd, bfqg, st);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(!RB_EMPTY_ROOT(&st->active));
++ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
++ }
++ BUG_ON(bfqg->sched_data.next_in_service != NULL);
++ BUG_ON(bfqg->sched_data.in_service_entity != NULL);
++
++ /*
++ * We may race with device destruction, take extra care when
++ * dereferencing bfqg->bfqd.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ hlist_del(&bfqg->bfqd_node);
++ __bfq_deactivate_entity(entity, 0);
++ bfq_put_async_queues(bfqd, bfqg);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(entity->tree != NULL);
++
++ /*
++ * No need to defer the kfree() to the end of the RCU grace
++ * period: we are called from the destroy() callback of our
++ * cgroup, so we can be sure that no one is a) still using
++ * this cgroup or b) doing lookups in it.
++ */
++ kfree(bfqg);
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
++ bfq_end_wr_async_queues(bfqd, bfqg);
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++/**
++ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
++ * @bfqd: the device descriptor being exited.
++ *
++ * When the device exits we just make sure that no lookup can return
++ * the now unused group structures. They will be deallocated on cgroup
++ * destruction.
++ */
++static void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ bfq_log(bfqd, "disconnect_groups beginning");
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
++ hlist_del(&bfqg->bfqd_node);
++
++ __bfq_deactivate_entity(bfqg->my_entity, 0);
++
++ /*
++ * Don't remove from the group hash, just set an
++ * invalid key. No lookups can race with the
++ * assignment as bfqd is being destroyed; this
++ * implies also that new elements cannot be added
++ * to the list.
++ */
++ rcu_assign_pointer(bfqg->bfqd, NULL);
++
++ bfq_log(bfqd, "disconnect_groups: put async for group %p",
++ bfqg);
++ bfq_put_async_queues(bfqd, bfqg);
++ }
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
++ struct bfq_group *bfqg = bfqd->root_group;
++
++ bfq_put_async_queues(bfqd, bfqg);
++
++ spin_lock_irq(&bgrp->lock);
++ hlist_del_rcu(&bfqg->group_node);
++ spin_unlock_irq(&bgrp->lock);
++
++ /*
++ * No need to synchronize_rcu() here: since the device is gone
++ * there cannot be any read-side access to its root_group.
++ */
++ kfree(bfqg);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++ int i;
++
++ bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ bfqg->entity.parent = NULL;
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ bgrp = &bfqio_root_cgroup;
++ spin_lock_irq(&bgrp->lock);
++ rcu_assign_pointer(bfqg->bfqd, bfqd);
++ hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
++ spin_unlock_irq(&bgrp->lock);
++
++ return bfqg;
++}
++
++#define SHOW_FUNCTION(__VAR) \
++static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
++ struct cftype *cftype) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ u64 ret = -ENODEV; \
++ \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ ret = bgrp->__VAR; \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++SHOW_FUNCTION(weight);
++SHOW_FUNCTION(ioprio);
++SHOW_FUNCTION(ioprio_class);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
++static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
++ struct cftype *cftype, \
++ u64 val) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ struct bfq_group *bfqg; \
++ int ret = -EINVAL; \
++ \
++ if (val < (__MIN) || val > (__MAX)) \
++ return ret; \
++ \
++ ret = -ENODEV; \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ ret = 0; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ bgrp->__VAR = (unsigned short)val; \
++ hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) { \
++ /* \
++ * Setting the ioprio_changed flag of the entity \
++ * to 1 with new_##__VAR == ##__VAR would re-set \
++ * the value of the weight to its ioprio mapping. \
++ * Set the flag only if necessary. \
++ */ \
++ if ((unsigned short)val != bfqg->entity.new_##__VAR) { \
++ bfqg->entity.new_##__VAR = (unsigned short)val; \
++ /* \
++ * Make sure that the above new value has been \
++ * stored in bfqg->entity.new_##__VAR before \
++ * setting the ioprio_changed flag. In fact, \
++ * this flag may be read asynchronously (in \
++ * critical sections protected by a different \
++ * lock than that held here), and finding this \
++ * flag set may cause the execution of the code \
++ * for updating parameters whose value may \
++ * depend also on bfqg->entity.new_##__VAR (in \
++ * __bfq_entity_update_weight_prio). \
++ * This barrier makes sure that the new value \
++ * of bfqg->entity.new_##__VAR is correctly \
++ * seen in that code. \
++ */ \
++ smp_wmb(); \
++ bfqg->entity.ioprio_changed = 1; \
++ } \
++ } \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
++STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
++STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
++#undef STORE_FUNCTION
++
++static struct cftype bfqio_files[] = {
++ {
++ .name = "weight",
++ .read_u64 = bfqio_cgroup_weight_read,
++ .write_u64 = bfqio_cgroup_weight_write,
++ },
++ {
++ .name = "ioprio",
++ .read_u64 = bfqio_cgroup_ioprio_read,
++ .write_u64 = bfqio_cgroup_ioprio_write,
++ },
++ {
++ .name = "ioprio_class",
++ .read_u64 = bfqio_cgroup_ioprio_class_read,
++ .write_u64 = bfqio_cgroup_ioprio_class_write,
++ },
++ { }, /* terminate */
++};
++
++static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
++ *parent_css)
++{
++ struct bfqio_cgroup *bgrp;
++
++ if (parent_css != NULL) {
++ bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
++ if (bgrp == NULL)
++ return ERR_PTR(-ENOMEM);
++ } else
++ bgrp = &bfqio_root_cgroup;
++
++ spin_lock_init(&bgrp->lock);
++ INIT_HLIST_HEAD(&bgrp->group_data);
++ bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
++ bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
++
++ return &bgrp->css;
++}
++
++/*
++ * We cannot support shared io contexts, as we have no means to support
++ * two tasks with the same ioc in two different groups without major rework
++ * of the main bic/bfqq data structures. For now, we allow a task to change
++ * its cgroup only if it is the only owner of its ioc; the drawback of this
++ * behavior is that a group containing a task that forked using CLONE_IO
++ * will not be destroyed until the tasks sharing the ioc die.
++ */
++static int bfqio_can_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ int ret = 0;
++
++ cgroup_taskset_for_each(task, tset) {
++ /*
++ * task_lock() is needed to avoid races with
++ * exit_io_context()
++ */
++ task_lock(task);
++ ioc = task->io_context;
++ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
++ /*
++ * ioc == NULL means that the task is either too
++ * young or exiting: if it still has no ioc, the
++ * ioc can't be shared; if the task is exiting,
++ * the attach will fail anyway, no matter what
++ * we return here.
++ */
++ ret = -EINVAL;
++ task_unlock(task);
++ if (ret)
++ break;
++ }
++
++ return ret;
++}
++
++static void bfqio_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ struct io_cq *icq;
++
++ /*
++ * IMPORTANT NOTE: The move of more than one process at a time to a
++ * new group has not yet been tested.
++ */
++ cgroup_taskset_for_each(task, tset) {
++ ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
++ if (ioc) {
++ /*
++ * Handle cgroup change here.
++ */
++ rcu_read_lock();
++ hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
++ if (!strncmp(
++ icq->q->elevator->type->elevator_name,
++ "bfq", ELV_NAME_MAX))
++ bfq_bic_change_cgroup(icq_to_bic(icq),
++ css);
++ rcu_read_unlock();
++ put_io_context(ioc);
++ }
++ }
++}
++
++static void bfqio_destroy(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ /*
++ * Since we are destroying the cgroup, there are no more tasks
++ * referencing it, and all the RCU grace periods that may have
++ * referenced it are ended (as the destruction of the parent
++ * cgroup is RCU-safe); bgrp->group_data will not be accessed by
++ * anything else and we don't need any synchronization.
++ */
++ hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
++ bfq_destroy_group(bgrp, bfqg);
++
++ BUG_ON(!hlist_empty(&bgrp->group_data));
++
++ kfree(bgrp);
++}
++
++static int bfqio_css_online(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = true;
++ mutex_unlock(&bfqio_mutex);
++
++ return 0;
++}
++
++static void bfqio_css_offline(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = false;
++ mutex_unlock(&bfqio_mutex);
++}
++
++struct cgroup_subsys bfqio_cgrp_subsys = {
++ .css_alloc = bfqio_create,
++ .css_online = bfqio_css_online,
++ .css_offline = bfqio_css_offline,
++ .can_attach = bfqio_can_attach,
++ .attach = bfqio_attach,
++ .css_free = bfqio_destroy,
++ .base_cftypes = bfqio_files,
++};
++#else
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static inline struct bfq_group *
++bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ return bfqd->root_group;
++}
++
++static inline void bfq_bfqq_move(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ bfq_put_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ kfree(bfqd->root_group);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ int i;
++
++ bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ return bfqg;
++}
++#endif
+diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
+new file mode 100644
+index 0000000..7f6b000
+--- /dev/null
++++ b/block/bfq-ioc.c
+@@ -0,0 +1,36 @@
++/*
++ * BFQ: I/O context handling.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++/**
++ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
++ * @icq: the iocontext queue.
++ */
++static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
++{
++ /* bic->icq is the first member, %NULL will convert to %NULL */
++ return container_of(icq, struct bfq_io_cq, icq);
++}
++
++/**
++ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
++ * @bfqd: the lookup key.
++ * @ioc: the io_context of the process doing I/O.
++ *
++ * Queue lock must be held.
++ */
++static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
++ struct io_context *ioc)
++{
++ if (ioc)
++ return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
++ return NULL;
++}
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+new file mode 100644
+index 0000000..0a0891b
+--- /dev/null
++++ b/block/bfq-iosched.c
+@@ -0,0 +1,3617 @@
++/*
++ * Budget Fair Queueing (BFQ) disk scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ *
++ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
++ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
++ * measured in number of sectors, to processes instead of time slices. The
++ * device is not granted to the in-service process for a given time slice,
++ * but until it has exhausted its assigned budget. This change from the time
++ * to the service domain allows BFQ to distribute the device throughput
++ * among processes as desired, without any distortion due to ZBR, workload
++ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
++ * called B-WF2Q+, to schedule processes according to their budgets. More
++ * precisely, BFQ schedules queues associated to processes. Thanks to the
++ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
++ * I/O-bound processes issuing sequential requests (to boost the
++ * throughput), and yet guarantee a low latency to interactive and soft
++ * real-time applications.
++ *
++ * BFQ is described in [1], which also contains a reference to the
++ * initial, more theoretical paper on BFQ. The interested reader can find
++ * in the latter paper full details on the main algorithm, as well as
++ * formulas of the guarantees and formal proofs of all the properties.
++ * With respect to the version of BFQ presented in these papers, this
++ * implementation adds a few more heuristics, such as the one that
++ * guarantees a low latency to soft real-time applications, and a
++ * hierarchical extension based on H-WF2Q+.
++ *
++ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
++ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
++ * complexity derives from the one introduced with EEVDF in [3].
++ *
++ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
++ * with the BFQ Disk I/O Scheduler'',
++ * Proceedings of the 5th Annual International Systems and Storage
++ * Conference (SYSTOR '12), June 2012.
++ *
++ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
++ *
++ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
++ * Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
++ * Oct 1997.
++ *
++ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
++ *
++ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
++ * First: A Flexible and Accurate Mechanism for Proportional Share
++ * Resource Allocation,'' technical report.
++ *
++ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
++ */
++#include <linux/module.h>
++#include <linux/slab.h>
++#include <linux/blkdev.h>
++#include <linux/cgroup.h>
++#include <linux/elevator.h>
++#include <linux/jiffies.h>
++#include <linux/rbtree.h>
++#include <linux/ioprio.h>
++#include "bfq.h"
++#include "blk.h"
++
++/* Max number of dispatches in one round of service. */
++static const int bfq_quantum = 4;
++
++/* Expiration time of sync (0) and async (1) requests, in jiffies. */
++static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
++
++/* Maximum backwards seek, in KiB. */
++static const int bfq_back_max = 16 * 1024;
++
++/* Penalty of a backwards seek, in number of sectors. */
++static const int bfq_back_penalty = 2;
++
++/* Idling period duration, in jiffies. */
++static int bfq_slice_idle = HZ / 125;
++
++/* Default maximum budget values, in sectors and number of requests. */
++static const int bfq_default_max_budget = 16 * 1024;
++static const int bfq_max_budget_async_rq = 4;
++
++/*
++ * Async to sync throughput distribution is controlled as follows:
++ * when an async request is served, the entity is charged the number
++ * of sectors of the request, multiplied by the factor below
++ */
++static const int bfq_async_charge_factor = 10;
++
++/* Default timeout values, in jiffies, approximating CFQ defaults. */
++static const int bfq_timeout_sync = HZ / 8;
++static int bfq_timeout_async = HZ / 25;
++
++struct kmem_cache *bfq_pool;
++
++/* Below this threshold (in ms), we consider thinktime immediate. */
++#define BFQ_MIN_TT 2
++
++/* hw_tag detection: parallel requests threshold and min samples needed. */
++#define BFQ_HW_QUEUE_THRESHOLD 4
++#define BFQ_HW_QUEUE_SAMPLES 32
++
++#define BFQQ_SEEK_THR (sector_t)(8 * 1024)
++#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
++
++/* Min samples used for peak rate estimation (for autotuning). */
++#define BFQ_PEAK_RATE_SAMPLES 32
++
++/* Shift used for peak rate fixed precision calculations. */
++#define BFQ_RATE_SHIFT 16
++
++/*
++ * By default, BFQ computes the duration of the weight raising for
++ * interactive applications automatically, using the following formula:
++ * duration = (R / r) * T, where r is the peak rate of the device, and
++ * R and T are two reference parameters.
++ * In particular, R is the peak rate of the reference device (see below),
++ * and T is a reference time: given the systems that are likely to be
++ * installed on the reference device according to its speed class, T is
++ * about the maximum time needed, under BFQ and while reading two files in
++ * parallel, to load typical large applications on these systems.
++ * In practice, the slower/faster the device at hand is, the more/less it
++ * takes to load applications with respect to the reference device.
++ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
++ * applications.
++ *
++ * BFQ uses four different reference pairs (R, T), depending on:
++ * . whether the device is rotational or non-rotational;
++ * . whether the device is slow, such as old or portable HDDs, as well as
++ * SD cards, or fast, such as newer HDDs and SSDs.
++ *
++ * The device's speed class is dynamically (re)detected in
++ * bfq_update_peak_rate() every time the estimated peak rate is updated.
++ *
++ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
++ * are the reference values for a slow/fast rotational device, whereas
++ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
++ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
++ * thresholds used to switch between speed classes.
++ * Both the reference peak rates and the thresholds are measured in
++ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
++ */
++static int R_slow[2] = {1536, 10752};
++static int R_fast[2] = {17415, 34791};
++/*
++ * To improve readability, a conversion function is used to initialize the
++ * following arrays, which entails that they can be initialized only in a
++ * function.
++ */
++static int T_slow[2];
++static int T_fast[2];
++static int device_speed_thresh[2];
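The weight-raising rule described in the comment above, duration = (R / r) * T, can be modeled in a few lines of user-space C. This is only an illustrative sketch of the arithmetic, not the kernel's implementation; the function name and the sample values are hypothetical.

```c
#include <assert.h>

/*
 * Illustrative model of the weight-raising duration formula:
 * duration = (R / r) * T, where r is the device's measured peak rate
 * and (R, T) are the reference pair for its speed class. Computed as
 * (R * T) / r to stay in integer arithmetic.
 */
static unsigned long wr_duration(unsigned long r_peak,
				 unsigned long R_ref,
				 unsigned long T_ref)
{
	/* A slower device (smaller r_peak) gets a longer raising period. */
	return (R_ref * T_ref) / r_peak;
}
```

A device running at half the reference rate is weight-raised for twice the reference time, matching the "slower device, longer raising" behavior the comment describes.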
++
++#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \
++ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
++
++#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0])
++#define RQ_BFQQ(rq) ((rq)->elv.priv[1])
++
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
++
++#include "bfq-ioc.c"
++#include "bfq-sched.c"
++#include "bfq-cgroup.c"
++
++#define bfq_class_idle(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_IDLE)
++#define bfq_class_rt(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_RT)
++
++#define bfq_sample_valid(samples) ((samples) > 80)
++
++/*
++ * We regard a request as SYNC if it is either a read or has the SYNC
++ * bit set (in which case it could also be a direct WRITE).
++ */
++static inline int bfq_bio_sync(struct bio *bio)
++{
++ if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
++ return 1;
++
++ return 0;
++}
++
++/*
++ * Schedule a run of the queue if there are requests pending and no one
++ * in the driver will restart queueing.
++ */
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
++{
++ if (bfqd->queued != 0) {
++ bfq_log(bfqd, "schedule dispatch");
++ kblockd_schedule_work(&bfqd->unplug_work);
++ }
++}
++
++/*
++ * Lifted from AS - choose which of rq1 and rq2 is best served now.
++ * We choose the request that is closest to the head right now. Distance
++ * behind the head is penalized and only allowed to a certain extent.
++ */
++static struct request *bfq_choose_req(struct bfq_data *bfqd,
++ struct request *rq1,
++ struct request *rq2,
++ sector_t last)
++{
++ sector_t s1, s2, d1 = 0, d2 = 0;
++ unsigned long back_max;
++#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */
++#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */
++ unsigned wrap = 0; /* bit mask: requests behind the disk head? */
++
++ if (rq1 == NULL || rq1 == rq2)
++ return rq2;
++ if (rq2 == NULL)
++ return rq1;
++
++ if (rq_is_sync(rq1) && !rq_is_sync(rq2))
++ return rq1;
++ else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
++ return rq2;
++ if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
++ return rq1;
++ else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
++ return rq2;
++
++ s1 = blk_rq_pos(rq1);
++ s2 = blk_rq_pos(rq2);
++
++ /*
++ * By definition, 1KiB is 2 sectors.
++ */
++ back_max = bfqd->bfq_back_max * 2;
++
++ /*
++ * Strict one way elevator _except_ in the case where we allow
++ * short backward seeks which are biased as twice the cost of a
++ * similar forward seek.
++ */
++ if (s1 >= last)
++ d1 = s1 - last;
++ else if (s1 + back_max >= last)
++ d1 = (last - s1) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ1_WRAP;
++
++ if (s2 >= last)
++ d2 = s2 - last;
++ else if (s2 + back_max >= last)
++ d2 = (last - s2) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ2_WRAP;
++
++ /* Found required data */
++
++ /*
++ * By doing switch() on the bit mask "wrap" we avoid having to
++ * check two variables for all permutations: --> faster!
++ */
++ switch (wrap) {
++ case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
++ if (d1 < d2)
++ return rq1;
++ else if (d2 < d1)
++ return rq2;
++ else {
++ if (s1 >= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++
++ case BFQ_RQ2_WRAP:
++ return rq1;
++ case BFQ_RQ1_WRAP:
++ return rq2;
++ case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
++ default:
++ /*
++ * Since both rqs are wrapped,
++ * start with the one that's further behind head
++ * (--> only *one* back seek required),
++ * since back seek takes more time than forward.
++ */
++ if (s1 <= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++}
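The distance rule that bfq_choose_req() applies to each request can be isolated into a small user-space sketch: a request in front of the head costs its forward distance, a request at most back_max behind the head costs the backward distance times the penalty, and anything further behind "wraps" and loses to any non-wrapped request. The helper below is illustrative only; WRAP stands in for the bitmask handling in the real code.

```c
#include <assert.h>

/* Sentinel for a request too far behind the head (a "wrapped" request). */
#define WRAP ((unsigned long)-1)

/*
 * Sketch of the per-request seek cost used above: forward seeks cost
 * their distance, short backward seeks are penalized, long backward
 * seeks wrap.
 */
static unsigned long seek_cost(unsigned long s, unsigned long head,
			       unsigned long back_max,
			       unsigned long penalty)
{
	if (s >= head)
		return s - head;		/* forward seek */
	if (s + back_max >= head)
		return (head - s) * penalty;	/* short, penalized back seek */
	return WRAP;				/* too far behind the head */
}
```

With penalty = 2, a request 5 sectors behind the head costs the same as one 10 sectors ahead, which is exactly the "biased as twice the cost of a similar forward seek" behavior described above.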
++
++static struct bfq_queue *
++bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
++ sector_t sector, struct rb_node **ret_parent,
++ struct rb_node ***rb_link)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *bfqq = NULL;
++
++ parent = NULL;
++ p = &root->rb_node;
++ while (*p) {
++ struct rb_node **n;
++
++ parent = *p;
++ bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++
++ /*
++ * Sort strictly based on sector. Smallest to the left,
++ * largest to the right.
++ */
++ if (sector > blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_right;
++ else if (sector < blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_left;
++ else
++ break;
++ p = n;
++ bfqq = NULL;
++ }
++
++ *ret_parent = parent;
++ if (rb_link)
++ *rb_link = p;
++
++ bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
++ (unsigned long long)sector,
++ bfqq != NULL ? bfqq->pid : 0);
++
++ return bfqq;
++}
++
++static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *__bfqq;
++
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++
++ if (bfq_class_idle(bfqq))
++ return;
++ if (!bfqq->next_rq)
++ return;
++
++ bfqq->pos_root = &bfqd->rq_pos_tree;
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
++ blk_rq_pos(bfqq->next_rq), &parent, &p);
++ if (__bfqq == NULL) {
++ rb_link_node(&bfqq->pos_node, parent, p);
++ rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
++ } else
++ bfqq->pos_root = NULL;
++}
++
++/*
++ * Tell whether there are active queues or groups with differentiated weights.
++ */
++static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
++{
++ BUG_ON(!bfqd->hw_tag);
++ /*
++ * For weights to differ, at least one of the trees must contain
++ * at least two nodes.
++ */
++ return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
++ (bfqd->queue_weights_tree.rb_node->rb_left ||
++ bfqd->queue_weights_tree.rb_node->rb_right)
++#ifdef CONFIG_CGROUP_BFQIO
++ ) ||
++ (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
++ (bfqd->group_weights_tree.rb_node->rb_left ||
++ bfqd->group_weights_tree.rb_node->rb_right)
++#endif
++ );
++}
++
++/*
++ * If the weight-counter tree passed as input contains no counter for
++ * the weight of the input entity, then add that counter; otherwise just
++ * increment the existing counter.
++ *
++ * Note that weight-counter trees contain few nodes in mostly symmetric
++ * scenarios. For example, if all queues have the same weight, then the
++ * weight-counter tree for the queues may contain at most one node.
++ * This holds even if low_latency is on, because weight-raised queues
++ * are not inserted in the tree.
++ * In most scenarios, the rate at which nodes are created/destroyed
++ * should be low too.
++ */
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ struct rb_node **new = &(root->rb_node), *parent = NULL;
++
++ /*
++ * Do not insert if:
++ * - the device does not support queueing;
++ * - the entity is already associated with a counter, which happens if:
++ * 1) the entity is associated with a queue, 2) a request arrival
++ * has caused the queue to become both non-weight-raised, and hence
++ * change its weight, and backlogged; in this respect, each
++ * of the two events causes an invocation of this function,
++ * 3) this is the invocation of this function caused by the second
++ * event. This second invocation is actually useless, and we handle
++ * this fact by exiting immediately. More efficient or clearer
++ * solutions might possibly be adopted.
++ */
++ if (!bfqd->hw_tag || entity->weight_counter)
++ return;
++
++ while (*new) {
++ struct bfq_weight_counter *__counter = container_of(*new,
++ struct bfq_weight_counter,
++ weights_node);
++ parent = *new;
++
++ if (entity->weight == __counter->weight) {
++ entity->weight_counter = __counter;
++ goto inc_counter;
++ }
++ if (entity->weight < __counter->weight)
++ new = &((*new)->rb_left);
++ else
++ new = &((*new)->rb_right);
++ }
++
++ entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
++ GFP_ATOMIC);
++ if (entity->weight_counter == NULL)
++ return;
++ entity->weight_counter->weight = entity->weight;
++ rb_link_node(&entity->weight_counter->weights_node, parent, new);
++ rb_insert_color(&entity->weight_counter->weights_node, root);
++
++inc_counter:
++ entity->weight_counter->num_active++;
++}
++
++/*
++ * Decrement the weight counter associated with the entity, and, if the
++ * counter reaches 0, remove the counter from the tree.
++ * See the comments to the function bfq_weights_tree_add() for considerations
++ * about overhead.
++ */
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ /*
++ * Check whether the entity is actually associated with a counter.
++ * In fact, the device may not be considered NCQ-capable for a while,
++ * which implies that no insertion in the weight trees is performed,
++ * after which the device may start to be deemed NCQ-capable, and hence
++ * this function may start to be invoked. This may cause the function
++ * to be invoked for entities that are not associated with any counter.
++ */
++ if (!entity->weight_counter)
++ return;
++
++ BUG_ON(RB_EMPTY_ROOT(root));
++ BUG_ON(entity->weight_counter->weight != entity->weight);
++
++ BUG_ON(!entity->weight_counter->num_active);
++ entity->weight_counter->num_active--;
++ if (entity->weight_counter->num_active > 0)
++ goto reset_entity_pointer;
++
++ rb_erase(&entity->weight_counter->weights_node, root);
++ kfree(entity->weight_counter);
++
++reset_entity_pointer:
++ entity->weight_counter = NULL;
++}
++
++static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *last)
++{
++ struct rb_node *rbnext = rb_next(&last->rb_node);
++ struct rb_node *rbprev = rb_prev(&last->rb_node);
++ struct request *next = NULL, *prev = NULL;
++
++ BUG_ON(RB_EMPTY_NODE(&last->rb_node));
++
++ if (rbprev != NULL)
++ prev = rb_entry_rq(rbprev);
++
++ if (rbnext != NULL)
++ next = rb_entry_rq(rbnext);
++ else {
++ rbnext = rb_first(&bfqq->sort_list);
++ if (rbnext && rbnext != &last->rb_node)
++ next = rb_entry_rq(rbnext);
++ }
++
++ return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
++}
++
++/* see the definition of bfq_async_charge_factor for details */
++static inline unsigned long bfq_serv_to_charge(struct request *rq,
++ struct bfq_queue *bfqq)
++{
++ return blk_rq_sectors(rq) *
++ (1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
++ bfq_async_charge_factor));
++}
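As the comment above notes (see the definition of bfq_async_charge_factor), async queues whose weight is not being raised (wr_coeff == 1) are charged more service than the sectors they actually transfer. A minimal user-space sketch of that arithmetic, with illustrative names rather than the kernel's:

```c
#include <assert.h>

/* Illustrative sketch of the service charge: sync or weight-raised
 * queues are charged exactly the request size in sectors; async,
 * non-raised queues are charged (1 + factor) times as much. */
static unsigned long serv_to_charge(unsigned long sectors, int sync,
                                    unsigned int wr_coeff,
                                    unsigned int async_charge_factor)
{
    return sectors *
        (1 + ((!sync) * (wr_coeff == 1) * async_charge_factor));
}
```

With a factor of 10, an 8-sector async request from a non-raised queue is charged 88 sectors of service, pushing its timestamps forward and deprioritizing it.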
++
++/**
++ * bfq_updated_next_req - update the queue after a new next_rq selection.
++ * @bfqd: the device data the queue belongs to.
++ * @bfqq: the queue to update.
++ *
++ * If the first request of a queue changes we make sure that the queue
++ * has enough budget to serve at least its first request (if the
++ * request has grown). We do this because if the queue has not enough
++ * budget for its first request, it has to go through two dispatch
++ * rounds to actually get it dispatched.
++ */
++static void bfq_updated_next_req(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ struct request *next_rq = bfqq->next_rq;
++ unsigned long new_budget;
++
++ if (next_rq == NULL)
++ return;
++
++ if (bfqq == bfqd->in_service_queue)
++ /*
++ * In order not to break guarantees, budgets cannot be
++ * changed after an entity has been selected.
++ */
++ return;
++
++ BUG_ON(entity->tree != &st->active);
++ BUG_ON(entity == entity->sched_data->in_service_entity);
++
++ new_budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ if (entity->budget != new_budget) {
++ entity->budget = new_budget;
++ bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
++ new_budget);
++ bfq_activate_bfqq(bfqd, bfqq);
++ }
++}
++
++static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
++{
++ u64 dur;
++
++ if (bfqd->bfq_wr_max_time > 0)
++ return bfqd->bfq_wr_max_time;
++
++ dur = bfqd->RT_prod;
++ do_div(dur, bfqd->peak_rate);
++
++ return dur;
++}
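The weight-raising duration computed above is either a user-fixed maximum or the precomputed R*T product scaled down by the estimated peak rate. A hedged user-space sketch, with illustrative names and the kernel's do_div() replaced by plain integer division:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of the weight-raising duration: honor a
 * user-configured maximum if set, otherwise derive the duration
 * from the rate*time product and the measured peak rate. */
static uint64_t wr_duration(uint64_t rt_prod, uint64_t peak_rate,
                            uint64_t user_max_time)
{
    if (user_max_time > 0)
        return user_max_time;
    return rt_prod / peak_rate;
}
```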
++
++static void bfq_add_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *next_rq, *prev;
++ unsigned long old_wr_coeff = bfqq->wr_coeff;
++ int idle_for_long_time = 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
++ bfqq->queued[rq_is_sync(rq)]++;
++ bfqd->queued++;
++
++ elv_rb_add(&bfqq->sort_list, rq);
++
++ /*
++ * Check if this request is a better next-serve candidate.
++ */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++
++ /*
++ * Adjust priority tree position, if next_rq changes.
++ */
++ if (prev != bfqq->next_rq)
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++
++ if (!bfq_bfqq_busy(bfqq)) {
++ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ time_is_before_jiffies(bfqq->soft_rt_next_start);
++ idle_for_long_time = time_is_before_jiffies(
++ bfqq->budget_timeout +
++ bfqd->bfq_wr_min_idle_time);
++ entity->budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++
++ if (!bfq_bfqq_IO_bound(bfqq)) {
++ if (time_before(jiffies,
++ RQ_BIC(rq)->ttime.last_end_request +
++ bfqd->bfq_slice_idle)) {
++ bfqq->requests_within_timer++;
++ if (bfqq->requests_within_timer >=
++ bfqd->bfq_requests_within_timer)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ } else
++ bfqq->requests_within_timer = 0;
++ }
++
++ if (!bfqd->low_latency)
++ goto add_bfqq_busy;
++
++ /*
++ * If the queue is not being boosted and has been idle
++ * for enough time, start a weight-raising period
++ */
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ } else if (old_wr_coeff > 1) {
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else if (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt) {
++ bfqq->wr_coeff = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->
++ wr_cur_max_time));
++ } else if (time_before(
++ bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time,
++ jiffies +
++ bfqd->bfq_wr_rt_max_time) &&
++ soft_rt) {
++ /*
++ * The remaining weight-raising time is lower
++ * than bfqd->bfq_wr_rt_max_time, which
++ * means that the application is enjoying
++ * weight raising either because deemed soft-
++ * rt in the near past, or because deemed
++ * interactive long ago. In both cases,
++ * resetting now the current remaining weight-
++ * raising time for the application to the
++ * weight-raising duration for soft rt
++ * applications would not cause any latency
++ * increase for the application (as the new
++ * duration would be higher than the remaining
++ * time).
++ *
++ * In addition, the application is now meeting
++ * the requirements for being deemed soft rt.
++ * In the end we can correctly and safely
++ * (re)charge the weight-raising duration for
++ * the application with the weight-raising
++ * duration for soft rt applications.
++ *
++ * In particular, doing this recharge now, i.e.,
++ * before the weight-raising period for the
++ * application finishes, reduces the probability
++ * of the following negative scenario:
++ * 1) the weight of a soft rt application is
++ * raised at startup (as for any newly
++ * created application),
++ * 2) since the application is not interactive,
++ * at a certain time weight-raising is
++ * stopped for the application,
++ * 3) at that time the application happens to
++ * still have pending requests, and hence
++ * is destined to not have a chance to be
++ * deemed soft rt before these requests are
++ * completed (see the comments to the
++ * function bfq_bfqq_softrt_next_start()
++ * for details on soft rt detection),
++ * 4) these pending requests experience a high
++ * latency because the application is not
++ * weight-raised while they are pending.
++ */
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ }
++ }
++ if (old_wr_coeff != bfqq->wr_coeff)
++ entity->ioprio_changed = 1;
++add_bfqq_busy:
++ bfqq->last_idle_bklogged = jiffies;
++ bfqq->service_from_backlogged = 0;
++ bfq_clear_bfqq_softrt_update(bfqq);
++ bfq_add_bfqq_busy(bfqd, bfqq);
++ } else {
++ if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
++ time_is_before_jiffies(
++ bfqq->last_wr_start_finish +
++ bfqd->bfq_wr_min_inter_arr_async)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++
++ bfqd->wr_busy_queues++;
++ entity->ioprio_changed = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "non-idle wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++ if (prev != bfqq->next_rq)
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ if (bfqd->low_latency &&
++ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
++ idle_for_long_time))
++ bfqq->last_wr_start_finish = jiffies;
++}
++
++static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
++ struct bio *bio)
++{
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return NULL;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ if (bfqq != NULL)
++ return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
++
++ return NULL;
++}
++
++static void bfq_activate_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ bfqd->rq_in_driver++;
++ bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
++ bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
++ (long long unsigned)bfqd->last_position);
++}
++
++static inline void bfq_deactivate_request(struct request_queue *q,
++ struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ BUG_ON(bfqd->rq_in_driver == 0);
++ bfqd->rq_in_driver--;
++}
++
++static void bfq_remove_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ const int sync = rq_is_sync(rq);
++
++ if (bfqq->next_rq == rq) {
++ bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ list_del_init(&rq->queuelist);
++ BUG_ON(bfqq->queued[sync] == 0);
++ bfqq->queued[sync]--;
++ bfqd->queued--;
++ elv_rb_del(&bfqq->sort_list, rq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ /*
++ * Remove queue from request-position tree as it is empty.
++ */
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++ }
++
++ if (rq->cmd_flags & REQ_META) {
++ BUG_ON(bfqq->meta_pending == 0);
++ bfqq->meta_pending--;
++ }
++}
++
++static int bfq_merge(struct request_queue *q, struct request **req,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct request *__rq;
++
++ __rq = bfq_find_rq_fmerge(bfqd, bio);
++ if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
++ *req = __rq;
++ return ELEVATOR_FRONT_MERGE;
++ }
++
++ return ELEVATOR_NO_MERGE;
++}
++
++static void bfq_merged_request(struct request_queue *q, struct request *req,
++ int type)
++{
++ if (type == ELEVATOR_FRONT_MERGE &&
++ rb_prev(&req->rb_node) &&
++ blk_rq_pos(req) <
++ blk_rq_pos(container_of(rb_prev(&req->rb_node),
++ struct request, rb_node))) {
++ struct bfq_queue *bfqq = RQ_BFQQ(req);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *prev, *next_rq;
++
++ /* Reposition request in its sort_list */
++ elv_rb_del(&bfqq->sort_list, req);
++ elv_rb_add(&bfqq->sort_list, req);
++ /* Choose next request to be served for bfqq */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
++ bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++ /*
++ * If next_rq changes, update both the queue's budget to
++ * fit the new request and the queue's position in its
++ * rq_pos_tree.
++ */
++ if (prev != bfqq->next_rq) {
++ bfq_updated_next_req(bfqd, bfqq);
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++ }
++}
++
++static void bfq_merged_requests(struct request_queue *q, struct request *rq,
++ struct request *next)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * Reposition in fifo if next is older than rq.
++ */
++ if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
++ time_before(next->fifo_time, rq->fifo_time)) {
++ list_move(&rq->queuelist, &next->queuelist);
++ rq->fifo_time = next->fifo_time;
++ }
++
++ if (bfqq->next_rq == next)
++ bfqq->next_rq = rq;
++
++ bfq_remove_request(next);
++}
++
++/* Must be called with bfqq != NULL */
++static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq == NULL);
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues--;
++ bfqq->wr_coeff = 1;
++ bfqq->wr_cur_max_time = 0;
++ /* Trigger a weight change on the next activation of the queue */
++ bfqq->entity.ioprio_changed = 1;
++}
++
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ if (bfqg->async_bfqq[i][j] != NULL)
++ bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
++ if (bfqg->async_idle_bfqq != NULL)
++ bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
++}
++
++static void bfq_end_wr(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq;
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ bfq_end_wr_async(bfqd);
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (!bfqq)
++ bfqq = bfq_get_next_queue(bfqd);
++ else
++ bfq_get_next_queue_forced(bfqd, bfqq);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
++static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
++ struct request *rq)
++{
++ if (blk_rq_pos(rq) >= bfqd->last_position)
++ return blk_rq_pos(rq) - bfqd->last_position;
++ else
++ return bfqd->last_position - blk_rq_pos(rq);
++}
++
++/*
++ * Return true if rq is close enough to bfqd->last_position, i.e.,
++ * within BFQQ_SEEK_THR sectors of it.
++ */
++static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++{
++ return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++}
++
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++{
++ struct rb_root *root = &bfqd->rq_pos_tree;
++ struct rb_node *parent, *node;
++ struct bfq_queue *__bfqq;
++ sector_t sector = bfqd->last_position;
++
++ if (RB_EMPTY_ROOT(root))
++ return NULL;
++
++ /*
++ * First, if we find a request starting at the end of the last
++ * request, choose it.
++ */
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
++ if (__bfqq != NULL)
++ return __bfqq;
++
++ /*
++ * If the exact sector wasn't found, the parent of the NULL leaf
++ * will contain the closest sector (rq_pos_tree sorted by
++ * next_request position).
++ */
++ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ if (blk_rq_pos(__bfqq->next_rq) < sector)
++ node = rb_next(&__bfqq->pos_node);
++ else
++ node = rb_prev(&__bfqq->pos_node);
++ if (node == NULL)
++ return NULL;
++
++ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ return NULL;
++}
++
++/*
++ * bfqd - obvious
++ * cur_bfqq - passed in so that we don't decide that the current queue
++ * is closely cooperating with itself.
++ *
++ * We are assuming that cur_bfqq has dispatched at least one request,
++ * and that bfqd->last_position reflects a position on the disk associated
++ * with the I/O issued by cur_bfqq.
++ */
++static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
++ struct bfq_queue *cur_bfqq)
++{
++ struct bfq_queue *bfqq;
++
++ if (bfq_class_idle(cur_bfqq))
++ return NULL;
++ if (!bfq_bfqq_sync(cur_bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(cur_bfqq))
++ return NULL;
++
++ /* If device has only one backlogged bfq_queue, don't search. */
++ if (bfqd->busy_queues == 1)
++ return NULL;
++
++ /*
++ * We should notice if some of the queues are cooperating, e.g.
++ * working closely on the same area of the disk. In that case,
++ * we can group them together and don't waste time idling.
++ */
++ bfqq = bfqq_close(bfqd);
++ if (bfqq == NULL || bfqq == cur_bfqq)
++ return NULL;
++
++ /*
++ * Do not merge queues from different bfq_groups.
++ */
++ if (bfqq->entity.parent != cur_bfqq->entity.parent)
++ return NULL;
++
++ /*
++ * It only makes sense to merge sync queues.
++ */
++ if (!bfq_bfqq_sync(bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(bfqq))
++ return NULL;
++
++ /*
++ * Do not merge queues of different priority classes.
++ */
++ if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
++ return NULL;
++
++ return bfqq;
++}
++
++/*
++ * If enough samples have been computed, return the current max budget
++ * stored in bfqd, which is dynamically updated according to the
++ * estimated disk peak rate; otherwise return the default max budget
++ */
++static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget;
++ else
++ return bfqd->bfq_max_budget;
++}
++
++/*
++ * Return min budget, which is a fraction of the current or default
++ * max budget (trying with 1/32)
++ */
++static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget / 32;
++ else
++ return bfqd->bfq_max_budget / 32;
++}
++
++static void bfq_arm_slice_timer(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ struct bfq_io_cq *bic;
++ unsigned long sl;
++
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Processes have exited, don't wait. */
++ bic = bfqd->in_service_bic;
++ if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
++ return;
++
++ bfq_mark_bfqq_wait_request(bfqq);
++
++ /*
++ * We don't want to idle for seeks, but we do want to allow
++ * fair distribution of slice time for a process doing back-to-back
++ * seeks. So allow a little bit of time for it to submit a new rq.
++ *
++ * To prevent processes with (partly) seeky workloads from
++ * being too ill-treated, grant them a small fraction of the
++ * assigned budget before reducing the waiting time to
++ * BFQ_MIN_TT. In practice this helps reduce latency.
++ */
++ sl = bfqd->bfq_slice_idle;
++ /*
++ * Unless the queue is being weight-raised, grant only minimum idle
++ * time if the queue either has been seeky for long enough or has
++ * already proved to be constantly seeky.
++ */
++ if (bfq_sample_valid(bfqq->seek_samples) &&
++ ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
++ bfq_max_budget(bfqq->bfqd) / 8) ||
++ bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
++ sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
++ else if (bfqq->wr_coeff > 1)
++ sl = sl * 3;
++ bfqd->last_idling_start = ktime_get();
++ mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
++ bfq_log(bfqd, "arm idle: %u/%u ms",
++ jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
++}
++
++/*
++ * Set the maximum time for the in-service queue to consume its
++ * budget. This prevents seeky processes from lowering the disk
++ * throughput (always guaranteed with a time slice scheme as in CFQ).
++ */
++static void bfq_set_budget_timeout(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ unsigned int timeout_coeff;
++ if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
++ timeout_coeff = 1;
++ else
++ timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
++
++ bfqd->last_budget_start = ktime_get();
++
++ bfq_clear_bfqq_budget_new(bfqq);
++ bfqq->budget_timeout = jiffies +
++ bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
++
++ bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
++ jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
++ timeout_coeff));
++}
++
++/*
++ * Move request from internal lists to the request queue dispatch list.
++ */
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * For consistency, the next instruction should have been executed
++ * after removing the request from the queue and dispatching it.
++ * We execute instead this instruction before bfq_remove_request()
++ * (and hence introduce a temporary inconsistency), for efficiency.
++ * In fact, in a forced_dispatch, this prevents two counters related
++ * to bfqq->dispatched from being uselessly decremented if bfqq
++ * is not in service, and then incremented again after
++ * incrementing bfqq->dispatched.
++ */
++ bfqq->dispatched++;
++ bfq_remove_request(rq);
++ elv_dispatch_sort(q, rq);
++
++ if (bfq_bfqq_sync(bfqq))
++ bfqd->sync_flight++;
++}
++
++/*
++ * Return expired entry, or NULL to just start from scratch in rbtree.
++ */
++static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
++{
++ struct request *rq = NULL;
++
++ if (bfq_bfqq_fifo_expire(bfqq))
++ return NULL;
++
++ bfq_mark_bfqq_fifo_expire(bfqq);
++
++ if (list_empty(&bfqq->fifo))
++ return NULL;
++
++ rq = rq_entry_fifo(bfqq->fifo.next);
++
++ if (time_before(jiffies, rq->fifo_time))
++ return NULL;
++
++ return rq;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
++static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return;
++
++ /*
++ * Merge in the direction of the lesser amount of work.
++ */
++ if (new_process_refs >= process_refs) {
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ } else {
++ new_bfqq->new_bfqq = bfqq;
++ atomic_add(new_process_refs, &bfqq->ref);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++}
++
++static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ return entity->budget - entity->service;
++}
++
++static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ /*
++ * If this bfqq is shared between multiple processes, check
++ * to make sure that those processes are still issuing I/Os
++ * within the mean seek distance. If not, it may be time to
++ * break the queues apart again.
++ */
++ if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
++ bfq_mark_bfqq_split_coop(bfqq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * Overloading budget_timeout field to store the time
++ * at which the queue remains with no backlog; used by
++ * the weight-raising mechanism.
++ */
++ bfqq->budget_timeout = jiffies;
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ } else {
++ bfq_activate_bfqq(bfqd, bfqq);
++ /*
++ * Resort priority tree of potential close cooperators.
++ */
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++}
++
++/**
++ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
++ * @bfqd: device data.
++ * @bfqq: queue to update.
++ * @reason: reason for expiration.
++ *
++ * Handle the feedback on @bfqq budget. See the body for detailed
++ * comments.
++ */
++static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ enum bfqq_expiration reason)
++{
++ struct request *next_rq;
++ unsigned long budget, min_budget;
++
++ budget = bfqq->max_budget;
++ min_budget = bfq_min_budget(bfqd);
++
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
++ bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
++ budget, bfq_min_budget(bfqd));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
++ bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
++
++ if (bfq_bfqq_sync(bfqq)) {
++ switch (reason) {
++ /*
++ * Caveat: in all the following cases we trade latency
++ * for throughput.
++ */
++ case BFQ_BFQQ_TOO_IDLE:
++ /*
++ * This is the only case where we may reduce
++ * the budget: if there is no request of the
++ * process still waiting for completion, then
++ * we assume (tentatively) that the timer has
++ * expired because the batch of requests of
++ * the process could have been served with a
++ * smaller budget. Hence, betting that
++ * process will behave in the same way when it
++ * becomes backlogged again, we reduce its
++ * next budget. As long as we guess right,
++ * this budget cut reduces the latency
++ * experienced by the process.
++ *
++ * However, if there are still outstanding
++ * requests, then the process may have not yet
++ * issued its next request just because it is
++ * still waiting for the completion of some of
++ * the still outstanding ones. So in this
++ * subcase we do not reduce its budget, on the
++ * contrary we increase it to possibly boost
++ * the throughput, as discussed in the
++ * comments to the BUDGET_TIMEOUT case.
++ */
++ if (bfqq->dispatched > 0) /* still outstanding reqs */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ else {
++ if (budget > 5 * min_budget)
++ budget -= 4 * min_budget;
++ else
++ budget = min_budget;
++ }
++ break;
++ case BFQ_BFQQ_BUDGET_TIMEOUT:
++ /*
++ * We double the budget here because: 1) it
++ * gives the chance to boost the throughput if
++ * this is not a seeky process (which may have
++ * bumped into this timeout because of, e.g.,
++ * ZBR), 2) together with charge_full_budget
++ * it helps give seeky processes higher
++ * timestamps, and hence be served less
++ * frequently.
++ */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_BUDGET_EXHAUSTED:
++ /*
++ * The process still has backlog, and did not
++ * let either the budget timeout or the disk
++ * idling timeout expire. Hence it is not
++ * seeky, has a short thinktime and may be
++ * happy with a higher budget too. So
++ * definitely increase the budget of this good
++ * candidate to boost the disk throughput.
++ */
++ budget = min(budget * 4, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_NO_MORE_REQUESTS:
++ /*
++ * Leave the budget unchanged.
++ */
++ default:
++ return;
++ }
++ } else /* async queue */
++ /* async queues always get the maximum possible budget
++ * (their ability to dispatch is limited by
++ * @bfqd->bfq_max_budget_async_rq).
++ */
++ budget = bfqd->bfq_max_budget;
++
++ bfqq->max_budget = budget;
++
++ if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
++ bfqq->max_budget > bfqd->bfq_max_budget)
++ bfqq->max_budget = bfqd->bfq_max_budget;
++
++ /*
++ * Make sure that we have enough budget for the next request.
++ * Since the finish time of the bfqq must be kept in sync with
++ * the budget, be sure to call __bfq_bfqq_expire() after the
++ * update.
++ */
++ next_rq = bfqq->next_rq;
++ if (next_rq != NULL)
++ bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ else
++ bfqq->entity.budget = bfqq->max_budget;
++
++ bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
++ next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
++ bfqq->entity.budget);
++}
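The budget feedback for sync queues above boils down to a few multiplicative rules. A minimal user-space sketch of those rules (illustrative names; the kernel clamps with min() against bfqd->bfq_max_budget, which the ternaries below emulate):

```c
#include <assert.h>

/* Illustrative expiration reasons, mirroring the cases handled above. */
enum reason { TOO_IDLE, BUDGET_TIMEOUT, BUDGET_EXHAUSTED };

/* Sketch of the sync-queue budget feedback: shrink after an idle
 * timeout with no outstanding requests, double on a budget timeout
 * (or when requests are still in flight), quadruple when the budget
 * was exhausted; always clamp to the device-wide maximum. */
static unsigned long feedback(unsigned long budget, unsigned long max_budget,
                              unsigned long min_budget, int dispatched,
                              enum reason r)
{
    switch (r) {
    case TOO_IDLE:
        if (dispatched > 0)            /* still outstanding requests */
            return budget * 2 > max_budget ? max_budget : budget * 2;
        return budget > 5 * min_budget ? budget - 4 * min_budget
                                       : min_budget;
    case BUDGET_TIMEOUT:
        return budget * 2 > max_budget ? max_budget : budget * 2;
    case BUDGET_EXHAUSTED:
        return budget * 4 > max_budget ? max_budget : budget * 4;
    }
    return budget;
}
```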
++
++static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
++{
++ unsigned long max_budget;
++
++ /*
++ * The max_budget calculated when autotuning is equal to the
++ * number of sectors transferred in timeout_sync at the
++ * estimated peak rate.
++ */
++ max_budget = (unsigned long)(peak_rate * 1000 *
++ timeout >> BFQ_RATE_SHIFT);
++
++ return max_budget;
++}
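The autotuned budget above is the number of sectors the device can transfer, at the estimated fixed-point peak rate, within the (millisecond) timeout. A sketch under the assumption that RATE_SHIFT plays the role of BFQ_RATE_SHIFT and that the peak rate is stored in sectors per microsecond, fixed point:

```c
#include <assert.h>
#include <stdint.h>

#define RATE_SHIFT 16  /* assumption: stands in for BFQ_RATE_SHIFT */

/* Illustrative sketch: sectors = rate (fixed point, sectors/us)
 * * timeout_ms * 1000 (ms -> us), shifted back out of fixed point. */
static unsigned long calc_max_budget(uint64_t peak_rate_fp,
                                     uint64_t timeout_ms)
{
    return (unsigned long)((peak_rate_fp * 1000 * timeout_ms)
                           >> RATE_SHIFT);
}
```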
++
++/*
++ * In addition to updating the peak rate, checks whether the process
++ * is "slow", and returns 1 if so. This slow flag is used, in addition
++ * to the budget timeout, to reduce the amount of service provided to
++ * seeky processes, and hence reduce their chances to lower the
++ * throughput. See the code for more details.
++ */
++static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int compensate, enum bfqq_expiration reason)
++{
++ u64 bw, usecs, expected, timeout;
++ ktime_t delta;
++ int update = 0;
++
++ if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
++ return 0;
++
++ if (compensate)
++ delta = bfqd->last_idling_start;
++ else
++ delta = ktime_get();
++ delta = ktime_sub(delta, bfqd->last_budget_start);
++ usecs = ktime_to_us(delta);
++
++ /* Don't trust short/unrealistic values. */
++ if (usecs < 100 || usecs >= LONG_MAX)
++ return 0;
++
++ /*
++ * Calculate the bandwidth for the last slice. We use a 64 bit
++ * value to store the peak rate, in sectors per usec in fixed
++ * point math. We do so to have enough precision in the estimate
++ * and to avoid overflows.
++ */
++ bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
++ do_div(bw, (unsigned long)usecs);
++
++ timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ /*
++ * Use only long (> 20ms) intervals to filter out spikes for
++ * the peak rate estimation.
++ */
++ if (usecs > 20000) {
++ if (bw > bfqd->peak_rate ||
++ (!BFQQ_SEEKY(bfqq) &&
++ reason == BFQ_BFQQ_BUDGET_TIMEOUT)) {
++ bfq_log(bfqd, "measured bw =%llu", bw);
++ /*
++ * To smooth oscillations use a low-pass filter with
++ * alpha=7/8, i.e.,
++ * new_rate = (7/8) * old_rate + (1/8) * bw
++ */
++ do_div(bw, 8);
++ if (bw == 0)
++ return 0;
++ bfqd->peak_rate *= 7;
++ do_div(bfqd->peak_rate, 8);
++ bfqd->peak_rate += bw;
++ update = 1;
++ bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate);
++ }
++
++ update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
++
++ if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
++ bfqd->peak_rate_samples++;
++
++ if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
++ update) {
++ int dev_type = blk_queue_nonrot(bfqd->queue);
++ if (bfqd->bfq_user_max_budget == 0) {
++ bfqd->bfq_max_budget =
++ bfq_calc_max_budget(bfqd->peak_rate,
++ timeout);
++ bfq_log(bfqd, "new max_budget=%lu",
++ bfqd->bfq_max_budget);
++ }
++ if (bfqd->device_speed == BFQ_BFQD_FAST &&
++ bfqd->peak_rate < device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_SLOW;
++ bfqd->RT_prod = R_slow[dev_type] *
++ T_slow[dev_type];
++ } else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
++ bfqd->peak_rate > device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_FAST;
++ bfqd->RT_prod = R_fast[dev_type] *
++ T_fast[dev_type];
++ }
++ }
++ }
++
++ /*
++	 * If the process has been served for too short a time
++	 * interval to let its possible sequential accesses prevail
++	 * over the initial seek time needed to move the disk head to
++	 * the first sector it requested, then give the process a
++	 * chance and, for the moment, return false.
++ */
++ if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8)
++ return 0;
++
++ /*
++ * A process is considered ``slow'' (i.e., seeky, so that we
++ * cannot treat it fairly in the service domain, as it would
++ * slow down too much the other processes) if, when a slice
++ * ends for whatever reason, it has received service at a
++ * rate that would not be high enough to complete the budget
++ * before the budget timeout expiration.
++ */
++ expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
++
++ /*
++ * Caveat: processes doing IO in the slower disk zones will
++ * tend to be slow(er) even if not seeky. And the estimated
++ * peak rate will actually be an average over the disk
++ * surface. Hence, to not be too harsh with unlucky processes,
++ * we keep a budget/3 margin of safety before declaring a
++ * process slow.
++ */
++ return expected > (4 * bfqq->entity.budget) / 3;
++}
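As an aside from the patch itself, the low-pass filter described in the comments above can be modelled in a few lines of plain C. This is an illustrative sketch, not the kernel code: the function name is an assumption, and the fixed-point unit (sectors per usec, shifted by BFQ_RATE_SHIFT) is only implied by the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model of the peak-rate low-pass filter above:
 *   new_rate = (7/8) * old_rate + (1/8) * sample
 * using the same integer divisions as the patch (sample / 8,
 * then old_rate * 7 / 8, then the sum).
 */
static uint64_t filter_peak_rate(uint64_t old_rate, uint64_t sample)
{
	return old_rate * 7 / 8 + sample / 8;
}
```

A steady stream of identical samples is a fixed point of the filter, while a single spike moves the estimate by only one eighth of its amplitude, which is what smooths out oscillations.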
++
++/*
++ * To be deemed as soft real-time, an application must meet two
++ * requirements. First, the application must not require an average
++ * bandwidth higher than the approximate bandwidth required to play back
++ * or record a compressed high-definition video.
++ * The next function is invoked on the completion of the last request of a
++ * batch, to compute the next-start time instant, soft_rt_next_start, such
++ * that, if the next request of the application does not arrive before
++ * soft_rt_next_start, then the above requirement on the bandwidth is met.
++ *
++ * The second requirement is that the request pattern of the application is
++ * isochronous, i.e., that, after issuing a request or a batch of requests,
++ * the application stops issuing new requests until all its pending requests
++ * have been completed. After that, the application may issue a new batch,
++ * and so on.
++ * For this reason the next function is invoked to compute
++ * soft_rt_next_start only for applications that meet this requirement,
++ * whereas soft_rt_next_start is set to infinity for applications that do
++ * not.
++ *
++ * Unfortunately, even a greedy application may happen to behave in an
++ * isochronous way if the CPU load is high. In fact, the application may
++ * stop issuing requests while the CPUs are busy serving other processes,
++ * then restart, then stop again for a while, and so on. In addition, if
++ * the disk achieves a low enough throughput with the request pattern
++ * issued by the application (e.g., because the request pattern is random
++ * and/or the device is slow), then the application may meet the above
++ * bandwidth requirement too. To prevent such a greedy application from
++ * being deemed soft real-time, a further rule is used in the computation of
++ * soft_rt_next_start: soft_rt_next_start must be higher than the current
++ * time plus the maximum time for which the arrival of a request is waited
++ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
++ * This filters out greedy applications, as the latter issue instead their
++ * next request as soon as possible after the last one has been completed
++ * (in contrast, when a batch of requests is completed, a soft real-time
++ * application spends some time processing data).
++ *
++ * Unfortunately, the last filter may easily generate false positives if
++ * only bfqd->bfq_slice_idle is used as a reference time interval and one
++ * or both the following cases occur:
++ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
++ * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
++ * HZ=100.
++ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
++ * for a while, then suddenly 'jump' by several units to recover the lost
++ * increments. This seems to happen, e.g., inside virtual machines.
++ * To address this issue, we do not use as a reference time interval just
++ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
++ * particular we add the minimum number of jiffies for which the filter
++ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
++ * machines.
++ */
++static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ return max(bfqq->last_idle_bklogged +
++ HZ * bfqq->service_from_backlogged /
++ bfqd->bfq_wr_max_softrt_rate,
++ jiffies + bfqq->bfqd->bfq_slice_idle + 4);
++}
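Stepping outside the patch for a moment, the two bounds that bfq_bfqq_softrt_next_start() combines can be modelled with ordinary integers. The parameter names below are illustrative stand-ins for the bfqd/bfqq fields, and HZ is passed in explicitly so the sketch is self-contained.

```c
#include <assert.h>

/*
 * next_start is the later of:
 *  - the bandwidth bound: the instant at which the service received
 *    while backlogged, replayed at the soft real-time rate
 *    (sectors/sec), would complete;
 *  - the greediness bound: now + slice_idle + a few jiffies.
 */
static unsigned long softrt_next_start(unsigned long last_idle_bklogged,
				       unsigned long service_from_backlogged,
				       unsigned long max_softrt_rate,
				       unsigned long now,
				       unsigned long slice_idle,
				       unsigned long hz)
{
	unsigned long bw_bound = last_idle_bklogged +
		hz * service_from_backlogged / max_softrt_rate;
	unsigned long greedy_bound = now + slice_idle + 4;

	return bw_bound > greedy_bound ? bw_bound : greedy_bound;
}
```

Whichever bound is later wins: a fast application is held back by the bandwidth bound, while a greedy one that stops issuing requests only because the CPU is busy is filtered by the slice_idle-based bound.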
++
++/*
++ * Return the largest-possible time instant such that, for as long as possible,
++ * the current time will be lower than this time instant according to the macro
++ * time_is_before_jiffies().
++ */
++static inline unsigned long bfq_infinity_from_now(unsigned long now)
++{
++ return now + ULONG_MAX / 2;
++}
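To see why now + ULONG_MAX / 2 behaves as infinity for time_is_before_jiffies(), recall that jiffies comparisons are done in wraparound-safe signed arithmetic. Here is a minimal model of that idiom; the real macros live in the kernel's <linux/jiffies.h>, and the names below are only for this sketch.

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-safe "a is before b", like the kernel's time_before(). */
static int model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

static unsigned long infinity_from_now(unsigned long now)
{
	return now + ULONG_MAX / 2;
}
```

Every instant up to ULONG_MAX / 2 - 1 ticks after now compares as before this pseudo-infinity, which is the widest window the signed comparison allows before it would wrap and invert.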
++
++/**
++ * bfq_bfqq_expire - expire a queue.
++ * @bfqd: device owning the queue.
++ * @bfqq: the queue to expire.
++ * @compensate: if true, compensate for the time spent idling.
++ * @reason: the reason causing the expiration.
++ *
++ * If the process associated to the queue is slow (i.e., seeky), or in
++ * case of budget timeout, or, finally, if it is async, we
++ * artificially charge it an entire budget (independently of the
++ * actual service it received). As a consequence, the queue will get
++ * higher timestamps than the correct ones upon reactivation, and
++ * hence it will be rescheduled as if it had received more service
++ * than what it actually received. In the end, this class of processes
++ * will receive less service in proportion to how slowly they consume
++ * their budgets (and hence how seriously they tend to lower the
++ * throughput).
++ *
++ * In contrast, when a queue expires because it has been idling for
++ * too long or because it has exhausted its budget, we do not touch the
++ * amount of service it has received. Hence when the queue will be
++ * reactivated and its timestamps updated, the latter will be in sync
++ * with the actual service received by the queue until expiration.
++ *
++ * Charging a full budget to the first type of queues and the exact
++ * service to the others has the effect of using the WF2Q+ policy to
++ * schedule the former on a timeslice basis, without violating the
++ * service domain guarantees of the latter.
++ */
++static void bfq_bfqq_expire(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ int compensate,
++ enum bfqq_expiration reason)
++{
++ int slow;
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ /* Update disk peak rate for autotuning and check whether the
++ * process is slow (see bfq_update_peak_rate).
++ */
++ slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason);
++
++ /*
++ * As above explained, 'punish' slow (i.e., seeky), timed-out
++ * and async queues, to favor sequential sync workloads.
++ *
++ * Processes doing I/O in the slower disk zones will tend to be
++ * slow(er) even if not seeky. Hence, since the estimated peak
++ * rate is actually an average over the disk surface, these
++ * processes may timeout just for bad luck. To avoid punishing
++ * them we do not charge a full budget to a process that
++ * succeeded in consuming at least 2/3 of its budget.
++ */
++ if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3))
++ bfq_bfqq_charge_full_budget(bfqq);
++
++ bfqq->service_from_backlogged += bfqq->entity.service;
++
++ if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ !bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_mark_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++
++ if (reason == BFQ_BFQQ_TOO_IDLE &&
++ bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
++ bfq_clear_bfqq_IO_bound(bfqq);
++
++ if (bfqd->low_latency && bfqq->wr_coeff == 1)
++ bfqq->last_wr_start_finish = jiffies;
++
++ if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * If we get here, and there are no outstanding requests,
++ * then the request pattern is isochronous (see the comments
++ * to the function bfq_bfqq_softrt_next_start()). Hence we
++ * can compute soft_rt_next_start. If, instead, the queue
++ * still has outstanding requests, then we have to wait
++ * for the completion of all the outstanding requests to
++ * discover whether the request pattern is actually
++ * isochronous.
++ */
++ if (bfqq->dispatched == 0)
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++ else {
++ /*
++ * The application is still waiting for the
++ * completion of one or more requests:
++ * prevent it from possibly being incorrectly
++ * deemed as soft real-time by setting its
++ * soft_rt_next_start to infinity. In fact,
++ * without this assignment, the application
++ * would be incorrectly deemed as soft
++ * real-time if:
++ * 1) it issued a new request before the
++ * completion of all its in-flight
++ * requests, and
++ * 2) at that time, its soft_rt_next_start
++ * happened to be in the past.
++ */
++ bfqq->soft_rt_next_start =
++ bfq_infinity_from_now(jiffies);
++ /*
++ * Schedule an update of soft_rt_next_start to when
++ * the task may be discovered to be isochronous.
++ */
++ bfq_mark_bfqq_softrt_update(bfqq);
++ }
++ }
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
++ slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
++
++ /*
++ * Increase, decrease or leave budget unchanged according to
++ * reason.
++ */
++ __bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
++ __bfq_bfqq_expire(bfqd, bfqq);
++}
++
++/*
++ * Budget timeout is not implemented through a dedicated timer, but
++ * just checked on request arrivals and completions, as well as on
++ * idle timer expirations.
++ */
++static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_budget_new(bfqq) ||
++ time_before(jiffies, bfqq->budget_timeout))
++ return 0;
++ return 1;
++}
++
++/*
++ * If we expire a queue that is waiting for the arrival of a new
++ * request, we may prevent the fictitious timestamp back-shifting that
++ * allows the guarantees of the queue to be preserved (see [1] for
++ * this tricky aspect). Hence we return true only if this condition
++ * does not hold, or if the queue is slow enough to deserve only to be
++ * kicked off for preserving a high throughput.
++ */
++static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "may_budget_timeout: wait_request %d left %d timeout %d",
++ bfq_bfqq_wait_request(bfqq),
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3,
++ bfq_bfqq_budget_timeout(bfqq));
++
++ return (!bfq_bfqq_wait_request(bfqq) ||
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)
++ &&
++ bfq_bfqq_budget_timeout(bfqq);
++}
++
++/*
++ * Device idling is allowed only for the queues for which this function
++ * returns true. For this reason, the return value of this function plays a
++ * critical role for both throughput boosting and service guarantees. The
++ * return value is computed through a logical expression. In this rather
++ * long comment, we try to briefly describe all the details and motivations
++ * behind the components of this logical expression.
++ *
++ * First, the expression may be true only for sync queues. Besides, if
++ * bfqq is also being weight-raised, then the expression always evaluates
++ * to true, as device idling is instrumental for preserving low-latency
++ * guarantees (see [1]). Otherwise, the expression evaluates to true only
++ * if bfqq has a non-null idle window and at least one of the following
++ * two conditions holds. The first condition is that the device is not
++ * performing NCQ, because idling the device most certainly boosts the
++ * throughput if this condition holds and bfqq has been granted a non-null
++ * idle window. The second compound condition is made of the logical AND of
++ * two components.
++ *
++ * The first component is true only if there is no weight-raised busy
++ * queue. This guarantees that the device is not idled for a sync non-
++ * weight-raised queue when there are busy weight-raised queues. The former
++ * is then expired immediately if empty. Combined with the timestamping
++ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
++ * queues to get a lower number of requests served, and hence to ask for a
++ * lower number of requests from the request pool, before the busy weight-
++ * raised queues get served again.
++ *
++ * This is beneficial for the processes associated with weight-raised
++ * queues, when the request pool is saturated (e.g., in the presence of
++ * write hogs). In fact, if the processes associated with the other queues
++ * ask for requests at a lower rate, then weight-raised processes have a
++ * higher probability to get a request from the pool immediately (or at
++ * least soon) when they need one. Hence they have a higher probability to
++ * actually get a fraction of the disk throughput proportional to their
++ * high weight. This is especially true with NCQ-capable drives, which
++ * enqueue several requests in advance and further reorder internally-
++ * queued requests.
++ *
++ * In the end, mistreating non-weight-raised queues when there are busy
++ * weight-raised queues seems to mitigate starvation problems in the
++ * presence of heavy write workloads and NCQ, and hence to guarantee a
++ * higher application and system responsiveness in these hostile scenarios.
++ *
++ * If the first component of the compound condition is instead true, i.e.,
++ * there is no weight-raised busy queue, then the second component of the
++ * compound condition takes into account service-guarantee and throughput
++ * issues related to NCQ (recall that the compound condition is evaluated
++ * only if the device is detected as supporting NCQ).
++ *
++ * As for service guarantees, allowing the drive to enqueue more than one
++ * request at a time, and hence delegating de facto final scheduling
++ * decisions to the drive's internal scheduler, causes loss of control on
++ * the actual request service order. In this respect, when the drive is
++ * allowed to enqueue more than one request at a time, the service
++ * distribution enforced by the drive's internal scheduler is likely to
++ * coincide with the desired device-throughput distribution only in the
++ * following, perfectly symmetric, scenario:
++ * 1) all active queues have the same weight,
++ * 2) all active groups at the same level in the groups tree have the same
++ * weight,
++ * 3) all active groups at the same level in the groups tree have the same
++ * number of children.
++ *
++ * Even in such a scenario, sequential I/O may still receive a preferential
++ * treatment, but this is not likely to be a big issue with flash-based
++ * devices, because of their non-dramatic loss of throughput with random
++ * I/O. Things do differ with HDDs, for which additional care is taken, as
++ * explained after completing the discussion for flash-based devices.
++ *
++ * Unfortunately, keeping the necessary state for evaluating exactly the
++ * above symmetry conditions would be quite complex and time-consuming.
++ * Therefore BFQ evaluates instead the following stronger sub-conditions,
++ * for which it is much easier to maintain the needed state:
++ * 1) all active queues have the same weight,
++ * 2) all active groups have the same weight,
++ * 3) all active groups have at most one active child each.
++ * In particular, the last two conditions are always true if hierarchical
++ * support and the cgroups interface are not enabled, hence no state needs
++ * to be maintained in this case.
++ *
++ * According to the above considerations, the second component of the
++ * compound condition evaluates to true if any of the above symmetry
++ * sub-conditions does not hold, or the device is not flash-based. Therefore,
++ * if also the first component is true, then idling is allowed for a sync
++ * queue. These are the only sub-conditions considered if the device is
++ * flash-based, as, for such a device, it is sensible to force idling only
++ * for service-guarantee issues. In fact, as for throughput, idling
++ * NCQ-capable flash-based devices would not boost the throughput even
++ * with sequential I/O; rather it would lower the throughput in proportion
++ * to how fast the device is. In the end, (only) if all the three
++ * sub-conditions hold and the device is flash-based, the compound
++ * condition evaluates to false and therefore no idling is performed.
++ *
++ * As already said, things change with a rotational device, where idling
++ * boosts the throughput with sequential I/O (even with NCQ). Hence, for
++ * such a device the second component of the compound condition evaluates
++ * to true also if the following additional sub-condition does not hold:
++ * the queue is constantly seeky. Unfortunately, this different behavior
++ * with respect to flash-based devices causes an additional asymmetry: if
++ * some sync queues enjoy idling and some other sync queues do not, then
++ * the latter get a low share of the device throughput, simply because the
++ * former get many requests served after being set as in service, whereas
++ * the latter do not. As a consequence, to guarantee the desired throughput
++ * distribution, on HDDs the compound expression evaluates to true (and
++ * hence device idling is performed) also if the following last symmetry
++ * condition does not hold: no other queue is benefiting from idling. Also
++ * this last condition is actually replaced with a simpler-to-maintain and
++ * stronger condition: there is no busy queue which is not constantly seeky
++ * (and hence may also benefit from idling).
++ *
++ * To sum up, when all the required symmetry and throughput-boosting
++ * sub-conditions hold, the second component of the compound condition
++ * evaluates to false, and hence no idling is performed. This helps to
++ * keep the drives' internal queues full on NCQ-capable devices, and hence
++ * to boost the throughput, without causing 'almost' any loss of service
++ * guarantees. The 'almost' follows from the fact that, if the internal
++ * queue of one such device is filled while all the sub-conditions hold,
++ * but at some point in time some sub-condition stops to hold, then it may
++ * become impossible to let requests be served in the new desired order
++ * until all the requests already queued in the device have been served.
++ */
++static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++#define symmetric_scenario (!bfqd->active_numerous_groups && \
++ !bfq_differentiated_weights(bfqd))
++#else
++#define symmetric_scenario (!bfq_differentiated_weights(bfqd))
++#endif
++#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
++ bfqd->busy_in_flight_queues == \
++ bfqd->const_seeky_busy_in_flight_queues)
++/*
++ * Condition for expiring a non-weight-raised queue (and hence not idling
++ * the device).
++ */
++#define cond_for_expiring_non_wr (bfqd->hw_tag && \
++ (bfqd->wr_busy_queues > 0 || \
++ (symmetric_scenario && \
++ (blk_queue_nonrot(bfqd->queue) || \
++ cond_for_seeky_on_ncq_hdd))))
++
++ return bfq_bfqq_sync(bfqq) &&
++ (bfq_bfqq_IO_bound(bfqq) || bfqq->wr_coeff > 1) &&
++ (bfqq->wr_coeff > 1 ||
++ (bfq_bfqq_idle_window(bfqq) &&
++ !cond_for_expiring_non_wr)
++ );
++}
++
++/*
++ * If the in-service queue is empty but sync, and the function
++ * bfq_bfqq_must_not_expire returns true, then:
++ * 1) the queue must remain in service and cannot be expired, and
++ * 2) the disk must be idled to wait for the possible arrival of a new
++ * request for the queue.
++ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
++ * why performing device idling is the best choice to boost the throughput
++ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
++ * returns true.
++ */
++static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
++ bfq_bfqq_must_not_expire(bfqq);
++}
++
++/*
++ * Select a queue for service. If we have a current queue in service,
++ * check whether to continue servicing it, or retrieve and set a new one.
++ */
++static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct request *next_rq;
++ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq == NULL)
++ goto new_queue;
++
++ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
++
++ /*
++ * If another queue has a request waiting within our mean seek
++ * distance, let it run. The expire code will check for close
++ * cooperators and put the close queue at the front of the
++ * service tree. If possible, merge the expiring queue with the
++ * new bfqq.
++ */
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq);
++ if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
++ bfq_setup_merge(bfqq, new_bfqq);
++
++ if (bfq_may_expire_for_budg_timeout(bfqq) &&
++ !timer_pending(&bfqd->idle_slice_timer) &&
++ !bfq_bfqq_must_idle(bfqq))
++ goto expire;
++
++ next_rq = bfqq->next_rq;
++ /*
++ * If bfqq has requests queued and it has enough budget left to
++ * serve them, keep the queue, otherwise expire it.
++ */
++ if (next_rq != NULL) {
++ if (bfq_serv_to_charge(next_rq, bfqq) >
++ bfq_bfqq_budget_left(bfqq)) {
++ reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
++ goto expire;
++ } else {
++ /*
++ * The idle timer may be pending because we may
++ * not disable disk idling even when a new request
++ * arrives.
++ */
++ if (timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * If we get here: 1) at least a new request
++ * has arrived but we have not disabled the
++ * timer because the request was too small,
++ * 2) then the block layer has unplugged
++ * the device, causing the dispatch to be
++ * invoked.
++ *
++ * Since the device is unplugged, now the
++ * requests are probably large enough to
++ * provide a reasonable throughput.
++ * So we disable idling.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++ if (new_bfqq == NULL)
++ goto keep_queue;
++ else
++ goto expire;
++ }
++ }
++
++ /*
++ * No requests pending. If the in-service queue still has requests
++ * in flight (possibly waiting for a completion) or is idling for a
++ * new request, then keep it.
++ */
++ if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ bfqq = NULL;
++ goto keep_queue;
++ } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * We are expiring the queue because there is a close
++ * cooperator: cancel the idle timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++
++ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
++new_queue:
++ bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfq_log(bfqd, "select_queue: new queue %d returned",
++ bfqq != NULL ? bfqq->pid : 0);
++keep_queue:
++ return bfqq;
++}
++
++static void bfq_update_wr_data(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq->wr_coeff > 1) { /* queue is being boosted */
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time),
++ bfqq->wr_coeff,
++ bfqq->entity.weight, bfqq->entity.orig_weight);
++
++ BUG_ON(bfqq != bfqd->in_service_queue && entity->weight !=
++ entity->orig_weight * bfqq->wr_coeff);
++ if (entity->ioprio_changed)
++ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++ /*
++ * If too much time has elapsed from the beginning
++ * of this weight-raising, stop it.
++ */
++ if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time)) {
++ bfqq->last_wr_start_finish = jiffies;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ bfqq->last_wr_start_finish,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ bfq_bfqq_end_wr(bfqq);
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
++ }
++ }
++}
++
++/*
++ * Dispatch one request from bfqq, moving it to the request queue
++ * dispatch list.
++ */
++static int bfq_dispatch_request(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++ struct request *rq;
++ unsigned long service_to_charge;
++
++ BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Follow expired path, else get first next available. */
++ rq = bfq_check_fifo(bfqq);
++ if (rq == NULL)
++ rq = bfqq->next_rq;
++ service_to_charge = bfq_serv_to_charge(rq, bfqq);
++
++ if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
++ /*
++ * This may happen if the next rq is chosen in fifo order
++ * instead of sector order. The budget is properly
++ * dimensioned to be always sufficient to serve the next
++ * request only if it is chosen in sector order. The reason
++ * is that it would be quite inefficient, and of little use,
++ * to always make sure that the budget is large enough to
++ * serve even the possible next rq in fifo order.
++ * In fact, requests are seldom served in fifo order.
++ *
++ * Expire the queue for budget exhaustion, and make sure
++ * that the next act_budget is enough to serve the next
++ * request, even if it comes from the fifo expired path.
++ */
++ bfqq->next_rq = rq;
++ /*
++ * Since this dispatch failed, make sure that a new
++ * one will be performed.
++ */
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++ goto expire;
++ }
++
++ /* Finally, insert request into driver dispatch list. */
++ bfq_bfqq_served(bfqq, service_to_charge);
++ bfq_dispatch_insert(bfqd->queue, rq);
++
++ bfq_update_wr_data(bfqd, bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "dispatched %u sec req (%llu), budg left %lu",
++ blk_rq_sectors(rq),
++ (unsigned long long)blk_rq_pos(rq),
++ bfq_bfqq_budget_left(bfqq));
++
++ dispatched++;
++
++ if (bfqd->in_service_bic == NULL) {
++ atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
++ bfqd->in_service_bic = RQ_BIC(rq);
++ }
++
++ if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
++ dispatched >= bfqd->bfq_max_budget_async_rq) ||
++ bfq_class_idle(bfqq)))
++ goto expire;
++
++ return dispatched;
++
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
++ return dispatched;
++}
++
++static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++
++ while (bfqq->next_rq != NULL) {
++ bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
++ dispatched++;
++ }
++
++ BUG_ON(!list_empty(&bfqq->fifo));
++ return dispatched;
++}
++
++/*
++ * Drain our current requests.
++ * Used for barriers and when switching io schedulers on-the-fly.
++ */
++static int bfq_forced_dispatch(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *n;
++ struct bfq_service_tree *st;
++ int dispatched = 0;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq != NULL)
++ __bfq_bfqq_expire(bfqd, bfqq);
++
++ /*
++ * Loop through classes, and be careful to leave the scheduler
++ * in a consistent state, as feedback mechanisms and vtime
++ * updates cannot be disabled during the process.
++ */
++ list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
++ st = bfq_entity_service_tree(&bfqq->entity);
++
++ dispatched += __bfq_forced_dispatch_bfqq(bfqq);
++ bfqq->max_budget = bfq_max_budget(bfqd);
++
++ bfq_forget_idle(st);
++ }
++
++ BUG_ON(bfqd->busy_queues != 0);
++
++ return dispatched;
++}
++
++static int bfq_dispatch_requests(struct request_queue *q, int force)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq;
++ int max_dispatch;
++
++ bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
++ if (bfqd->busy_queues == 0)
++ return 0;
++
++ if (unlikely(force))
++ return bfq_forced_dispatch(bfqd);
++
++ bfqq = bfq_select_queue(bfqd);
++ if (bfqq == NULL)
++ return 0;
++
++ max_dispatch = bfqd->bfq_quantum;
++ if (bfq_class_idle(bfqq))
++ max_dispatch = 1;
++
++ if (!bfq_bfqq_sync(bfqq))
++ max_dispatch = bfqd->bfq_max_budget_async_rq;
++
++ if (bfqq->dispatched >= max_dispatch) {
++ if (bfqd->busy_queues > 1)
++ return 0;
++ if (bfqq->dispatched >= 4 * max_dispatch)
++ return 0;
++ }
++
++ if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
++ return 0;
++
++ bfq_clear_bfqq_wait_request(bfqq);
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ if (!bfq_dispatch_request(bfqd, bfqq))
++ return 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
++ bfqq->pid, max_dispatch);
++
++ return 1;
++}
++
++/*
++ * Task holds one reference to the queue, dropped when task exits. Each rq
++ * in-flight on this queue also holds a reference, dropped when rq is freed.
++ *
++ * Queue lock must be held here.
++ */
++static void bfq_put_queue(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ BUG_ON(atomic_read(&bfqq->ref) <= 0);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
++ atomic_read(&bfqq->ref));
++ if (!atomic_dec_and_test(&bfqq->ref))
++ return;
++
++ BUG_ON(rb_first(&bfqq->sort_list) != NULL);
++ BUG_ON(bfqq->allocated[READ] + bfqq->allocated[WRITE] != 0);
++ BUG_ON(bfqq->entity.tree != NULL);
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqd->in_service_queue == bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
++
++ kmem_cache_free(bfq_pool, bfqq);
++}
++
++static void bfq_put_cooperator(struct bfq_queue *bfqq)
++{
++ struct bfq_queue *__bfqq, *next;
++
++ /*
++ * If this queue was scheduled to merge with another queue, be
++ * sure to drop the reference taken on that queue (and others in
++ * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
++ */
++ __bfqq = bfqq->new_bfqq;
++ while (__bfqq) {
++ if (__bfqq == bfqq)
++ break;
++ next = __bfqq->new_bfqq;
++ bfq_put_queue(__bfqq);
++ __bfqq = next;
++ }
++}
++
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ if (bfqq == bfqd->in_service_queue) {
++ __bfq_bfqq_expire(bfqd, bfqq);
++ bfq_schedule_dispatch(bfqd);
++ }
++
++ bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_init_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++
++ bic->ttime.last_end_request = jiffies;
++}
++
++static void bfq_exit_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++
++ if (bic->bfqq[BLK_RW_ASYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
++ bic->bfqq[BLK_RW_ASYNC] = NULL;
++ }
++
++ if (bic->bfqq[BLK_RW_SYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
++ bic->bfqq[BLK_RW_SYNC] = NULL;
++ }
++}
++
++/*
++ * Update the entity prio values; note that the new values will not
++ * be used until the next (re)activation.
++ */
++static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ struct task_struct *tsk = current;
++ int ioprio_class;
++
++ if (!bfq_bfqq_prio_changed(bfqq))
++ return;
++
++ ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ switch (ioprio_class) {
++ default:
++ dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
++ "bfq: bad prio %x\n", ioprio_class);
++ case IOPRIO_CLASS_NONE:
++ /*
++ * No prio set, inherit CPU scheduling settings.
++ */
++ bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
++ bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
++ break;
++ case IOPRIO_CLASS_RT:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
++ break;
++ case IOPRIO_CLASS_BE:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
++ break;
++ case IOPRIO_CLASS_IDLE:
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
++ bfqq->entity.new_ioprio = 7;
++ bfq_clear_bfqq_idle_window(bfqq);
++ break;
++ }
++
++ bfqq->entity.ioprio_changed = 1;
++
++ bfq_clear_bfqq_prio_changed(bfqq);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd;
++ struct bfq_queue *bfqq, *new_bfqq;
++ struct bfq_group *bfqg;
++ unsigned long uninitialized_var(flags);
++ int ioprio = bic->icq.ioc->ioprio;
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ /*
++ * This condition may trigger on a newly created bic, be sure to
++ * drop the lock before returning.
++ */
++ if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
++ goto out;
++
++ bfqq = bic->bfqq[BLK_RW_ASYNC];
++ if (bfqq != NULL) {
++ bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
++ sched_data);
++ new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
++ GFP_ATOMIC);
++ if (new_bfqq != NULL) {
++ bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
++ bfq_log_bfqq(bfqd, bfqq,
++ "changed_ioprio: bfqq %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++ }
++
++ bfqq = bic->bfqq[BLK_RW_SYNC];
++ if (bfqq != NULL)
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ bic->ioprio = ioprio;
++
++out:
++ bfq_put_bfqd_unlock(bfqd, &flags);
++}
++
++static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ pid_t pid, int is_sync)
++{
++ RB_CLEAR_NODE(&bfqq->entity.rb_node);
++ INIT_LIST_HEAD(&bfqq->fifo);
++
++ atomic_set(&bfqq->ref, 0);
++ bfqq->bfqd = bfqd;
++
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ if (is_sync) {
++ if (!bfq_class_idle(bfqq))
++ bfq_mark_bfqq_idle_window(bfqq);
++ bfq_mark_bfqq_sync(bfqq);
++ }
++ bfq_mark_bfqq_IO_bound(bfqq);
++
++ /* Tentative initial value to trade off between thr and lat */
++ bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
++ bfqq->pid = pid;
++
++ bfqq->wr_coeff = 1;
++ bfqq->last_wr_start_finish = 0;
++ /*
++ * Set to the value for which bfqq will not be deemed as
++ * soft rt when it becomes backlogged.
++ */
++ bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
++}
++
++static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int is_sync,
++ struct bfq_io_cq *bic,
++ gfp_t gfp_mask)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++
++retry:
++ /* bic always exists here */
++ bfqq = bic_to_bfqq(bic, is_sync);
++
++ /*
++ * Always try a new alloc if we fall back to the OOM bfqq
++ * originally, since it should just be a temporary situation.
++ */
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = NULL;
++ if (new_bfqq != NULL) {
++ bfqq = new_bfqq;
++ new_bfqq = NULL;
++ } else if (gfp_mask & __GFP_WAIT) {
++ spin_unlock_irq(bfqd->queue->queue_lock);
++ new_bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ spin_lock_irq(bfqd->queue->queue_lock);
++ if (new_bfqq != NULL)
++ goto retry;
++ } else {
++ bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ }
++
++ if (bfqq != NULL) {
++ bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
++ bfq_log_bfqq(bfqd, bfqq, "allocated");
++ } else {
++ bfqq = &bfqd->oom_bfqq;
++ bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
++ }
++
++ bfq_init_prio_data(bfqq, bic);
++ bfq_init_entity(&bfqq->entity, bfqg);
++ }
++
++ if (new_bfqq != NULL)
++ kmem_cache_free(bfq_pool, new_bfqq);
++
++ return bfqq;
++}
++
++static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int ioprio_class, int ioprio)
++{
++ switch (ioprio_class) {
++ case IOPRIO_CLASS_RT:
++ return &bfqg->async_bfqq[0][ioprio];
++ case IOPRIO_CLASS_NONE:
++ ioprio = IOPRIO_NORM;
++ /* fall through */
++ case IOPRIO_CLASS_BE:
++ return &bfqg->async_bfqq[1][ioprio];
++ case IOPRIO_CLASS_IDLE:
++ return &bfqg->async_idle_bfqq;
++ default:
++ BUG();
++ }
++}
++
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask)
++{
++ const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ struct bfq_queue **async_bfqq = NULL;
++ struct bfq_queue *bfqq = NULL;
++
++ if (!is_sync) {
++ async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
++ ioprio);
++ bfqq = *async_bfqq;
++ }
++
++ if (bfqq == NULL)
++ bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++
++ /*
++ * Pin the queue now that it's allocated, scheduler exit will
++ * prune it.
++ */
++ if (!is_sync && *async_bfqq == NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ *async_bfqq = bfqq;
++ }
++
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++ return bfqq;
++}
++
++static void bfq_update_io_thinktime(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic)
++{
++ unsigned long elapsed = jiffies - bic->ttime.last_end_request;
++ unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
++
++ bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
++ bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
++ bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
++ bic->ttime.ttime_samples;
++}
++
++static void bfq_update_io_seektime(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ sector_t sdist;
++ u64 total;
++
++ if (bfqq->last_request_pos < blk_rq_pos(rq))
++ sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
++ else
++ sdist = bfqq->last_request_pos - blk_rq_pos(rq);
++
++ /*
++ * Don't allow the seek distance to get too large from the
++ * odd fragment, pagein, etc.
++ */
++ if (bfqq->seek_samples == 0) /* first request, not really a seek */
++ sdist = 0;
++ else if (bfqq->seek_samples <= 60) /* second & third seek */
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
++ else
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
++
++ bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
++ bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
++ total = bfqq->seek_total + (bfqq->seek_samples/2);
++ do_div(total, bfqq->seek_samples);
++ bfqq->seek_mean = (sector_t)total;
++
++ bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
++ (u64)bfqq->seek_mean);
++}
++
++/*
++ * Disable idle window if the process thinks too long or seeks so much that
++ * it doesn't matter.
++ */
++static void bfq_update_idle_window(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_io_cq *bic)
++{
++ int enable_idle;
++
++ /* Don't idle for async or idle io prio class. */
++ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
++ return;
++
++ enable_idle = bfq_bfqq_idle_window(bfqq);
++
++ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
++ bfqd->bfq_slice_idle == 0 ||
++ (bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
++ bfqq->wr_coeff == 1))
++ enable_idle = 0;
++ else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
++ if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
++ bfqq->wr_coeff == 1)
++ enable_idle = 0;
++ else
++ enable_idle = 1;
++ }
++ bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
++ enable_idle);
++
++ if (enable_idle)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++}
++
++/*
++ * Called when a new fs request (rq) is added to bfqq. Check if there's
++ * something we should do about it.
++ */
++static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ struct bfq_io_cq *bic = RQ_BIC(rq);
++
++ if (rq->cmd_flags & REQ_META)
++ bfqq->meta_pending++;
++
++ bfq_update_io_thinktime(bfqd, bic);
++ bfq_update_io_seektime(bfqd, bfqq, rq);
++ if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_clear_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
++ !BFQQ_SEEKY(bfqq))
++ bfq_update_idle_window(bfqd, bfqq, bic);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
++ bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
++ (long long unsigned)bfqq->seek_mean);
++
++ bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
++
++ if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
++ int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
++ blk_rq_sectors(rq) < 32;
++ int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
++
++ /*
++ * There is just this request queued: if the request
++ * is small and the queue is not to be expired, then
++ * just exit.
++ *
++ * In this way, if the disk is being idled to wait for
++ * a new request from the in-service queue, we avoid
++ * unplugging the device and committing the disk to serve
++ * just a small request. On the contrary, we wait for
++ * the block layer to decide when to unplug the device:
++ * hopefully, new requests will be merged to this one
++ * quickly, then the device will be unplugged and
++ * larger requests will be dispatched.
++ */
++ if (small_req && !budget_timeout)
++ return;
++
++ /*
++ * A large enough request arrived, or the queue is to
++ * be expired: in both cases disk idling is to be
++ * stopped, so clear wait_request flag and reset
++ * timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++
++ /*
++ * The queue is not empty, because a new request just
++ * arrived. Hence we can safely expire the queue, in
++ * case of budget timeout, without risking that the
++ * timestamps of the queue are not updated correctly.
++ * See [1] for more details.
++ */
++ if (budget_timeout)
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++
++ /*
++ * Let the request rip immediately, or let a new queue be
++ * selected if bfqq has just been expired.
++ */
++ __blk_run_queue(bfqd->queue);
++ }
++}
++
++static void bfq_insert_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++ bfq_init_prio_data(bfqq, RQ_BIC(rq));
++
++ bfq_add_request(rq);
++
++ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
++ list_add_tail(&rq->queuelist, &bfqq->fifo);
++
++ bfq_rq_enqueued(bfqd, bfqq, rq);
++}
++
++static void bfq_update_hw_tag(struct bfq_data *bfqd)
++{
++ bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
++ bfqd->rq_in_driver);
++
++ if (bfqd->hw_tag == 1)
++ return;
++
++ /*
++ * This sample is valid if the number of outstanding requests
++ * is large enough to allow a queueing behavior. Note that the
++ * sum is not exact, as it's not taking into account deactivated
++ * requests.
++ */
++ if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
++ return;
++
++ if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
++ return;
++
++ bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
++ bfqd->max_rq_in_driver = 0;
++ bfqd->hw_tag_samples = 0;
++}
++
++static void bfq_completed_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ bool sync = bfq_bfqq_sync(bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
++ blk_rq_sectors(rq), sync);
++
++ bfq_update_hw_tag(bfqd);
++
++ BUG_ON(!bfqd->rq_in_driver);
++ BUG_ON(!bfqq->dispatched);
++ bfqd->rq_in_driver--;
++ bfqq->dispatched--;
++
++ if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++
++ if (sync) {
++ bfqd->sync_flight--;
++ RQ_BIC(rq)->ttime.last_end_request = jiffies;
++ }
++
++ /*
++ * If we are waiting to discover whether the request pattern of the
++ * task associated with the queue is actually isochronous, and
++ * both requisites for this condition to hold are satisfied, then
++ * compute soft_rt_next_start (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ */
++ if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list))
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++
++ /*
++ * If this is the in-service queue, check if it needs to be expired,
++ * or if we want to idle in case it has no pending requests.
++ */
++ if (bfqd->in_service_queue == bfqq) {
++ if (bfq_bfqq_budget_new(bfqq))
++ bfq_set_budget_timeout(bfqd);
++
++ if (bfq_bfqq_must_idle(bfqq)) {
++ bfq_arm_slice_timer(bfqd);
++ goto out;
++ } else if (bfq_may_expire_for_budg_timeout(bfqq))
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++ else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
++ (bfqq->dispatched == 0 ||
++ !bfq_bfqq_must_not_expire(bfqq)))
++ bfq_bfqq_expire(bfqd, bfqq, 0,
++ BFQ_BFQQ_NO_MORE_REQUESTS);
++ }
++
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++
++out:
++ return;
++}
++
++static inline int __bfq_may_queue(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
++ bfq_clear_bfqq_must_alloc(bfqq);
++ return ELV_MQUEUE_MUST;
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++static int bfq_may_queue(struct request_queue *q, int rw)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Don't force setup of a queue from here, as a call to may_queue
++ * does not necessarily imply that a request actually will be
++ * queued. So just lookup a possibly existing queue, or return
++ * 'may queue' if that fails.
++ */
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return ELV_MQUEUE_MAY;
++
++ bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
++ if (bfqq != NULL) {
++ bfq_init_prio_data(bfqq, bic);
++
++ return __bfq_may_queue(bfqq);
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++/*
++ * Queue lock held here.
++ */
++static void bfq_put_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ if (bfqq != NULL) {
++ const int rw = rq_data_dir(rq);
++
++ BUG_ON(!bfqq->allocated[rw]);
++ bfqq->allocated[rw]--;
++
++ rq->elv.priv[0] = NULL;
++ rq->elv.priv[1] = NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++static struct bfq_queue *
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (long unsigned)bfqq->new_bfqq->pid);
++ bic_set_bfqq(bic, bfqq->new_bfqq, 1);
++ bfq_mark_bfqq_coop(bfqq->new_bfqq);
++ bfq_put_queue(bfqq);
++ return bic_to_bfqq(bic, 1);
++}
++
++/*
++ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
++ * was the last process referring to said bfqq.
++ */
++static struct bfq_queue *
++bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->pid = current->pid;
++ bfq_clear_bfqq_coop(bfqq);
++ bfq_clear_bfqq_split_coop(bfqq);
++ return bfqq;
++ }
++
++ bic_set_bfqq(bic, NULL, 1);
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++ return NULL;
++}
++
++/*
++ * Allocate bfq data structures associated with this request.
++ */
++static int bfq_set_request(struct request_queue *q, struct request *rq,
++ struct bio *bio, gfp_t gfp_mask)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
++ const int rw = rq_data_dir(rq);
++ const int is_sync = rq_is_sync(rq);
++ struct bfq_queue *bfqq;
++ struct bfq_group *bfqg;
++ unsigned long flags;
++
++ might_sleep_if(gfp_mask & __GFP_WAIT);
++
++ bfq_changed_ioprio(bic);
++
++ spin_lock_irqsave(q->queue_lock, flags);
++
++ if (bic == NULL)
++ goto queue_fail;
++
++ bfqg = bfq_bic_update_cgroup(bic);
++
++new_queue:
++ bfqq = bic_to_bfqq(bic, is_sync);
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++ bic_set_bfqq(bic, bfqq, is_sync);
++ } else {
++ /*
++ * If the queue was seeky for too long, break it apart.
++ */
++ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
++ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
++ bfqq = bfq_split_bfqq(bic, bfqq);
++ if (!bfqq)
++ goto new_queue;
++ }
++
++ /*
++ * Check to see if this queue is scheduled to merge with
++ * another closely cooperating queue. The merging of queues
++ * happens here as it must be done in process context.
++ * The reference on new_bfqq was taken in merge_bfqqs.
++ */
++ if (bfqq->new_bfqq != NULL)
++ bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
++ }
++
++ bfqq->allocated[rw]++;
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ rq->elv.priv[0] = bic;
++ rq->elv.priv[1] = bfqq;
++
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 0;
++
++queue_fail:
++ bfq_schedule_dispatch(bfqd);
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 1;
++}
++
++static void bfq_kick_queue(struct work_struct *work)
++{
++ struct bfq_data *bfqd =
++ container_of(work, struct bfq_data, unplug_work);
++ struct request_queue *q = bfqd->queue;
++
++ spin_lock_irq(q->queue_lock);
++ __blk_run_queue(q);
++ spin_unlock_irq(q->queue_lock);
++}
++
++/*
++ * Handler of the expiration of the timer running if the in-service queue
++ * is idling inside its time slice.
++ */
++static void bfq_idle_slice_timer(unsigned long data)
++{
++ struct bfq_data *bfqd = (struct bfq_data *)data;
++ struct bfq_queue *bfqq;
++ unsigned long flags;
++ enum bfqq_expiration reason;
++
++ spin_lock_irqsave(bfqd->queue->queue_lock, flags);
++
++ bfqq = bfqd->in_service_queue;
++ /*
++ * Theoretical race here: the in-service queue can be NULL or
++ * different from the queue that was idling if the timer handler
++ * spins on the queue_lock and a new request arrives for the
++ * current queue and there is a full dispatch cycle that changes
++ * the in-service queue. This can hardly happen, but in the worst
++ * case we just expire a queue too early.
++ */
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
++ if (bfq_bfqq_budget_timeout(bfqq))
++ /*
++ * Also here the queue can be safely expired
++ * for budget timeout without wasting
++ * guarantees
++ */
++ reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++ else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
++ /*
++ * The queue may not be empty upon timer expiration,
++ * because we may not disable the timer when the
++ * first request of the in-service queue arrives
++ * during disk idling.
++ */
++ reason = BFQ_BFQQ_TOO_IDLE;
++ else
++ goto schedule_dispatch;
++
++ bfq_bfqq_expire(bfqd, bfqq, 1, reason);
++ }
++
++schedule_dispatch:
++ bfq_schedule_dispatch(bfqd);
++
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
++}
++
++static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
++{
++ del_timer_sync(&bfqd->idle_slice_timer);
++ cancel_work_sync(&bfqd->unplug_work);
++}
++
++static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
++ struct bfq_queue **bfqq_ptr)
++{
++ struct bfq_group *root_group = bfqd->root_group;
++ struct bfq_queue *bfqq = *bfqq_ptr;
++
++ bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
++ if (bfqq != NULL) {
++ bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
++ bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ *bfqq_ptr = NULL;
++ }
++}
++
++/*
++ * Release all the bfqg references to its async queues. If we are
++ * deallocating the group these queues may still contain requests, so
++ * we reparent them to the root cgroup (i.e., the only one that will
++ * exist for sure until all the requests on a device are gone).
++ */
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
++
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
++}
++
++static void bfq_exit_queue(struct elevator_queue *e)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ struct request_queue *q = bfqd->queue;
++ struct bfq_queue *bfqq, *n;
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ spin_lock_irq(q->queue_lock);
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++ list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++
++ bfq_disconnect_groups(bfqd);
++ spin_unlock_irq(q->queue_lock);
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ synchronize_rcu();
++
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ bfq_free_root_group(bfqd);
++ kfree(bfqd);
++}
++
++static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
++{
++ struct bfq_group *bfqg;
++ struct bfq_data *bfqd;
++ struct elevator_queue *eq;
++
++ eq = elevator_alloc(q, e);
++ if (eq == NULL)
++ return -ENOMEM;
++
++ bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
++ if (bfqd == NULL) {
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++ eq->elevator_data = bfqd;
++
++ /*
++ * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
++ * Grab a permanent reference to it, so that the normal code flow
++ * will not attempt to free it.
++ */
++ bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
++ atomic_inc(&bfqd->oom_bfqq.ref);
++
++ bfqd->queue = q;
++
++ spin_lock_irq(q->queue_lock);
++ q->elevator = eq;
++ spin_unlock_irq(q->queue_lock);
++
++ bfqg = bfq_alloc_root_group(bfqd, q->node);
++ if (bfqg == NULL) {
++ kfree(bfqd);
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++
++ bfqd->root_group = bfqg;
++#ifdef CONFIG_CGROUP_BFQIO
++ bfqd->active_numerous_groups = 0;
++#endif
++
++ init_timer(&bfqd->idle_slice_timer);
++ bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
++ bfqd->idle_slice_timer.data = (unsigned long)bfqd;
++
++ bfqd->rq_pos_tree = RB_ROOT;
++ bfqd->queue_weights_tree = RB_ROOT;
++ bfqd->group_weights_tree = RB_ROOT;
++
++ INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
++
++ INIT_LIST_HEAD(&bfqd->active_list);
++ INIT_LIST_HEAD(&bfqd->idle_list);
++
++ bfqd->hw_tag = -1;
++
++ bfqd->bfq_max_budget = bfq_default_max_budget;
++
++ bfqd->bfq_quantum = bfq_quantum;
++ bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
++ bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
++ bfqd->bfq_back_max = bfq_back_max;
++ bfqd->bfq_back_penalty = bfq_back_penalty;
++ bfqd->bfq_slice_idle = bfq_slice_idle;
++ bfqd->bfq_class_idle_last_service = 0;
++ bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
++ bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
++ bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
++
++ bfqd->bfq_coop_thresh = 2;
++ bfqd->bfq_failed_cooperations = 7000;
++ bfqd->bfq_requests_within_timer = 120;
++
++ bfqd->low_latency = true;
++
++ bfqd->bfq_wr_coeff = 20;
++ bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
++ bfqd->bfq_wr_max_time = 0;
++ bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
++ bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
++ bfqd->bfq_wr_max_softrt_rate = 7000; /*
++ * Approximate rate required
++ * to playback or record a
++ * high-definition compressed
++ * video.
++ */
++ bfqd->wr_busy_queues = 0;
++ bfqd->busy_in_flight_queues = 0;
++ bfqd->const_seeky_busy_in_flight_queues = 0;
++
++ /*
++ * Begin by assuming, optimistically, that the device peak rate is
++ * equal to the highest reference rate.
++ */
++ bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
++ T_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->device_speed = BFQ_BFQD_FAST;
++
++ return 0;
++}
++
++static void bfq_slab_kill(void)
++{
++ if (bfq_pool != NULL)
++ kmem_cache_destroy(bfq_pool);
++}
++
++static int __init bfq_slab_setup(void)
++{
++ bfq_pool = KMEM_CACHE(bfq_queue, 0);
++ if (bfq_pool == NULL)
++ return -ENOMEM;
++ return 0;
++}
++
++static ssize_t bfq_var_show(unsigned int var, char *page)
++{
++ return sprintf(page, "%d\n", var);
++}
++
++static ssize_t bfq_var_store(unsigned long *var, const char *page,
++ size_t count)
++{
++ unsigned long new_val;
++ int ret = kstrtoul(page, 10, &new_val);
++
++ if (ret == 0)
++ *var = new_val;
++
++ return count;
++}
++
++static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
++ jiffies_to_msecs(bfqd->bfq_wr_max_time) :
++ jiffies_to_msecs(bfq_wr_duration(bfqd)));
++}
++
++static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_queue *bfqq;
++ struct bfq_data *bfqd = e->elevator_data;
++ ssize_t num_char = 0;
++
++ num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
++ bfqd->queued);
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ num_char += sprintf(page + num_char, "Active:\n");
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ bfqq->queued[0],
++ bfqq->queued[1],
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ num_char += sprintf(page + num_char, "Idle:\n");
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++
++ return num_char;
++}
++
++#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
++static ssize_t __FUNC(struct elevator_queue *e, char *page) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned int __data = __VAR; \
++ if (__CONV) \
++ __data = jiffies_to_msecs(__data); \
++ return bfq_var_show(__data, (page)); \
++}
++SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
++SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
++SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
++SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
++SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
++SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
++SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
++SHOW_FUNCTION(bfq_max_budget_async_rq_show,
++ bfqd->bfq_max_budget_async_rq, 0);
++SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
++SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
++SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
++SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
++SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
++SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
++SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
++ 1);
++SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
++static ssize_t \
++__FUNC(struct elevator_queue *e, const char *page, size_t count) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned long uninitialized_var(__data); \
++ int ret = bfq_var_store(&__data, (page), count); \
++ if (__data < (MIN)) \
++ __data = (MIN); \
++ else if (__data > (MAX)) \
++ __data = (MAX); \
++ if (__CONV) \
++ *(__PTR) = msecs_to_jiffies(__data); \
++ else \
++ *(__PTR) = __data; \
++ return ret; \
++}
++STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
++STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
++ INT_MAX, 0);
++STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
++ 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
++ 1);
++STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
++ &bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
++ INT_MAX, 0);
++#undef STORE_FUNCTION
++
++/* do nothing for the moment */
++static ssize_t bfq_weights_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ return count;
++}
++
++static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
++{
++ u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
++ return bfq_calc_max_budget(bfqd->peak_rate, timeout);
++ else
++ return bfq_default_max_budget;
++}
++
++static ssize_t bfq_max_budget_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++ else {
++ if (__data > INT_MAX)
++ __data = INT_MAX;
++ bfqd->bfq_max_budget = __data;
++ }
++
++ bfqd->bfq_user_max_budget = __data;
++
++ return ret;
++}
++
++static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data < 1)
++ __data = 1;
++ else if (__data > INT_MAX)
++ __data = INT_MAX;
++
++ bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
++ if (bfqd->bfq_user_max_budget == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++
++ return ret;
++}
++
++static ssize_t bfq_low_latency_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data > 1)
++ __data = 1;
++ if (__data == 0 && bfqd->low_latency != 0)
++ bfq_end_wr(bfqd);
++ bfqd->low_latency = __data;
++
++ return ret;
++}
++
++#define BFQ_ATTR(name) \
++ __ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
++
++static struct elv_fs_entry bfq_attrs[] = {
++ BFQ_ATTR(quantum),
++ BFQ_ATTR(fifo_expire_sync),
++ BFQ_ATTR(fifo_expire_async),
++ BFQ_ATTR(back_seek_max),
++ BFQ_ATTR(back_seek_penalty),
++ BFQ_ATTR(slice_idle),
++ BFQ_ATTR(max_budget),
++ BFQ_ATTR(max_budget_async_rq),
++ BFQ_ATTR(timeout_sync),
++ BFQ_ATTR(timeout_async),
++ BFQ_ATTR(low_latency),
++ BFQ_ATTR(wr_coeff),
++ BFQ_ATTR(wr_max_time),
++ BFQ_ATTR(wr_rt_max_time),
++ BFQ_ATTR(wr_min_idle_time),
++ BFQ_ATTR(wr_min_inter_arr_async),
++ BFQ_ATTR(wr_max_softrt_rate),
++ BFQ_ATTR(weights),
++ __ATTR_NULL
++};
++
++static struct elevator_type iosched_bfq = {
++ .ops = {
++ .elevator_merge_fn = bfq_merge,
++ .elevator_merged_fn = bfq_merged_request,
++ .elevator_merge_req_fn = bfq_merged_requests,
++ .elevator_allow_merge_fn = bfq_allow_merge,
++ .elevator_dispatch_fn = bfq_dispatch_requests,
++ .elevator_add_req_fn = bfq_insert_request,
++ .elevator_activate_req_fn = bfq_activate_request,
++ .elevator_deactivate_req_fn = bfq_deactivate_request,
++ .elevator_completed_req_fn = bfq_completed_request,
++ .elevator_former_req_fn = elv_rb_former_request,
++ .elevator_latter_req_fn = elv_rb_latter_request,
++ .elevator_init_icq_fn = bfq_init_icq,
++ .elevator_exit_icq_fn = bfq_exit_icq,
++ .elevator_set_req_fn = bfq_set_request,
++ .elevator_put_req_fn = bfq_put_request,
++ .elevator_may_queue_fn = bfq_may_queue,
++ .elevator_init_fn = bfq_init_queue,
++ .elevator_exit_fn = bfq_exit_queue,
++ },
++ .icq_size = sizeof(struct bfq_io_cq),
++ .icq_align = __alignof__(struct bfq_io_cq),
++ .elevator_attrs = bfq_attrs,
++ .elevator_name = "bfq",
++ .elevator_owner = THIS_MODULE,
++};
++
++static int __init bfq_init(void)
++{
++ /*
++ * Can be 0 on HZ < 1000 setups.
++ */
++ if (bfq_slice_idle == 0)
++ bfq_slice_idle = 1;
++
++ if (bfq_timeout_async == 0)
++ bfq_timeout_async = 1;
++
++ if (bfq_slab_setup())
++ return -ENOMEM;
++
++ /*
++ * Times to load large popular applications for the typical systems
++ * installed on the reference devices (see the comments before the
++ * definitions of the two arrays).
++ */
++ T_slow[0] = msecs_to_jiffies(2600);
++ T_slow[1] = msecs_to_jiffies(1000);
++ T_fast[0] = msecs_to_jiffies(5500);
++ T_fast[1] = msecs_to_jiffies(2000);
++
++ /*
++ * Thresholds that determine the switch between speed classes (see
++ * the comments before the definition of the array).
++ */
++ device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
++ device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
++
++ elv_register(&iosched_bfq);
++	pr_info("BFQ I/O-scheduler version: v7r5\n");
++
++ return 0;
++}
++
++static void __exit bfq_exit(void)
++{
++ elv_unregister(&iosched_bfq);
++ bfq_slab_kill();
++}
++
++module_init(bfq_init);
++module_exit(bfq_exit);
++
++MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
++MODULE_LICENSE("GPL");
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+new file mode 100644
+index 0000000..c4831b7
+--- /dev/null
++++ b/block/bfq-sched.c
+@@ -0,0 +1,1207 @@
++/*
++ * BFQ: Hierarchical B-WF2Q+ scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = entity->parent)
++
++#define for_each_entity_safe(entity, parent) \
++ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
++
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd);
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++ struct bfq_entity *bfqg_entity;
++ struct bfq_group *bfqg;
++ struct bfq_sched_data *group_sd;
++
++ BUG_ON(next_in_service == NULL);
++
++ group_sd = next_in_service->sched_data;
++
++ bfqg = container_of(group_sd, struct bfq_group, sched_data);
++ /*
++ * bfq_group's my_entity field is not NULL only if the group
++ * is not the root group. We must not touch the root entity
++ * as it must never become an in-service entity.
++ */
++ bfqg_entity = bfqg->my_entity;
++ if (bfqg_entity != NULL)
++ bfqg_entity->budget = next_in_service->budget;
++}
++
++static int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ struct bfq_entity *next_in_service;
++
++ if (sd->in_service_entity != NULL)
++ /* will update/requeue at the end of service */
++ return 0;
++
++ /*
++ * NOTE: this can be improved in many ways, such as returning
++ * 1 (and thus propagating upwards the update) only when the
++ * budget changes, or caching the bfqq that will be scheduled
++	 * next from this subtree. For now we worry more about
++ * correctness than about performance...
++ */
++ next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
++ sd->next_in_service = next_in_service;
++
++ if (next_in_service != NULL)
++ bfq_update_budget(next_in_service);
++
++ return 1;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++ BUG_ON(sd->next_in_service != entity);
++}
++#else
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = NULL)
++
++#define for_each_entity_safe(entity, parent) \
++ for (parent = NULL; entity != NULL; entity = parent)
++
++static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ return 0;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++}
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++}
++#endif
++
++/*
++ * Shift for timestamp calculations. This actually limits the maximum
++ * service allowed in one timestamp delta (small shift values increase it),
++ * the maximum total weight that can be used for the queues in the system
++ * (big shift values increase it), and the period of virtual time
++ * wraparounds.
++ */
++#define WFQ_SERVICE_SHIFT 22
++
++/**
++ * bfq_gt - compare two timestamps.
++ * @a: first ts.
++ * @b: second ts.
++ *
++ * Return @a > @b, dealing with wrapping correctly.
++ */
++static inline int bfq_gt(u64 a, u64 b)
++{
++ return (s64)(a - b) > 0;
++}
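Outside the kernel tree, the wrap-safe comparison above can be sketched as a standalone C function (a minimal sketch; `uint64_t`/`int64_t` stand in for the kernel's `u64`/`s64`):

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "a > b" for free-running 64-bit timestamps: the unsigned
 * subtraction is well defined modulo 2^64, and reinterpreting the
 * difference as signed yields the right answer as long as the two
 * timestamps are less than 2^63 apart. */
static int bfq_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}
```

Note that just past a wraparound, 0 correctly compares as later than `UINT64_MAX`.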
++
++static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = NULL;
++
++ BUG_ON(entity == NULL);
++
++ if (entity->my_sched_data == NULL)
++ bfqq = container_of(entity, struct bfq_queue, entity);
++
++ return bfqq;
++}
++
++
++/**
++ * bfq_delta - map service into the virtual time domain.
++ * @service: amount of service.
++ * @weight: scale factor (weight of an entity or weight sum).
++ */
++static inline u64 bfq_delta(unsigned long service,
++ unsigned long weight)
++{
++ u64 d = (u64)service << WFQ_SERVICE_SHIFT;
++
++ do_div(d, weight);
++ return d;
++}
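As a userspace sketch (plain division standing in for the kernel's `do_div()`), the service-to-virtual-time mapping above is:

```c
#include <assert.h>
#include <stdint.h>

#define WFQ_SERVICE_SHIFT 22

/* Map an amount of service into the virtual-time domain: scale the
 * service up by the timestamp shift, then divide by the weight, so a
 * heavier entity accumulates virtual time more slowly for the same
 * amount of service. */
static uint64_t bfq_delta(unsigned long service, unsigned long weight)
{
	return ((uint64_t)service << WFQ_SERVICE_SHIFT) / weight;
}
```

Doubling the weight halves the virtual-time delta charged for a given amount of service.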
++
++/**
++ * bfq_calc_finish - assign the finish time to an entity.
++ * @entity: the entity to act upon.
++ * @service: the service to be charged to the entity.
++ */
++static inline void bfq_calc_finish(struct bfq_entity *entity,
++ unsigned long service)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(entity->weight == 0);
++
++ entity->finish = entity->start +
++ bfq_delta(service, entity->weight);
++
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: serv %lu, w %d",
++ service, entity->weight);
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: start %llu, finish %llu, delta %llu",
++ entity->start, entity->finish,
++ bfq_delta(service, entity->weight));
++ }
++}
++
++/**
++ * bfq_entity_of - get an entity from a node.
++ * @node: the node field of the entity.
++ *
++ * Convert a node pointer to the corresponding entity. This is used only
++ * to simplify the logic of some functions and not as the generic
++ * conversion mechanism because, e.g., in the tree walking functions,
++ * the check for a %NULL value would be redundant.
++ */
++static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
++{
++ struct bfq_entity *entity = NULL;
++
++ if (node != NULL)
++ entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ return entity;
++}
++
++/**
++ * bfq_extract - remove an entity from a tree.
++ * @root: the tree root.
++ * @entity: the entity to remove.
++ */
++static inline void bfq_extract(struct rb_root *root,
++ struct bfq_entity *entity)
++{
++ BUG_ON(entity->tree != root);
++
++ entity->tree = NULL;
++ rb_erase(&entity->rb_node, root);
++}
++
++/**
++ * bfq_idle_extract - extract an entity from the idle tree.
++ * @st: the service tree of the owning @entity.
++ * @entity: the entity being removed.
++ */
++static void bfq_idle_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *next;
++
++ BUG_ON(entity->tree != &st->idle);
++
++ if (entity == st->first_idle) {
++ next = rb_next(&entity->rb_node);
++ st->first_idle = bfq_entity_of(next);
++ }
++
++ if (entity == st->last_idle) {
++ next = rb_prev(&entity->rb_node);
++ st->last_idle = bfq_entity_of(next);
++ }
++
++ bfq_extract(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++}
++
++/**
++ * bfq_insert - generic tree insertion.
++ * @root: tree root.
++ * @entity: entity to insert.
++ *
++ * This is used for the idle and the active tree, since they are both
++ * ordered by finish time.
++ */
++static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
++{
++ struct bfq_entity *entry;
++ struct rb_node **node = &root->rb_node;
++ struct rb_node *parent = NULL;
++
++ BUG_ON(entity->tree != NULL);
++
++ while (*node != NULL) {
++ parent = *node;
++ entry = rb_entry(parent, struct bfq_entity, rb_node);
++
++ if (bfq_gt(entry->finish, entity->finish))
++ node = &parent->rb_left;
++ else
++ node = &parent->rb_right;
++ }
++
++ rb_link_node(&entity->rb_node, parent, node);
++ rb_insert_color(&entity->rb_node, root);
++
++ entity->tree = root;
++}
++
++/**
++ * bfq_update_min - update the min_start field of an entity.
++ * @entity: the entity to update.
++ * @node: one of its children.
++ *
++ * This function is called when @entity may store an invalid value for
++ * min_start due to updates to the active tree. The function assumes
++ * that the subtree rooted at @node (which may be its left or its right
++ * child) has a valid min_start value.
++ */
++static inline void bfq_update_min(struct bfq_entity *entity,
++ struct rb_node *node)
++{
++ struct bfq_entity *child;
++
++ if (node != NULL) {
++ child = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entity->min_start, child->min_start))
++ entity->min_start = child->min_start;
++ }
++}
++
++/**
++ * bfq_update_active_node - recalculate min_start.
++ * @node: the node to update.
++ *
++ * @node may have changed position or one of its children may have moved;
++ * this function updates its min_start value. The left and right subtrees
++ * are assumed to hold a correct min_start value.
++ */
++static inline void bfq_update_active_node(struct rb_node *node)
++{
++ struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ entity->min_start = entity->start;
++ bfq_update_min(entity, node->rb_right);
++ bfq_update_min(entity, node->rb_left);
++}
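The min_start augmentation can be illustrated with a minimal standalone sketch, using a hypothetical three-field node instead of the kernel's `rb_node` and a plain `<` comparison in place of the wrap-safe `bfq_gt()`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each node caches the minimum start time over itself and its subtrees;
 * this cached value is what lets the scheduler search for an eligible
 * entity in O(log n). */
struct toy_entity {
	uint64_t start, min_start;
	struct toy_entity *left, *right;
};

static void toy_update_min(struct toy_entity *e, struct toy_entity *child)
{
	if (child != NULL && child->min_start < e->min_start)
		e->min_start = child->min_start;
}

static void toy_update_active_node(struct toy_entity *e)
{
	e->min_start = e->start;
	toy_update_min(e, e->right);
	toy_update_min(e, e->left);
}
```

After an update, a parent's `min_start` is the minimum of its own start time and the cached minima of its children, mirroring what `bfq_update_active_node()` maintains on the real rb-tree.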
++
++/**
++ * bfq_update_active_tree - update min_start for the whole active tree.
++ * @node: the starting node.
++ *
++ * @node must be the deepest modified node after an update. This function
++ * updates its min_start using the values held by its children, assuming
++ * that they did not change, and then updates all the nodes that may have
++ * changed in the path to the root. The only nodes that may have changed
++ * are the ones in the path or their siblings.
++ */
++static void bfq_update_active_tree(struct rb_node *node)
++{
++ struct rb_node *parent;
++
++up:
++ bfq_update_active_node(node);
++
++ parent = rb_parent(node);
++ if (parent == NULL)
++ return;
++
++ if (node == parent->rb_left && parent->rb_right != NULL)
++ bfq_update_active_node(parent->rb_right);
++ else if (parent->rb_left != NULL)
++ bfq_update_active_node(parent->rb_left);
++
++ node = parent;
++ goto up;
++}
++
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++
++/**
++ * bfq_active_insert - insert an entity in the active tree of its
++ * group/device.
++ * @st: the service tree of the entity.
++ * @entity: the entity being inserted.
++ *
++ * The active tree is ordered by finish time, but an extra key is kept
++ * for each node, containing the minimum value for the start times of
++ * its children (and the node itself), so it's possible to search for
++ * the eligible node with the lowest finish time in logarithmic time.
++ */
++static void bfq_active_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node = &entity->rb_node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ bfq_insert(&st->active, entity);
++
++ if (node->rb_left != NULL)
++ node = node->rb_left;
++ else if (node->rb_right != NULL)
++ node = node->rb_right;
++
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ bfqg->active_entities++;
++ if (bfqg->active_entities == 2)
++ bfqd->active_numerous_groups++;
++ }
++#endif
++}
++
++/**
++ * bfq_ioprio_to_weight - calc a weight from an ioprio.
++ * @ioprio: the ioprio value to convert.
++ */
++static inline unsigned short bfq_ioprio_to_weight(int ioprio)
++{
++ BUG_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
++ return IOPRIO_BE_NR - ioprio;
++}
++
++/**
++ * bfq_weight_to_ioprio - calc an ioprio from a weight.
++ * @weight: the weight value to convert.
++ *
++ * To preserve as much as possible the old only-ioprio user interface,
++ * 0 is used as an escape ioprio value for weights (numerically) equal to
++ * or larger than IOPRIO_BE_NR.
++ */
++static inline unsigned short bfq_weight_to_ioprio(int weight)
++{
++ BUG_ON(weight < BFQ_MIN_WEIGHT || weight > BFQ_MAX_WEIGHT);
++ return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
++}
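The two conversions above can be exercised in a standalone sketch, assuming the kernel's usual `IOPRIO_BE_NR` of 8 best-effort priority levels (the `toy_` names are hypothetical, not part of the patch):

```c
#include <assert.h>

/* Assumed value of the kernel constant: 8 best-effort ioprio levels. */
#define IOPRIO_BE_NR 8

/* A lower ioprio number means higher priority, so it maps to a larger
 * weight. */
static unsigned short toy_ioprio_to_weight(int ioprio)
{
	return IOPRIO_BE_NR - ioprio;
}

/* Weights at or above IOPRIO_BE_NR have no ioprio equivalent and fall
 * back to the escape value 0. */
static unsigned short toy_weight_to_ioprio(int weight)
{
	return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
}
```

Round-tripping an in-range ioprio through both conversions returns the original value.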
++
++static inline void bfq_get_entity(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ if (bfqq != NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ }
++}
++
++/**
++ * bfq_find_deepest - find the deepest node that an extraction can modify.
++ * @node: the node being removed.
++ *
++ * Do the first step of an extraction in an rb tree, looking for the
++ * node that will replace @node, and returning the deepest node that
++ * the following modifications to the tree can touch. If @node is the
++ * last node in the tree return %NULL.
++ */
++static struct rb_node *bfq_find_deepest(struct rb_node *node)
++{
++ struct rb_node *deepest;
++
++ if (node->rb_right == NULL && node->rb_left == NULL)
++ deepest = rb_parent(node);
++ else if (node->rb_right == NULL)
++ deepest = node->rb_left;
++ else if (node->rb_left == NULL)
++ deepest = node->rb_right;
++ else {
++ deepest = rb_next(node);
++ if (deepest->rb_right != NULL)
++ deepest = deepest->rb_right;
++ else if (rb_parent(deepest) != node)
++ deepest = rb_parent(deepest);
++ }
++
++ return deepest;
++}
++
++/**
++ * bfq_active_extract - remove an entity from the active tree.
++ * @st: the service_tree containing the tree.
++ * @entity: the entity being removed.
++ */
++static void bfq_active_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ node = bfq_find_deepest(&entity->rb_node);
++ bfq_extract(&st->active, entity);
++
++ if (node != NULL)
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_remove(bfqd, entity,
++ &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ BUG_ON(!bfqg->active_entities);
++ bfqg->active_entities--;
++ if (bfqg->active_entities == 1) {
++ BUG_ON(!bfqd->active_numerous_groups);
++ bfqd->active_numerous_groups--;
++ }
++ }
++#endif
++}
++
++/**
++ * bfq_idle_insert - insert an entity into the idle tree.
++ * @st: the service tree containing the tree.
++ * @entity: the entity to insert.
++ */
++static void bfq_idle_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
++ st->first_idle = entity;
++ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
++ st->last_idle = entity;
++
++ bfq_insert(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
++}
++
++/**
++ * bfq_forget_entity - remove an entity from the wfq trees.
++ * @st: the service tree.
++ * @entity: the entity being removed.
++ *
++ * Update the device status and forget everything about @entity, putting
++ * the device reference to it, if it is a queue. Entities belonging to
++ * groups are not refcounted.
++ */
++static void bfq_forget_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_sched_data *sd;
++
++ BUG_ON(!entity->on_st);
++
++ entity->on_st = 0;
++ st->wsum -= entity->weight;
++ if (bfqq != NULL) {
++ sd = entity->sched_data;
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++/**
++ * bfq_put_idle_entity - release the idle tree ref of an entity.
++ * @st: service tree for the entity.
++ * @entity: the entity being released.
++ */
++static void bfq_put_idle_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ bfq_idle_extract(st, entity);
++ bfq_forget_entity(st, entity);
++}
++
++/**
++ * bfq_forget_idle - update the idle tree if necessary.
++ * @st: the service tree to act upon.
++ *
++ * To preserve the global O(log N) complexity we only remove one entry here;
++ * as the idle tree will not grow indefinitely this can be done safely.
++ */
++static void bfq_forget_idle(struct bfq_service_tree *st)
++{
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
++ !bfq_gt(last_idle->finish, st->vtime)) {
++ /*
++ * Forget the whole idle tree, increasing the vtime past
++ * the last finish time of idle entities.
++ */
++ st->vtime = last_idle->finish;
++ }
++
++ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
++ bfq_put_idle_entity(st, first_idle);
++}
++
++static struct bfq_service_tree *
++__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
++ struct bfq_entity *entity)
++{
++ struct bfq_service_tree *new_st = old_st;
++
++ if (entity->ioprio_changed) {
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ unsigned short prev_weight, new_weight;
++ struct bfq_data *bfqd = NULL;
++ struct rb_root *root;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd;
++ struct bfq_group *bfqg;
++#endif
++
++ if (bfqq != NULL)
++ bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++ else {
++ sd = entity->my_sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++ BUG_ON(!bfqd);
++ }
++#endif
++
++ BUG_ON(old_st->wsum < entity->weight);
++ old_st->wsum -= entity->weight;
++
++ if (entity->new_weight != entity->orig_weight) {
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio =
++ bfq_weight_to_ioprio(entity->orig_weight);
++ } else if (entity->new_ioprio != entity->ioprio) {
++ entity->ioprio = entity->new_ioprio;
++ entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++ } else
++ entity->new_weight = entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->ioprio_changed = 0;
++
++ /*
++ * NOTE: here we may be changing the weight too early,
++ * this will cause unfairness. The correct approach
++ * would have required additional complexity to defer
++ * weight changes to the proper time instants (i.e.,
++ * when entity->finish <= old_st->vtime).
++ */
++ new_st = bfq_entity_service_tree(entity);
++
++ prev_weight = entity->weight;
++ new_weight = entity->orig_weight *
++ (bfqq != NULL ? bfqq->wr_coeff : 1);
++ /*
++ * If the weight of the entity changes, remove the entity
++ * from its old weight counter (if there is a counter
++ * associated with the entity), and add it to the counter
++ * associated with its new weight.
++ */
++ if (prev_weight != new_weight) {
++ root = bfqq ? &bfqd->queue_weights_tree :
++ &bfqd->group_weights_tree;
++ bfq_weights_tree_remove(bfqd, entity, root);
++ }
++ entity->weight = new_weight;
++ /*
++ * Add the entity to its weights tree only if it is
++ * not associated with a weight-raised queue.
++ */
++ if (prev_weight != new_weight &&
++ (bfqq ? bfqq->wr_coeff == 1 : 1))
++ /* If we get here, root has been initialized. */
++ bfq_weights_tree_add(bfqd, entity, root);
++
++ new_st->wsum += entity->weight;
++
++ if (new_st != old_st)
++ entity->start = new_st->vtime;
++ }
++
++ return new_st;
++}
++
++/**
++ * bfq_bfqq_served - update the scheduler status after selection for
++ * service.
++ * @bfqq: the queue being served.
++ * @served: bytes to transfer.
++ *
++ * NOTE: this can be optimized, as the timestamps of upper level entities
++ * are synchronized every time a new bfqq is selected for service. For now,
++ * we keep it to better check consistency.
++ */
++static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st;
++
++ for_each_entity(entity) {
++ st = bfq_entity_service_tree(entity);
++
++ entity->service += served;
++ BUG_ON(entity->service > entity->budget);
++ BUG_ON(st->wsum == 0);
++
++ st->vtime += bfq_delta(served, st->wsum);
++ bfq_forget_idle(st);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served);
++}
++
++/**
++ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
++ * @bfqq: the queue that needs a service update.
++ *
++ * When it's not possible to be fair in the service domain, because
++ * a queue is not consuming its budget fast enough (the meaning of
++ * fast depends on the timeout parameter), we charge it a full
++ * budget. In this way we should obtain a sort of time-domain
++ * fairness among all the seeky/slow queues.
++ */
++static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
++
++ bfq_bfqq_served(bfqq, entity->budget - entity->service);
++}
++
++/**
++ * __bfq_activate_entity - activate an entity.
++ * @entity: the entity being activated.
++ *
++ * Called whenever an entity is activated, i.e., it is not active and one
++ * of its children receives a new request, or has to be reactivated due to
++ * budget exhaustion. It uses the current budget of the entity (and the
++ * service already received, if @entity is active) to calculate its
++ * timestamps.
++ */
++static void __bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++
++ if (entity == sd->in_service_entity) {
++ BUG_ON(entity->tree != NULL);
++ /*
++	 * If we are requeueing the current entity, we have
++	 * to take care not to charge it for service it has
++	 * not received.
++ */
++ bfq_calc_finish(entity, entity->service);
++ entity->start = entity->finish;
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active) {
++ /*
++ * Requeueing an entity due to a change of some
++ * next_in_service entity below it. We reuse the
++ * old start time.
++ */
++ bfq_active_extract(st, entity);
++ } else if (entity->tree == &st->idle) {
++ /*
++ * Must be on the idle tree, bfq_idle_extract() will
++ * check for that.
++ */
++ bfq_idle_extract(st, entity);
++ entity->start = bfq_gt(st->vtime, entity->finish) ?
++ st->vtime : entity->finish;
++ } else {
++ /*
++ * The finish time of the entity may be invalid, and
++ * it is in the past for sure, otherwise the queue
++ * would have been on the idle tree.
++ */
++ entity->start = st->vtime;
++ st->wsum += entity->weight;
++ bfq_get_entity(entity);
++
++ BUG_ON(entity->on_st);
++ entity->on_st = 1;
++ }
++
++ st = __bfq_entity_update_weight_prio(st, entity);
++ bfq_calc_finish(entity, entity->budget);
++ bfq_active_insert(st, entity);
++}
++
++/**
++ * bfq_activate_entity - activate an entity and its ancestors if necessary.
++ * @entity: the entity to activate.
++ *
++ * Activate @entity and all the entities on the path from it to the root.
++ */
++static void bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd;
++
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ /*
++ * No need to propagate the activation to the
++ * upper entities, as they will be updated when
++ * the in-service entity is rescheduled.
++ */
++ break;
++ }
++}
++
++/**
++ * __bfq_deactivate_entity - deactivate an entity from its service tree.
++ * @entity: the entity to deactivate.
++ * @requeue: if false, the entity will not be put into the idle tree.
++ *
++ * Deactivate an entity, independently from its previous state. If the
++ * entity was not on a service tree, just return; otherwise, if it is on
++ * any scheduler tree, extract it from that tree and, if necessary and
++ * if the caller specified @requeue, put it on the idle tree.
++ *
++ * Return %1 if the caller should update the entity hierarchy, i.e.,
++ * if the entity was in service or if it was the next_in_service for
++ * its sched_data; return %0 otherwise.
++ */
++static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ int was_in_service = entity == sd->in_service_entity;
++ int ret = 0;
++
++ if (!entity->on_st)
++ return 0;
++
++ BUG_ON(was_in_service && entity->tree != NULL);
++
++ if (was_in_service) {
++ bfq_calc_finish(entity, entity->service);
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active)
++ bfq_active_extract(st, entity);
++ else if (entity->tree == &st->idle)
++ bfq_idle_extract(st, entity);
++ else if (entity->tree != NULL)
++ BUG();
++
++ if (was_in_service || sd->next_in_service == entity)
++ ret = bfq_update_next_in_service(sd);
++
++ if (!requeue || !bfq_gt(entity->finish, st->vtime))
++ bfq_forget_entity(st, entity);
++ else
++ bfq_idle_insert(st, entity);
++
++ BUG_ON(sd->in_service_entity == entity);
++ BUG_ON(sd->next_in_service == entity);
++
++ return ret;
++}
++
++/**
++ * bfq_deactivate_entity - deactivate an entity.
++ * @entity: the entity to deactivate.
++ * @requeue: true if the entity can be put on the idle tree
++ */
++static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd;
++ struct bfq_entity *parent;
++
++ for_each_entity_safe(entity, parent) {
++ sd = entity->sched_data;
++
++ if (!__bfq_deactivate_entity(entity, requeue))
++ /*
++ * The parent entity is still backlogged, and
++ * we don't need to update it as it is still
++ * in service.
++ */
++ break;
++
++ if (sd->next_in_service != NULL)
++ /*
++ * The parent entity is still backlogged and
++ * the budgets on the path towards the root
++ * need to be updated.
++ */
++ goto update;
++
++ /*
++	 * If we reach this point, the parent is no longer backlogged and
++ * we want to propagate the dequeue upwards.
++ */
++ requeue = 1;
++ }
++
++ return;
++
++update:
++ entity = parent;
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ break;
++ }
++}
++
++/**
++ * bfq_update_vtime - update vtime if necessary.
++ * @st: the service tree to act upon.
++ *
++ * If necessary update the service tree vtime to have at least one
++ * eligible entity, skipping to its start time. Assumes that the
++ * active tree of the device is not empty.
++ *
++ * NOTE: this hierarchical implementation updates vtimes quite often;
++ * we may end up with reactivated processes getting timestamps after a
++ * vtime skip performed because we needed a ->first_active entity on some
++ * intermediate node.
++ */
++static void bfq_update_vtime(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry;
++ struct rb_node *node = st->active.rb_node;
++
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entry->min_start, st->vtime)) {
++ st->vtime = entry->min_start;
++ bfq_forget_idle(st);
++ }
++}
++
++/**
++ * bfq_first_active_entity - find the eligible entity with
++ * the smallest finish time
++ * @st: the service tree to select from.
++ *
++ * This function searches for the first schedulable entity, starting from
++ * the root of the tree and going left whenever the left subtree contains
++ * at least one eligible (start >= vtime) entity. The path on
++ * the right is followed only if a) the left subtree contains no eligible
++ * entities and b) no eligible entity has been found yet.
++ */
++static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry, *first = NULL;
++ struct rb_node *node = st->active.rb_node;
++
++ while (node != NULL) {
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++left:
++ if (!bfq_gt(entry->start, st->vtime))
++ first = entry;
++
++ BUG_ON(bfq_gt(entry->min_start, st->vtime));
++
++ if (node->rb_left != NULL) {
++ entry = rb_entry(node->rb_left,
++ struct bfq_entity, rb_node);
++ if (!bfq_gt(entry->min_start, st->vtime)) {
++ node = node->rb_left;
++ goto left;
++ }
++ }
++ if (first != NULL)
++ break;
++ node = node->rb_right;
++ }
++
++ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
++ return first;
++}
++
++/**
++ * __bfq_lookup_next_entity - return the first eligible entity in @st.
++ * @st: the service tree.
++ *
++ * Update the virtual time in @st and return the first eligible entity
++ * it contains.
++ */
++static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
++ bool force)
++{
++ struct bfq_entity *entity, *new_next_in_service = NULL;
++
++ if (RB_EMPTY_ROOT(&st->active))
++ return NULL;
++
++ bfq_update_vtime(st);
++ entity = bfq_first_active_entity(st);
++ BUG_ON(bfq_gt(entity->start, st->vtime));
++
++ /*
++ * If the chosen entity does not match with the sched_data's
++ * next_in_service and we are forcibly serving the IDLE priority
++ * class tree, bubble up budget update.
++ */
++ if (unlikely(force && entity != entity->sched_data->next_in_service)) {
++ new_next_in_service = entity;
++ for_each_entity(new_next_in_service)
++ bfq_update_budget(new_next_in_service);
++ }
++
++ return entity;
++}
++
++/**
++ * bfq_lookup_next_entity - return the first eligible entity in @sd.
++ * @sd: the sched_data.
++ * @extract: if true the returned entity will also be extracted from @sd.
++ *
++ * NOTE: since we cache the next_in_service entity at each level of the
++ * hierarchy, the complexity of the lookup can be decreased with
++ * absolutely no effort, just returning the cached next_in_service value;
++ * we prefer to do full lookups to test the consistency of the data
++ * structures.
++ */
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd)
++{
++ struct bfq_service_tree *st = sd->service_tree;
++ struct bfq_entity *entity;
++ int i = 0;
++
++ BUG_ON(sd->in_service_entity != NULL);
++
++ if (bfqd != NULL &&
++ jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
++ entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
++ true);
++ if (entity != NULL) {
++ i = BFQ_IOPRIO_CLASSES - 1;
++ bfqd->bfq_class_idle_last_service = jiffies;
++ sd->next_in_service = entity;
++ }
++ }
++ for (; i < BFQ_IOPRIO_CLASSES; i++) {
++ entity = __bfq_lookup_next_entity(st + i, false);
++ if (entity != NULL) {
++ if (extract) {
++ bfq_check_next_in_service(sd, entity);
++ bfq_active_extract(st + i, entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ }
++ break;
++ }
++ }
++
++ return entity;
++}
++
++/*
++ * Get next queue for service.
++ */
++static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
++{
++ struct bfq_entity *entity = NULL;
++ struct bfq_sched_data *sd;
++ struct bfq_queue *bfqq;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ if (bfqd->busy_queues == 0)
++ return NULL;
++
++ sd = &bfqd->root_group->sched_data;
++ for (; sd != NULL; sd = entity->my_sched_data) {
++ entity = bfq_lookup_next_entity(sd, 1, bfqd);
++ BUG_ON(entity == NULL);
++ entity->service = 0;
++ }
++
++ bfqq = bfq_entity_to_bfqq(entity);
++ BUG_ON(bfqq == NULL);
++
++ return bfqq;
++}
++
++/*
++ * Forced extraction of the given queue.
++ */
++static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity;
++ struct bfq_sched_data *sd;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ entity = &bfqq->entity;
++ /*
++ * Bubble up extraction/update from the leaf to the root.
++ */
++ for_each_entity(entity) {
++ sd = entity->sched_data;
++ bfq_update_budget(entity);
++ bfq_update_vtime(bfq_entity_service_tree(entity));
++ bfq_active_extract(bfq_entity_service_tree(entity), entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ entity->service = 0;
++ }
++
++ return;
++}
++
++static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
++{
++ if (bfqd->in_service_bic != NULL) {
++ put_io_context(bfqd->in_service_bic->icq.ioc);
++ bfqd->in_service_bic = NULL;
++ }
++
++ bfqd->in_service_queue = NULL;
++ del_timer(&bfqd->idle_slice_timer);
++}
++
++static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ if (bfqq == bfqd->in_service_queue)
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ bfq_deactivate_entity(entity, requeue);
++}
++
++static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_activate_entity(entity);
++}
++
++/*
++ * Called when the bfqq no longer has requests pending, remove it from
++ * the service tree.
++ */
++static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ BUG_ON(!bfq_bfqq_busy(bfqq));
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ bfq_log_bfqq(bfqd, bfqq, "del from busy");
++
++ bfq_clear_bfqq_busy(bfqq);
++
++ BUG_ON(bfqd->busy_queues == 0);
++ bfqd->busy_queues--;
++
++ if (!bfqq->dispatched) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues--;
++
++ bfq_deactivate_bfqq(bfqd, bfqq, requeue);
++}
++
++/*
++ * Called when an inactive queue receives a new request.
++ */
++static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqq == bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "add to busy");
++
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ bfq_mark_bfqq_busy(bfqq);
++ bfqd->busy_queues++;
++
++ if (!bfqq->dispatched) {
++ if (bfqq->wr_coeff == 1)
++ bfq_weights_tree_add(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ bfqd->busy_in_flight_queues++;
++ if (bfq_bfqq_constantly_seeky(bfqq))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues++;
++}
+diff --git a/block/bfq.h b/block/bfq.h
+new file mode 100644
+index 0000000..a83e69d
+--- /dev/null
++++ b/block/bfq.h
+@@ -0,0 +1,742 @@
++/*
++ * BFQ-v7r5 for 3.16.0: data structures and common function prototypes.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifndef _BFQ_H
++#define _BFQ_H
++
++#include <linux/blktrace_api.h>
++#include <linux/hrtimer.h>
++#include <linux/ioprio.h>
++#include <linux/rbtree.h>
++
++#define BFQ_IOPRIO_CLASSES 3
++#define BFQ_CL_IDLE_TIMEOUT (HZ/5)
++
++#define BFQ_MIN_WEIGHT 1
++#define BFQ_MAX_WEIGHT 1000
++
++#define BFQ_DEFAULT_GRP_WEIGHT 10
++#define BFQ_DEFAULT_GRP_IOPRIO 0
++#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
++
++struct bfq_entity;
++
++/**
++ * struct bfq_service_tree - per ioprio_class service tree.
++ * @active: tree for active entities (i.e., those backlogged).
++ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
++ * @first_idle: idle entity with minimum F_i.
++ * @last_idle: idle entity with maximum F_i.
++ * @vtime: scheduler virtual time.
++ * @wsum: scheduler weight sum; active and idle entities contribute to it.
++ *
++ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
++ * ioprio_class has its own independent scheduler, and so its own
++ * bfq_service_tree. All the fields are protected by the queue lock
++ * of the containing bfqd.
++ */
++struct bfq_service_tree {
++ struct rb_root active;
++ struct rb_root idle;
++
++ struct bfq_entity *first_idle;
++ struct bfq_entity *last_idle;
++
++ u64 vtime;
++ unsigned long wsum;
++};
++
++/**
++ * struct bfq_sched_data - multi-class scheduler.
++ * @in_service_entity: entity in service.
++ * @next_in_service: head-of-the-line entity in the scheduler.
++ * @service_tree: array of service trees, one per ioprio_class.
++ *
++ * bfq_sched_data is the basic scheduler queue. It supports three
++ * ioprio_classes, and can be used either as a toplevel queue or as
++ * an intermediate queue on a hierarchical setup.
++ * @next_in_service points to the active entity of the sched_data
++ * service trees that will be scheduled next.
++ *
++ * The supported ioprio_classes are the same as in CFQ, in descending
++ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
++ * Requests from higher priority queues are served before all the
++ * requests from lower priority queues; among requests of the same
++ * queue requests are served according to B-WF2Q+.
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_sched_data {
++ struct bfq_entity *in_service_entity;
++ struct bfq_entity *next_in_service;
++ struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
++};
++
++/**
++ * struct bfq_weight_counter - counter of the number of all active entities
++ * with a given weight.
++ * @weight: weight of the entities that this counter refers to.
++ * @num_active: number of active entities with this weight.
++ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
++ * and @group_weights_tree).
++ */
++struct bfq_weight_counter {
++ short int weight;
++ unsigned int num_active;
++ struct rb_node weights_node;
++};
++
++/**
++ * struct bfq_entity - schedulable entity.
++ * @rb_node: service_tree member.
++ * @weight_counter: pointer to the weight counter associated with this entity.
++ * @on_st: flag, true if the entity is on a tree (either the active or
++ * the idle one of its service_tree).
++ * @finish: B-WF2Q+ finish timestamp (aka F_i).
++ * @start: B-WF2Q+ start timestamp (aka S_i).
++ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
++ * @min_start: minimum start time of the (active) subtree rooted at
++ * this entity; used for O(log N) lookups into active trees.
++ * @service: service received during the last round of service.
++ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
++ * @weight: weight of the queue
++ * @parent: parent entity, for hierarchical scheduling.
++ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
++ * associated scheduler queue, %NULL on leaf nodes.
++ * @sched_data: the scheduler queue this entity belongs to.
++ * @ioprio: the ioprio in use.
++ * @new_weight: when a weight change is requested, the new weight value.
++ * @orig_weight: original weight, used to implement weight boosting
++ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
++ * @ioprio_class: the ioprio_class in use.
++ * @new_ioprio_class: when an ioprio_class change is requested, the new
++ * ioprio_class value.
++ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
++ * ioprio_class change.
++ *
++ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
++ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
++ * entity belongs to the sched_data of the parent group in the cgroup
++ * hierarchy. Non-leaf entities have also their own sched_data, stored
++ * in @my_sched_data.
++ *
++ * Each entity stores independently its priority values; this would
++ * allow different weights on different devices, but this
++ * functionality is not exported to userspace for now. Priorities and
++ * weights are updated lazily, first storing the new values into the
++ * new_* fields, then setting the @ioprio_changed flag. As soon as
++ * there is a transition in the entity state that allows the priority
++ * update to take place the effective and the requested priority
++ * values are synchronized.
++ *
++ * Unless cgroups are used, the weight value is calculated from the
++ * ioprio to export the same interface as CFQ. When dealing with
++ * ``well-behaved'' queues (i.e., queues that do not spend too much
++ * time to consume their budget and have true sequential behavior, and
++ * when there are no external factors breaking anticipation) the
++ * relative weights at each level of the cgroups hierarchy should be
++ * guaranteed. All the fields are protected by the queue lock of the
++ * containing bfqd.
++ */
++struct bfq_entity {
++ struct rb_node rb_node;
++ struct bfq_weight_counter *weight_counter;
++
++ int on_st;
++
++ u64 finish;
++ u64 start;
++
++ struct rb_root *tree;
++
++ u64 min_start;
++
++ unsigned long service, budget;
++ unsigned short weight, new_weight;
++ unsigned short orig_weight;
++
++ struct bfq_entity *parent;
++
++ struct bfq_sched_data *my_sched_data;
++ struct bfq_sched_data *sched_data;
++
++ unsigned short ioprio, new_ioprio;
++ unsigned short ioprio_class, new_ioprio_class;
++
++ int ioprio_changed;
++};
++
++struct bfq_group;
++
++/**
++ * struct bfq_queue - leaf schedulable entity.
++ * @ref: reference counter.
++ * @bfqd: parent bfq_data.
++ * @new_bfqq: shared bfq_queue if queue is cooperating with
++ * one or more other queues.
++ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
++ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
++ * @sort_list: sorted list of pending requests.
++ * @next_rq: if fifo isn't expired, next request to serve.
++ * @queued: nr of requests queued in @sort_list.
++ * @allocated: currently allocated requests.
++ * @meta_pending: pending metadata requests.
++ * @fifo: fifo list of requests in sort_list.
++ * @entity: entity representing this queue in the scheduler.
++ * @max_budget: maximum budget allowed from the feedback mechanism.
++ * @budget_timeout: budget expiration (in jiffies).
++ * @dispatched: number of requests on the dispatch list or inside driver.
++ * @flags: status flags.
++ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
++ * @seek_samples: number of seeks sampled
++ * @seek_total: sum of the distances of the seeks sampled
++ * @seek_mean: mean seek distance
++ * @last_request_pos: position of the last request enqueued
++ * @requests_within_timer: number of consecutive pairs of request completion
++ * and arrival, such that the queue becomes idle
++ * after the completion, but the next request arrives
++ * within an idle time slice; used only if the queue's
++ * IO_bound has been cleared.
++ * @pid: pid of the process owning the queue, used for logging purposes.
++ * @last_wr_start_finish: start time of the current weight-raising period if
++ * the @bfq-queue is being weight-raised, otherwise
++ * finish time of the last weight-raising period
++ * @wr_cur_max_time: current max raising time for this queue
++ * @soft_rt_next_start: minimum time instant such that, only if a new
++ * request is enqueued after this time instant in an
++ * idle @bfq_queue with no outstanding requests, then
++ * the task associated with the queue it is deemed as
++ * soft real-time (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
++ * idle to backlogged
++ * @service_from_backlogged: cumulative service received from the @bfq_queue
++ * since the last transition from idle to
++ * backlogged
++ *
++ * A bfq_queue is a leaf request queue; it can be associated with one or
++ * more io_contexts if it is async or shared between cooperating processes. @cgroup
++ * holds a reference to the cgroup, to be sure that it does not disappear while
++ * a bfqq still references it (mostly to avoid races between request issuing and
++ * task migration followed by cgroup destruction).
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_queue {
++ atomic_t ref;
++ struct bfq_data *bfqd;
++
++ /* fields for cooperating queues handling */
++ struct bfq_queue *new_bfqq;
++ struct rb_node pos_node;
++ struct rb_root *pos_root;
++
++ struct rb_root sort_list;
++ struct request *next_rq;
++ int queued[2];
++ int allocated[2];
++ int meta_pending;
++ struct list_head fifo;
++
++ struct bfq_entity entity;
++
++ unsigned long max_budget;
++ unsigned long budget_timeout;
++
++ int dispatched;
++
++ unsigned int flags;
++
++ struct list_head bfqq_list;
++
++ unsigned int seek_samples;
++ u64 seek_total;
++ sector_t seek_mean;
++ sector_t last_request_pos;
++
++ unsigned int requests_within_timer;
++
++ pid_t pid;
++
++ /* weight-raising fields */
++ unsigned long wr_cur_max_time;
++ unsigned long soft_rt_next_start;
++ unsigned long last_wr_start_finish;
++ unsigned int wr_coeff;
++ unsigned long last_idle_bklogged;
++ unsigned long service_from_backlogged;
++};
++
++/**
++ * struct bfq_ttime - per process thinktime stats.
++ * @ttime_total: total process thinktime
++ * @ttime_samples: number of thinktime samples
++ * @ttime_mean: average process thinktime
++ */
++struct bfq_ttime {
++ unsigned long last_end_request;
++
++ unsigned long ttime_total;
++ unsigned long ttime_samples;
++ unsigned long ttime_mean;
++};
++
++/**
++ * struct bfq_io_cq - per (request_queue, io_context) structure.
++ * @icq: associated io_cq structure
++ * @bfqq: array of two process queues, the sync and the async
++ * @ttime: associated @bfq_ttime struct
++ */
++struct bfq_io_cq {
++ struct io_cq icq; /* must be the first member */
++ struct bfq_queue *bfqq[2];
++ struct bfq_ttime ttime;
++ int ioprio;
++};
++
++enum bfq_device_speed {
++ BFQ_BFQD_FAST,
++ BFQ_BFQD_SLOW,
++};
++
++/**
++ * struct bfq_data - per device data structure.
++ * @queue: request queue for the managed device.
++ * @root_group: root bfq_group for the device.
++ * @rq_pos_tree: rbtree sorted by next_request position, used when
++ * determining if two or more queues have interleaving
++ * requests (see bfq_close_cooperator()).
++ * @active_numerous_groups: number of bfq_groups containing more than one
++ * active @bfq_entity.
++ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
++ * weight. Used to keep track of whether all @bfq_queues
++ * have the same weight. The tree contains one counter
++ * for each distinct weight associated to some active
++ * and not weight-raised @bfq_queue (see the comments to
++ * the functions bfq_weights_tree_[add|remove] for
++ * further details).
++ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
++ * by weight. Used to keep track of whether all
++ * @bfq_groups have the same weight. The tree contains
++ * one counter for each distinct weight associated to
++ * some active @bfq_group (see the comments to the
++ * functions bfq_weights_tree_[add|remove] for further
++ * details).
++ * @busy_queues: number of bfq_queues containing requests (including the
++ * queue in service, even if it is idling).
++ * @busy_in_flight_queues: number of @bfq_queues containing pending or
++ * in-flight requests, plus the @bfq_queue in
++ * service, even if idle but waiting for the
++ * possible arrival of its next sync request. This
++ * field is updated only if the device is rotational,
++ * but used only if the device is also NCQ-capable.
++ * The reason why the field is updated also for non-
++ * NCQ-capable rotational devices is related to the
++ * fact that the value of @hw_tag may be set also
++ * later than when busy_in_flight_queues may need to
++ * be incremented for the first time(s). Taking also
++ * this possibility into account, to avoid unbalanced
++ * increments/decrements, would imply more overhead
++ * than just updating busy_in_flight_queues
++ * regardless of the value of @hw_tag.
++ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues
++ * (that is, seeky queues that expired
++ * for budget timeout at least once)
++ * containing pending or in-flight
++ * requests, including the in-service
++ * @bfq_queue if constantly seeky. This
++ * field is updated only if the device
++ * is rotational, but used only if the
++ * device is also NCQ-capable (see the
++ * comments to @busy_in_flight_queues).
++ * @wr_busy_queues: number of weight-raised busy @bfq_queues.
++ * @queued: number of queued requests.
++ * @rq_in_driver: number of requests dispatched and waiting for completion.
++ * @sync_flight: number of sync requests in the driver.
++ * @max_rq_in_driver: max number of reqs in driver in the last
++ * @hw_tag_samples completed requests.
++ * @hw_tag_samples: nr of samples used to calculate hw_tag.
++ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
++ * @budgets_assigned: number of budgets assigned.
++ * @idle_slice_timer: timer set when idling for the next sequential request
++ * from the queue in service.
++ * @unplug_work: delayed work to restart dispatching on the request queue.
++ * @in_service_queue: bfq_queue in service.
++ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
++ * @last_position: on-disk position of the last served request.
++ * @last_budget_start: beginning of the last budget.
++ * @last_idling_start: beginning of the last idle slice.
++ * @peak_rate: peak transfer rate observed for a budget.
++ * @peak_rate_samples: number of samples used to calculate @peak_rate.
++ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
++ * rescheduling.
++ * @group_list: list of all the bfq_groups active on the device.
++ * @active_list: list of all the bfq_queues active on the device.
++ * @idle_list: list of all the bfq_queues idle on the device.
++ * @bfq_quantum: max number of requests dispatched per dispatch round.
++ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
++ * requests are served in fifo order.
++ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
++ * @bfq_back_max: maximum allowed backward seek.
++ * @bfq_slice_idle: maximum idling time.
++ * @bfq_user_max_budget: user-configured max budget value
++ * (0 for auto-tuning).
++ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
++ * async queues.
++ * @bfq_timeout: timeout for bfq_queues to consume their budget; used
++ * to prevent seeky queues from imposing long latencies on
++ * well-behaved ones (this also implies that seeky queues cannot
++ * receive guarantees in the service domain; after a timeout
++ * they are charged for the whole allocated budget, to try
++ * to preserve a behavior reasonably fair among them, but
++ * without service-domain guarantees).
++ * @bfq_coop_thresh: number of queue merges after which a @bfq_queue is
++ * no more granted any weight-raising.
++ * @bfq_failed_cooperations: number of consecutive failed cooperation
++ * chances after which weight-raising is restored
++ * to a queue subject to more than bfq_coop_thresh
++ * queue merges.
++ * @bfq_requests_within_timer: number of consecutive requests that must be
++ * issued within the idle time slice to set
++ * again idling to a queue which was marked as
++ * non-I/O-bound (see the definition of the
++ * IO_bound flag for further details).
++ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
++ * queue is multiplied
++ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
++ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
++ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
++ * may be reactivated for a queue (in jiffies)
++ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
++ * after which weight-raising may be
++ * reactivated for an already busy queue
++ * (in jiffies)
++ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
++ * sectors per seconds
++ * @RT_prod: cached value of the product R*T used for computing the maximum
++ * duration of the weight raising automatically
++ * @device_speed: device-speed class for the low-latency heuristic
++ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
++ *
++ * All the fields are protected by the @queue lock.
++ */
++struct bfq_data {
++ struct request_queue *queue;
++
++ struct bfq_group *root_group;
++ struct rb_root rq_pos_tree;
++
++#ifdef CONFIG_CGROUP_BFQIO
++ int active_numerous_groups;
++#endif
++
++ struct rb_root queue_weights_tree;
++ struct rb_root group_weights_tree;
++
++ int busy_queues;
++ int busy_in_flight_queues;
++ int const_seeky_busy_in_flight_queues;
++ int wr_busy_queues;
++ int queued;
++ int rq_in_driver;
++ int sync_flight;
++
++ int max_rq_in_driver;
++ int hw_tag_samples;
++ int hw_tag;
++
++ int budgets_assigned;
++
++ struct timer_list idle_slice_timer;
++ struct work_struct unplug_work;
++
++ struct bfq_queue *in_service_queue;
++ struct bfq_io_cq *in_service_bic;
++
++ sector_t last_position;
++
++ ktime_t last_budget_start;
++ ktime_t last_idling_start;
++ int peak_rate_samples;
++ u64 peak_rate;
++ unsigned long bfq_max_budget;
++
++ struct hlist_head group_list;
++ struct list_head active_list;
++ struct list_head idle_list;
++
++ unsigned int bfq_quantum;
++ unsigned int bfq_fifo_expire[2];
++ unsigned int bfq_back_penalty;
++ unsigned int bfq_back_max;
++ unsigned int bfq_slice_idle;
++ u64 bfq_class_idle_last_service;
++
++ unsigned int bfq_user_max_budget;
++ unsigned int bfq_max_budget_async_rq;
++ unsigned int bfq_timeout[2];
++
++ unsigned int bfq_coop_thresh;
++ unsigned int bfq_failed_cooperations;
++ unsigned int bfq_requests_within_timer;
++
++ bool low_latency;
++
++ /* parameters of the low_latency heuristics */
++ unsigned int bfq_wr_coeff;
++ unsigned int bfq_wr_max_time;
++ unsigned int bfq_wr_rt_max_time;
++ unsigned int bfq_wr_min_idle_time;
++ unsigned long bfq_wr_min_inter_arr_async;
++ unsigned int bfq_wr_max_softrt_rate;
++ u64 RT_prod;
++ enum bfq_device_speed device_speed;
++
++ struct bfq_queue oom_bfqq;
++};
++
++enum bfqq_state_flags {
++ BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */
++ BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */
++ BFQ_BFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
++ BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
++ BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */
++ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
++ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
++ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
++ BFQ_BFQQ_FLAG_IO_bound, /*
++ * bfqq has timed-out at least once
++ * having consumed at most 2/10 of
++ * its budget
++ */
++ BFQ_BFQQ_FLAG_constantly_seeky, /*
++ * bfqq has proved to be slow and
++ * seeky until budget timeout
++ */
++ BFQ_BFQQ_FLAG_softrt_update, /*
++ * may need softrt-next-start
++ * update
++ */
++ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++};
++
++#define BFQ_BFQQ_FNS(name) \
++static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq) \
++{ \
++ return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \
++}
++
++BFQ_BFQQ_FNS(busy);
++BFQ_BFQQ_FNS(wait_request);
++BFQ_BFQQ_FNS(must_alloc);
++BFQ_BFQQ_FNS(fifo_expire);
++BFQ_BFQQ_FNS(idle_window);
++BFQ_BFQQ_FNS(prio_changed);
++BFQ_BFQQ_FNS(sync);
++BFQ_BFQQ_FNS(budget_new);
++BFQ_BFQQ_FNS(IO_bound);
++BFQ_BFQQ_FNS(constantly_seeky);
++BFQ_BFQQ_FNS(coop);
++BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(softrt_update);
++#undef BFQ_BFQQ_FNS
++
++/* Logging facilities. */
++#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
++
++#define bfq_log(bfqd, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
++
++/* Expiration reasons. */
++enum bfqq_expiration {
++ BFQ_BFQQ_TOO_IDLE = 0, /*
++ * queue has been idling for
++ * too long
++ */
++ BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */
++ BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */
++ BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */
++};
++
++#ifdef CONFIG_CGROUP_BFQIO
++/**
++ * struct bfq_group - per (device, cgroup) data structure.
++ * @entity: schedulable entity to insert into the parent group sched_data.
++ * @sched_data: own sched_data, to contain child entities (they may be
++ * both bfq_queues and bfq_groups).
++ * @group_node: node to be inserted into the bfqio_cgroup->group_data
++ * list of the containing cgroup's bfqio_cgroup.
++ * @bfqd_node: node to be inserted into the @bfqd->group_list list
++ * of the groups active on the same device; used for cleanup.
++ * @bfqd: the bfq_data for the device this group acts upon.
++ * @async_bfqq: array of async queues for all the tasks belonging to
++ * the group, one queue per ioprio value per ioprio_class,
++ * except for the idle class that has only one queue.
++ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
++ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
++ * to avoid too many special cases during group creation/
++ * migration.
++ * @active_entities: number of active entities belonging to the group;
++ * unused for the root group. Used to know whether there
++ * are groups with more than one active @bfq_entity
++ * (see the comments to the function
++ * bfq_bfqq_must_not_expire()).
++ *
++ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
++ * there is a set of bfq_groups, each one collecting the lower-level
++ * entities belonging to the group that are acting on the same device.
++ *
++ * Locking works as follows:
++ * o @group_node is protected by the bfqio_cgroup lock, and is accessed
++ * via RCU from its readers.
++ * o @bfqd is protected by the queue lock, RCU is used to access it
++ * from the readers.
++ * o All the other fields are protected by the @bfqd queue lock.
++ */
++struct bfq_group {
++ struct bfq_entity entity;
++ struct bfq_sched_data sched_data;
++
++ struct hlist_node group_node;
++ struct hlist_node bfqd_node;
++
++ void *bfqd;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++
++ struct bfq_entity *my_entity;
++
++ int active_entities;
++};
++
++/**
++ * struct bfqio_cgroup - bfq cgroup data structure.
++ * @css: subsystem state for bfq in the containing cgroup.
++ * @online: flag marked when the subsystem is inserted.
++ * @weight: cgroup weight.
++ * @ioprio: cgroup ioprio.
++ * @ioprio_class: cgroup ioprio_class.
++ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
++ * @group_data: list containing the bfq_group belonging to this cgroup.
++ *
++ * @group_data is accessed using RCU, with @lock protecting the updates,
++ * @ioprio and @ioprio_class are protected by @lock.
++ */
++struct bfqio_cgroup {
++ struct cgroup_subsys_state css;
++ bool online;
++
++ unsigned short weight, ioprio, ioprio_class;
++
++ spinlock_t lock;
++ struct hlist_head group_data;
++};
++#else
++struct bfq_group {
++ struct bfq_sched_data sched_data;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++};
++#endif
++
++static inline struct bfq_service_tree *
++bfq_entity_service_tree(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sched_data = entity->sched_data;
++ unsigned int idx = entity->ioprio_class - 1;
++
++ BUG_ON(idx >= BFQ_IOPRIO_CLASSES);
++ BUG_ON(sched_data == NULL);
++
++ return sched_data->service_tree + idx;
++}
++
++static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
++ int is_sync)
++{
++ return bic->bfqq[!!is_sync];
++}
++
++static inline void bic_set_bfqq(struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, int is_sync)
++{
++ bic->bfqq[!!is_sync] = bfqq;
++}
++
++static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
++{
++ return bic->icq.q->elevator->elevator_data;
++}
++
++/**
++ * bfq_get_bfqd_locked - get a lock to a bfqd using a RCU protected pointer.
++ * @ptr: a pointer to a bfqd.
++ * @flags: storage for the flags to be saved.
++ *
++ * This function allows bfqg->bfqd to be protected by the
++ * queue lock of the bfqd they reference; the pointer is dereferenced
++ * under RCU, so the storage for bfqd is assured to be safe as long
++ * as the RCU read side critical section does not end. After the
++ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
++ * sure that no other writer accessed it. If we raced with a writer,
++ * the function returns NULL, with the queue unlocked, otherwise it
++ * returns the dereferenced pointer, with the queue locked.
++ */
++static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
++ unsigned long *flags)
++{
++ struct bfq_data *bfqd;
++
++ rcu_read_lock();
++ bfqd = rcu_dereference(*(struct bfq_data **)ptr);
++
++ if (bfqd != NULL) {
++ spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
++ if (*ptr == bfqd)
++ goto out;
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++ }
++
++ bfqd = NULL;
++out:
++ rcu_read_unlock();
++ return bfqd;
++}
++
++static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
++ unsigned long *flags)
++{
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic);
++static void bfq_put_queue(struct bfq_queue *bfqq);
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask);
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg);
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
++
++#endif /* _BFQ_H */
+--
+2.0.3
+
diff --git a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
new file mode 100644
index 0000000..e606f5d
--- /dev/null
+++ b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
@@ -0,0 +1,1188 @@
+From 5b290be286aa74051b4b77a216032b771ceadd23 Mon Sep 17 00:00:00 2001
+From: Mauro Andreolini <mauro.andreolini@unimore.it>
+Date: Wed, 18 Jun 2014 17:38:07 +0200
+Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r5 for
+ 3.16.0
+
+A set of processes may happen to perform interleaved reads, i.e., requests
+whose union would give rise to a sequential read pattern. There are two
+typical cases: in the first case, processes read fixed-size chunks of
+data at a fixed distance from each other, while in the second case processes
+may read variable-size chunks at variable distances. The latter case occurs
+for example with QEMU, which splits the I/O generated by the guest into
+multiple chunks, and lets these chunks be served by a pool of cooperating
+processes, iteratively assigning the next chunk of I/O to the first
+available process. CFQ uses actual queue merging for the first type of
+processes, whereas it uses preemption to get a sequential read pattern out
+of the read requests performed by the second type of processes. In the end
+it uses two different mechanisms to achieve the same goal: boosting the
+throughput with interleaved I/O.
+
+This patch introduces Early Queue Merge (EQM), a unified mechanism to get a
+sequential read pattern with both types of processes. The main idea is
+checking newly arrived requests against the next request of the active queue
+both in case of actual request insert and in case of request merge. By doing
+so, both the types of processes can be handled by just merging their queues.
+EQM is then simpler and more compact than the pair of mechanisms used in
+CFQ.
+
+Finally, EQM also preserves the typical low-latency properties of BFQ, by
+properly restoring the weight-raising state of a queue when it gets back to
+a non-merged state.
+
+Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+---
+ block/bfq-iosched.c | 736 ++++++++++++++++++++++++++++++++++++----------------
+ block/bfq-sched.c | 28 --
+ block/bfq.h | 46 +++-
+ 3 files changed, 556 insertions(+), 254 deletions(-)
+
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+index 0a0891b..d1d8e67 100644
+--- a/block/bfq-iosched.c
++++ b/block/bfq-iosched.c
+@@ -571,6 +571,57 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+ return dur;
+ }
+
++static inline unsigned
++bfq_bfqq_cooperations(struct bfq_queue *bfqq)
++{
++ return bfqq->bic ? bfqq->bic->cooperations : 0;
++}
++
++static inline void
++bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ if (bic->saved_idle_window)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++ if (bic->saved_IO_bound)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ else
++ bfq_clear_bfqq_IO_bound(bfqq);
++ if (bic->wr_time_left && bfqq->bfqd->low_latency &&
++ bic->cooperations < bfqq->bfqd->bfq_coop_thresh) {
++ /*
++ * Start a weight raising period with the duration given by
++ * the wr_time_left snapshot.
++ */
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues++;
++ bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bic->wr_time_left;
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->entity.ioprio_changed = 1;
++ }
++ /*
++ * Clear wr_time_left to prevent bfq_bfqq_save_state() from
++ * getting confused about the queue's need of a weight-raising
++ * period.
++ */
++ bic->wr_time_left = 0;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
+ static void bfq_add_request(struct request *rq)
+ {
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+@@ -602,8 +653,11 @@ static void bfq_add_request(struct request *rq)
+
+ if (!bfq_bfqq_busy(bfqq)) {
+ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ bfq_bfqq_cooperations(bfqq) < bfqd->bfq_coop_thresh &&
+ time_is_before_jiffies(bfqq->soft_rt_next_start);
+- idle_for_long_time = time_is_before_jiffies(
++ idle_for_long_time = bfq_bfqq_cooperations(bfqq) <
++ bfqd->bfq_coop_thresh &&
++ time_is_before_jiffies(
+ bfqq->budget_timeout +
+ bfqd->bfq_wr_min_idle_time);
+ entity->budget = max_t(unsigned long, bfqq->max_budget,
+@@ -624,11 +678,20 @@ static void bfq_add_request(struct request *rq)
+ if (!bfqd->low_latency)
+ goto add_bfqq_busy;
+
++ if (bfq_bfqq_just_split(bfqq))
++ goto set_ioprio_changed;
++
+ /*
+- * If the queue is not being boosted and has been idle
+- * for enough time, start a weight-raising period
++ * If the queue:
++ * - is not being boosted,
++ * - has been idle for enough time,
++ * - is not a sync queue or is linked to a bfq_io_cq (it is
++ * shared "for its nature" or it is not shared and its
++ * requests have not been redirected to a shared queue)
++ * start a weight-raising period.
+ */
+- if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
++ (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+@@ -642,9 +705,11 @@ static void bfq_add_request(struct request *rq)
+ } else if (old_wr_coeff > 1) {
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+- else if (bfqq->wr_cur_max_time ==
+- bfqd->bfq_wr_rt_max_time &&
+- !soft_rt) {
++ else if (bfq_bfqq_cooperations(bfqq) >=
++ bfqd->bfq_coop_thresh ||
++ (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt)) {
+ bfqq->wr_coeff = 1;
+ bfq_log_bfqq(bfqd, bfqq,
+ "wrais ending at %lu, rais_max_time %u",
+@@ -660,18 +725,18 @@ static void bfq_add_request(struct request *rq)
+ /*
+ *
+ * The remaining weight-raising time is lower
+- * than bfqd->bfq_wr_rt_max_time, which
+- * means that the application is enjoying
+- * weight raising either because deemed soft-
+- * rt in the near past, or because deemed
+- * interactive a long ago. In both cases,
+- * resetting now the current remaining weight-
+- * raising time for the application to the
+- * weight-raising duration for soft rt
+- * applications would not cause any latency
+- * increase for the application (as the new
+- * duration would be higher than the remaining
+- * time).
++ * than bfqd->bfq_wr_rt_max_time, which means
++ * that the application is enjoying weight
++ * raising either because deemed soft-rt in
++ * the near past, or because deemed interactive
++ * long ago.
++ * In both cases, resetting now the current
++ * remaining weight-raising time for the
++ * application to the weight-raising duration
++ * for soft rt applications would not cause any
++ * latency increase for the application (as the
++ * new duration would be higher than the
++ * remaining time).
+ *
+ * In addition, the application is now meeting
+ * the requirements for being deemed soft rt.
+@@ -706,6 +771,7 @@ static void bfq_add_request(struct request *rq)
+ bfqd->bfq_wr_rt_max_time;
+ }
+ }
++set_ioprio_changed:
+ if (old_wr_coeff != bfqq->wr_coeff)
+ entity->ioprio_changed = 1;
+ add_bfqq_busy:
+@@ -918,90 +984,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
+ spin_unlock_irq(bfqd->queue->queue_lock);
+ }
+
+-static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+- struct bio *bio)
++static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+ {
+- struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_io_cq *bic;
+- struct bfq_queue *bfqq;
+-
+- /*
+- * Disallow merge of a sync bio into an async request.
+- */
+- if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+- return 0;
+-
+- /*
+- * Lookup the bfqq that this bio will be queued with. Allow
+- * merge only if rq is queued there.
+- * Queue lock is held here.
+- */
+- bic = bfq_bic_lookup(bfqd, current->io_context);
+- if (bic == NULL)
+- return 0;
+-
+- bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+- return bfqq == RQ_BFQQ(rq);
+-}
+-
+-static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (bfqq != NULL) {
+- bfq_mark_bfqq_must_alloc(bfqq);
+- bfq_mark_bfqq_budget_new(bfqq);
+- bfq_clear_bfqq_fifo_expire(bfqq);
+-
+- bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+-
+- bfq_log_bfqq(bfqd, bfqq,
+- "set_in_service_queue, cur-budget = %lu",
+- bfqq->entity.budget);
+- }
+-
+- bfqd->in_service_queue = bfqq;
+-}
+-
+-/*
+- * Get and set a new queue for service.
+- */
+-static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (!bfqq)
+- bfqq = bfq_get_next_queue(bfqd);
++ if (request)
++ return blk_rq_pos(io_struct);
+ else
+- bfq_get_next_queue_forced(bfqd, bfqq);
+-
+- __bfq_set_in_service_queue(bfqd, bfqq);
+- return bfqq;
++ return ((struct bio *)io_struct)->bi_iter.bi_sector;
+ }
+
+-static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
+- struct request *rq)
++static inline sector_t bfq_dist_from(sector_t pos1,
++ sector_t pos2)
+ {
+- if (blk_rq_pos(rq) >= bfqd->last_position)
+- return blk_rq_pos(rq) - bfqd->last_position;
++ if (pos1 >= pos2)
++ return pos1 - pos2;
+ else
+- return bfqd->last_position - blk_rq_pos(rq);
++ return pos2 - pos1;
+ }
+
+-/*
+- * Return true if bfqq has no request pending and rq is close enough to
+- * bfqd->last_position, or if rq is closer to bfqd->last_position than
+- * bfqq->next_rq
+- */
+-static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
++ sector_t sector)
+ {
+- return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++ return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
++ BFQQ_SEEK_THR;
+ }
+
+-static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+ {
+ struct rb_root *root = &bfqd->rq_pos_tree;
+ struct rb_node *parent, *node;
+ struct bfq_queue *__bfqq;
+- sector_t sector = bfqd->last_position;
+
+ if (RB_EMPTY_ROOT(root))
+ return NULL;
+@@ -1020,7 +1031,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ * next_request position).
+ */
+ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ if (blk_rq_pos(__bfqq->next_rq) < sector)
+@@ -1031,7 +1042,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ return NULL;
+
+ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ return NULL;
+@@ -1040,14 +1051,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ /*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+- * is closely cooperating with itself.
+- *
+- * We are assuming that cur_bfqq has dispatched at least one request,
+- * and that bfqd->last_position reflects a position on the disk associated
+- * with the I/O issued by cur_bfqq.
++ * is closely cooperating with itself
++ * sector - used as a reference point to search for a close queue
+ */
+ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+- struct bfq_queue *cur_bfqq)
++ struct bfq_queue *cur_bfqq,
++ sector_t sector)
+ {
+ struct bfq_queue *bfqq;
+
+@@ -1067,7 +1076,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ * working closely on the same area of the disk. In that case,
+ * we can group them together and don't waste time idling.
+ */
+- bfqq = bfqq_close(bfqd);
++ bfqq = bfqq_close(bfqd, sector);
+ if (bfqq == NULL || bfqq == cur_bfqq)
+ return NULL;
+
+@@ -1094,6 +1103,305 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ return bfqq;
+ }
+
++static struct bfq_queue *
++bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return NULL;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return NULL;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++
++ /*
++ * Merging is just a redirection: the requests of the process
++ * owning one of the two queues are redirected to the other queue.
++ * The latter queue, in its turn, is set as shared if this is the
++ * first time that the requests of some process are redirected to
++ * it.
++ *
++ * We redirect bfqq to new_bfqq and not the opposite, because we
++ * are in the context of the process owning bfqq, hence we have
++ * the io_cq of this process. So we can immediately configure this
++ * io_cq to redirect the requests of the process to new_bfqq.
++ *
++ * NOTE, even if new_bfqq coincides with the in-service queue, the
++ * io_cq of new_bfqq is not available, because, if the in-service
++ * queue is shared, bfqd->in_service_bic may not point to the
++ * io_cq of the in-service queue.
++ * Redirecting the requests of the process owning bfqq to the
++ * currently in-service queue is in any case the best option, as
++ * we feed the in-service queue with new requests close to the
++ * last request served and, by doing so, hopefully increase the
++ * throughput.
++ */
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ return new_bfqq;
++}
++
++/*
++ * Attempt to schedule a merge of bfqq with the currently in-service queue
++ * or with a close queue among the scheduled queues.
++ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
++ * structure otherwise.
++ */
++static struct bfq_queue *
++bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ void *io_struct, bool request)
++{
++ struct bfq_queue *in_service_bfqq, *new_bfqq;
++
++ if (bfqq->new_bfqq)
++ return bfqq->new_bfqq;
++
++ if (!io_struct)
++ return NULL;
++
++ in_service_bfqq = bfqd->in_service_queue;
++
++ if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
++ !bfqd->in_service_bic)
++ goto check_scheduled;
++
++ if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
++ goto check_scheduled;
++
++ if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
++ goto check_scheduled;
++
++ if (in_service_bfqq->entity.parent != bfqq->entity.parent)
++ goto check_scheduled;
++
++ if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
++ bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
++ new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
++ if (new_bfqq != NULL)
++ return new_bfqq; /* Merge with in-service queue */
++ }
++
++ /*
++ * Check whether there is a cooperator among currently scheduled
++ * queues. The only thing we need is that the bio/request is not
++ * NULL, as we need it to establish whether a cooperator exists.
++ */
++check_scheduled:
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq,
++ bfq_io_struct_pos(io_struct, request));
++ if (new_bfqq)
++ return bfq_setup_merge(bfqq, new_bfqq);
++
++ return NULL;
++}
++
++static inline void
++bfq_bfqq_save_state(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic == NULL, the queue is already shared or its requests
++ * have already been redirected to a shared queue; both idle window
++ * and weight raising state have already been saved. Do nothing.
++ */
++ if (bfqq->bic == NULL)
++ return;
++ if (bfqq->bic->wr_time_left)
++ /*
++ * This is the queue of a just-started process, and would
++ * deserve weight raising: we set wr_time_left to the full
++ * weight-raising duration to trigger weight-raising when
++ * and if the queue is split and the first request of the
++ * queue is enqueued.
++ */
++ bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
++ else if (bfqq->wr_coeff > 1) {
++ unsigned long wr_duration =
++ jiffies - bfqq->last_wr_start_finish;
++ /*
++ * It may happen that a queue's weight raising period lasts
++ * longer than its wr_cur_max_time, as weight raising is
++ * handled only when a request is enqueued or dispatched (it
++ * does not use any timer). If the weight raising period is
++ * about to end, don't save it.
++ */
++ if (bfqq->wr_cur_max_time <= wr_duration)
++ bfqq->bic->wr_time_left = 0;
++ else
++ bfqq->bic->wr_time_left =
++ bfqq->wr_cur_max_time - wr_duration;
++ /*
++ * The bfq_queue is becoming shared or the requests of the
++ * process owning the queue are being redirected to a shared
++ * queue. Stop the weight raising period of the queue, as in
++ * both cases it should not be owned by an interactive or
++ * soft real-time application.
++ */
++ bfq_bfqq_end_wr(bfqq);
++ } else
++ bfqq->bic->wr_time_left = 0;
++ bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
++ bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
++ bfqq->bic->cooperations++;
++ bfqq->bic->failed_cooperations = 0;
++}
++
++static inline void
++bfq_get_bic_reference(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic has a non-NULL value, the bic to which it belongs
++ * is about to begin using a shared bfq_queue.
++ */
++ if (bfqq->bic)
++ atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
++}
++
++static void
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (long unsigned)new_bfqq->pid);
++ /* Save weight raising and idle window of the merged queues */
++ bfq_bfqq_save_state(bfqq);
++ bfq_bfqq_save_state(new_bfqq);
++ if (bfq_bfqq_IO_bound(bfqq))
++ bfq_mark_bfqq_IO_bound(new_bfqq);
++ bfq_clear_bfqq_IO_bound(bfqq);
++ /*
++ * Grab a reference to the bic, to prevent it from being destroyed
++ * before being possibly touched by a bfq_split_bfqq().
++ */
++ bfq_get_bic_reference(bfqq);
++ bfq_get_bic_reference(new_bfqq);
++ /*
++ * Merge queues (that is, let bic redirect its requests to new_bfqq)
++ */
++ bic_set_bfqq(bic, new_bfqq, 1);
++ bfq_mark_bfqq_coop(new_bfqq);
++ /*
++ * new_bfqq now belongs to at least two bics (it is a shared queue):
++ * set new_bfqq->bic to NULL. bfqq either:
++ * - does not belong to any bic any more, and hence bfqq->bic must
++ * be set to NULL, or
++ * - is a queue whose owning bics have already been redirected to a
++ * different queue, hence the queue is destined to not belong to
++ * any bic soon and bfqq->bic is already NULL (therefore the next
++ * assignment causes no harm).
++ */
++ new_bfqq->bic = NULL;
++ bfqq->bic = NULL;
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_bfqq_increase_failed_cooperations(struct bfq_queue *bfqq)
++{
++ struct bfq_io_cq *bic = bfqq->bic;
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ if (bic && bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh) {
++ bic->failed_cooperations++;
++ if (bic->failed_cooperations >= bfqd->bfq_failed_cooperations)
++ bic->cooperations = 0;
++ }
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq, *new_bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ /*
++ * We take advantage of this function to perform an early merge
++ * of the queues of possible cooperating processes.
++ */
++ if (bfqq != NULL) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
++ if (new_bfqq != NULL) {
++ bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
++ /*
++ * If we get here, the bio will be queued in the
++ * shared queue, i.e., new_bfqq, so use new_bfqq
++ * to decide whether bio and rq can be merged.
++ */
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
+ /*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+@@ -1237,63 +1545,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+ return rq;
+ }
+
+-/*
+- * Must be called with the queue_lock held.
+- */
+-static int bfqq_process_refs(struct bfq_queue *bfqq)
+-{
+- int process_refs, io_refs;
+-
+- io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+- process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+- BUG_ON(process_refs < 0);
+- return process_refs;
+-}
+-
+-static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+-{
+- int process_refs, new_process_refs;
+- struct bfq_queue *__bfqq;
+-
+- /*
+- * If there are no process references on the new_bfqq, then it is
+- * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+- * may have dropped their last reference (not just their last process
+- * reference).
+- */
+- if (!bfqq_process_refs(new_bfqq))
+- return;
+-
+- /* Avoid a circular list and skip interim queue merges. */
+- while ((__bfqq = new_bfqq->new_bfqq)) {
+- if (__bfqq == bfqq)
+- return;
+- new_bfqq = __bfqq;
+- }
+-
+- process_refs = bfqq_process_refs(bfqq);
+- new_process_refs = bfqq_process_refs(new_bfqq);
+- /*
+- * If the process for the bfqq has gone away, there is no
+- * sense in merging the queues.
+- */
+- if (process_refs == 0 || new_process_refs == 0)
+- return;
+-
+- /*
+- * Merge in the direction of the lesser amount of work.
+- */
+- if (new_process_refs >= process_refs) {
+- bfqq->new_bfqq = new_bfqq;
+- atomic_add(process_refs, &new_bfqq->ref);
+- } else {
+- new_bfqq->new_bfqq = bfqq;
+- atomic_add(new_process_refs, &bfqq->ref);
+- }
+- bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+- new_bfqq->pid);
+-}
+-
+ static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+ {
+ struct bfq_entity *entity = &bfqq->entity;
+@@ -2011,7 +2262,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+ */
+ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ {
+- struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct bfq_queue *bfqq;
+ struct request *next_rq;
+ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+@@ -2021,17 +2272,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+- /*
+- * If another queue has a request waiting within our mean seek
+- * distance, let it run. The expire code will check for close
+- * cooperators and put the close queue at the front of the
+- * service tree. If possible, merge the expiring queue with the
+- * new bfqq.
+- */
+- new_bfqq = bfq_close_cooperator(bfqd, bfqq);
+- if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
+- bfq_setup_merge(bfqq, new_bfqq);
+-
+ if (bfq_may_expire_for_budg_timeout(bfqq) &&
+ !timer_pending(&bfqd->idle_slice_timer) &&
+ !bfq_bfqq_must_idle(bfqq))
+@@ -2070,10 +2310,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ bfq_clear_bfqq_wait_request(bfqq);
+ del_timer(&bfqd->idle_slice_timer);
+ }
+- if (new_bfqq == NULL)
+- goto keep_queue;
+- else
+- goto expire;
++ goto keep_queue;
+ }
+ }
+
+@@ -2082,40 +2319,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ * in flight (possibly waiting for a completion) or is idling for a
+ * new request, then keep it.
+ */
+- if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
+- (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ if (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+ bfqq = NULL;
+ goto keep_queue;
+- } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
+- /*
+- * Expiring the queue because there is a close cooperator,
+- * cancel timer.
+- */
+- bfq_clear_bfqq_wait_request(bfqq);
+- del_timer(&bfqd->idle_slice_timer);
+ }
+
+ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+ expire:
+ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+ new_queue:
+- bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfqq = bfq_set_in_service_queue(bfqd);
+ bfq_log(bfqd, "select_queue: new queue %d returned",
+ bfqq != NULL ? bfqq->pid : 0);
+ keep_queue:
+ return bfqq;
+ }
+
+-static void bfq_update_wr_data(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
++static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+ {
+- if (bfqq->wr_coeff > 1) { /* queue is being boosted */
+- struct bfq_entity *entity = &bfqq->entity;
+-
++ struct bfq_entity *entity = &bfqq->entity;
++ if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+ bfq_log_bfqq(bfqd, bfqq,
+ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+- jiffies_to_msecs(jiffies -
+- bfqq->last_wr_start_finish),
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+ jiffies_to_msecs(bfqq->wr_cur_max_time),
+ bfqq->wr_coeff,
+ bfqq->entity.weight, bfqq->entity.orig_weight);
+@@ -2124,11 +2351,15 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ entity->orig_weight * bfqq->wr_coeff);
+ if (entity->ioprio_changed)
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++
+ /*
+ * If too much time has elapsed from the beginning
+- * of this weight-raising, stop it.
++ * of this weight-raising period, or the queue has
++ * exceeded the acceptable number of cooperations,
++ * stop it.
+ */
+- if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ if (bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
++ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfqq->last_wr_start_finish = jiffies;
+ bfq_log_bfqq(bfqd, bfqq,
+@@ -2136,11 +2367,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ bfqq->last_wr_start_finish,
+ jiffies_to_msecs(bfqq->wr_cur_max_time));
+ bfq_bfqq_end_wr(bfqq);
+- __bfq_entity_update_weight_prio(
+- bfq_entity_service_tree(entity),
+- entity);
+ }
+ }
++ /* Update weight both if it must be raised and if it must be lowered */
++ if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
+ }
+
+ /*
+@@ -2377,6 +2610,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
+ struct bfq_io_cq *bic = icq_to_bic(icq);
+
+ bic->ttime.last_end_request = jiffies;
++ /*
++ * A newly created bic indicates that the process has just
++ * started doing I/O, and is probably mapping into memory its
++ * executable and libraries: it definitely needs weight raising.
++ * There is however the possibility that the process performs,
++ * for a while, I/O close to some other process. EQM intercepts
++ * this behavior and may merge the queue corresponding to the
++ * process with some other queue, BEFORE the weight of the queue
++ * is raised. Merged queues are not weight-raised (they are assumed
++ * to belong to processes that benefit only from high throughput).
++ * If the merge is basically the consequence of an accident, then
++ * the queue will be split soon and will get back its old weight.
++ * It is then important to write down somewhere that this queue
++ * does need weight raising, even if it did not make it to get its
++ * weight raised before being merged. To this end, we overload
++ * the field wr_time_left and assign 1 to it, to mark the queue
++ * as needing weight raising.
++ */
++ bic->wr_time_left = 1;
+ }
+
+ static void bfq_exit_icq(struct io_cq *icq)
+@@ -2390,6 +2642,13 @@ static void bfq_exit_icq(struct io_cq *icq)
+ }
+
+ if (bic->bfqq[BLK_RW_SYNC]) {
++ /*
++ * If the bic is using a shared queue, put the reference
++ * taken on the io_context when the bic started using a
++ * shared bfq_queue.
++ */
++ if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
++ put_io_context(icq->ioc);
+ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+ bic->bfqq[BLK_RW_SYNC] = NULL;
+ }
+@@ -2678,6 +2937,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
+ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+ return;
+
++ /* Idle window just restored, statistics are meaningless. */
++ if (bfq_bfqq_just_split(bfqq))
++ return;
++
+ enable_idle = bfq_bfqq_idle_window(bfqq);
+
+ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+@@ -2725,6 +2988,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+ !BFQQ_SEEKY(bfqq))
+ bfq_update_idle_window(bfqd, bfqq, bic);
++ bfq_clear_bfqq_just_split(bfqq);
+
+ bfq_log_bfqq(bfqd, bfqq,
+ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+@@ -2785,13 +3049,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ static void bfq_insert_request(struct request_queue *q, struct request *rq)
+ {
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
+
+ assert_spin_locked(bfqd->queue->queue_lock);
++
++ /*
++ * An unplug may trigger a requeue of a request from the device
++ * driver: make sure we are in process context while trying to
++ * merge two bfq_queues.
++ */
++ if (!in_interrupt()) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
++ if (new_bfqq != NULL) {
++ if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
++ new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
++ /*
++ * Release the request's reference to the old bfqq
++ * and make sure one is taken to the shared queue.
++ */
++ new_bfqq->allocated[rq_data_dir(rq)]++;
++ bfqq->allocated[rq_data_dir(rq)]--;
++ atomic_inc(&new_bfqq->ref);
++ bfq_put_queue(bfqq);
++ if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
++ bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
++ bfqq, new_bfqq);
++ rq->elv.priv[1] = new_bfqq;
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
+ bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+ bfq_add_request(rq);
+
++ /*
++ * Here a newly-created bfq_queue has already started a weight-raising
++ * period: clear raising_time_left to prevent bfq_bfqq_save_state()
++ * from assigning it a full weight-raising period. See the detailed
++ * comments about this field in bfq_init_icq().
++ */
++ if (bfqq->bic != NULL)
++ bfqq->bic->wr_time_left = 0;
+ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+ list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+@@ -2956,18 +3256,6 @@ static void bfq_put_request(struct request *rq)
+ }
+ }
+
+-static struct bfq_queue *
+-bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+- struct bfq_queue *bfqq)
+-{
+- bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+- (long unsigned)bfqq->new_bfqq->pid);
+- bic_set_bfqq(bic, bfqq->new_bfqq, 1);
+- bfq_mark_bfqq_coop(bfqq->new_bfqq);
+- bfq_put_queue(bfqq);
+- return bic_to_bfqq(bic, 1);
+-}
+-
+ /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+@@ -2976,6 +3264,9 @@ static struct bfq_queue *
+ bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+ {
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++
++ put_io_context(bic->icq.ioc);
++
+ if (bfqq_process_refs(bfqq) == 1) {
+ bfqq->pid = current->pid;
+ bfq_clear_bfqq_coop(bfqq);
+@@ -3004,6 +3295,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
+ struct bfq_queue *bfqq;
+ struct bfq_group *bfqg;
+ unsigned long flags;
++ bool split = false;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
+@@ -3022,24 +3314,14 @@ new_queue:
+ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
+ bic_set_bfqq(bic, bfqq, is_sync);
+ } else {
+- /*
+- * If the queue was seeky for too long, break it apart.
+- */
++ /* If the queue was seeky for too long, break it apart. */
+ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+ bfqq = bfq_split_bfqq(bic, bfqq);
++ split = true;
+ if (!bfqq)
+ goto new_queue;
+ }
+-
+- /*
+- * Check to see if this queue is scheduled to merge with
+- * another closely cooperating queue. The merging of queues
+- * happens here as it must be done in process context.
+- * The reference on new_bfqq was taken in merge_bfqqs.
+- */
+- if (bfqq->new_bfqq != NULL)
+- bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
+ }
+
+ bfqq->allocated[rw]++;
+@@ -3050,6 +3332,26 @@ new_queue:
+ rq->elv.priv[0] = bic;
+ rq->elv.priv[1] = bfqq;
+
++ /*
++ * If a bfq_queue has only one process reference, it is owned
++ * by only one bfq_io_cq: we can set the bic field of the
++ * bfq_queue to the address of that structure. Also, if the
++ * queue has just been split, mark a flag so that the
++ * information is available to the other scheduler hooks.
++ */
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->bic = bic;
++ if (split) {
++ bfq_mark_bfqq_just_split(bfqq);
++ /*
++ * If the queue has just been split from a shared
++ * queue, restore the idle window and the possible
++ * weight raising period.
++ */
++ bfq_bfqq_resume_state(bfqq, bic);
++ }
++ }
++
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return 0;
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+index c4831b7..546a254 100644
+--- a/block/bfq-sched.c
++++ b/block/bfq-sched.c
+@@ -1084,34 +1084,6 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+ return bfqq;
+ }
+
+-/*
+- * Forced extraction of the given queue.
+- */
+-static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- struct bfq_entity *entity;
+- struct bfq_sched_data *sd;
+-
+- BUG_ON(bfqd->in_service_queue != NULL);
+-
+- entity = &bfqq->entity;
+- /*
+- * Bubble up extraction/update from the leaf to the root.
+- */
+- for_each_entity(entity) {
+- sd = entity->sched_data;
+- bfq_update_budget(entity);
+- bfq_update_vtime(bfq_entity_service_tree(entity));
+- bfq_active_extract(bfq_entity_service_tree(entity), entity);
+- sd->in_service_entity = entity;
+- sd->next_in_service = NULL;
+- entity->service = 0;
+- }
+-
+- return;
+-}
+-
+ static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+ {
+ if (bfqd->in_service_bic != NULL) {
+diff --git a/block/bfq.h b/block/bfq.h
+index a83e69d..ebbd040 100644
+--- a/block/bfq.h
++++ b/block/bfq.h
+@@ -215,18 +215,21 @@ struct bfq_group;
+ * idle @bfq_queue with no outstanding requests, then
+ * the task associated with the queue it is deemed as
+ * soft real-time (see the comments to the function
+- * bfq_bfqq_softrt_next_start()).
++ * bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ * idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ * since the last transition from idle to
+ * backlogged
++ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
++ * queue is shared
+ *
+- * A bfq_queue is a leaf request queue; it can be associated with an io_context
+- * or more, if it is async or shared between cooperating processes. @cgroup
+- * holds a reference to the cgroup, to be sure that it does not disappear while
+- * a bfqq still references it (mostly to avoid races between request issuing and
+- * task migration followed by cgroup destruction).
++ * A bfq_queue is a leaf request queue; it can be associated with an
++ * io_context or more, if it is async or shared between cooperating
++ * processes. @cgroup holds a reference to the cgroup, to be sure that it
++ * does not disappear while a bfqq still references it (mostly to avoid
++ * races between request issuing and task migration followed by cgroup
++ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+ struct bfq_queue {
+@@ -264,6 +267,7 @@ struct bfq_queue {
+ unsigned int requests_within_timer;
+
+ pid_t pid;
++ struct bfq_io_cq *bic;
+
+ /* weight-raising fields */
+ unsigned long wr_cur_max_time;
+@@ -293,12 +297,34 @@ struct bfq_ttime {
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
++ * @wr_time_left: snapshot of the time left before weight raising ends
++ * for the sync queue associated to this process; this
++ * snapshot is taken to remember this value while the weight
++ * raising is suspended because the queue is merged with a
++ * shared queue, and is used to set @raising_cur_max_time
++ * when the queue is split from the shared queue and its
++ * weight is raised again
++ * @saved_idle_window: same purpose as the previous field for the idle
++ * window
++ * @saved_IO_bound: same purpose as the previous two fields for the I/O
++ * bound classification of a queue
++ * @cooperations: counter of consecutive successful queue merges underwent
++ * by any of the process' @bfq_queues
++ * @failed_cooperations: counter of consecutive failed queue merges of any
++ * of the process' @bfq_queues
+ */
+ struct bfq_io_cq {
+ struct io_cq icq; /* must be the first member */
+ struct bfq_queue *bfqq[2];
+ struct bfq_ttime ttime;
+ int ioprio;
++
++ unsigned int wr_time_left;
++ unsigned int saved_idle_window;
++ unsigned int saved_IO_bound;
++
++ unsigned int cooperations;
++ unsigned int failed_cooperations;
+ };
+
+ enum bfq_device_speed {
+@@ -511,7 +537,7 @@ enum bfqq_state_flags {
+ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
+ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
+ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
+- BFQ_BFQQ_FLAG_IO_bound, /*
++ BFQ_BFQQ_FLAG_IO_bound, /*
+ * bfqq has timed-out at least once
+ * having consumed at most 2/10 of
+ * its budget
+@@ -520,12 +546,13 @@ enum bfqq_state_flags {
+ * bfqq has proved to be slow and
+ * seeky until budget timeout
+ */
+- BFQ_BFQQ_FLAG_softrt_update, /*
++ BFQ_BFQQ_FLAG_softrt_update, /*
+ * may need softrt-next-start
+ * update
+ */
+ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
+- BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be splitted */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++ BFQ_BFQQ_FLAG_just_split, /* queue has just been split */
+ };
+
+ #define BFQ_BFQQ_FNS(name) \
+@@ -554,6 +581,7 @@ BFQ_BFQQ_FNS(IO_bound);
+ BFQ_BFQQ_FNS(constantly_seeky);
+ BFQ_BFQQ_FNS(coop);
+ BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(just_split);
+ BFQ_BFQQ_FNS(softrt_update);
+ #undef BFQ_BFQQ_FNS
+
+--
+2.0.3
+
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
2014-08-19 11:44 Mike Pagano
@ 2014-08-14 11:51 ` Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-08-14 11:51 UTC (permalink / raw
To: gentoo-commits
commit: a2032151afc204dbfddee6acc420e09c3295ece5
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Thu Aug 14 11:51:26 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Thu Aug 14 11:51:26 2014 +0000
URL: http://git.overlays.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=a2032151
Linux patch 3.16.1
---
0000_README | 3 +
1000_linux-3.16.1.patch | 507 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 510 insertions(+)
diff --git a/0000_README b/0000_README
index a6ec2e6..f57085e 100644
--- a/0000_README
+++ b/0000_README
@@ -42,6 +42,9 @@ EXPERIMENTAL
Individual Patch Descriptions:
--------------------------------------------------------------------------
+Patch: 1000_linux-3.16.1.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.1
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
diff --git a/1000_linux-3.16.1.patch b/1000_linux-3.16.1.patch
new file mode 100644
index 0000000..29ac346
--- /dev/null
+++ b/1000_linux-3.16.1.patch
@@ -0,0 +1,507 @@
+diff --git a/Makefile b/Makefile
+index d0901b46b4bf..87663a2d1d10 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,8 +1,8 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 0
++SUBLEVEL = 1
+ EXTRAVERSION =
+-NAME = Shuffling Zombie Juror
++NAME = Museum of Fishiegoodies
+
+ # *DOCUMENTATION*
+ # To see a list of typical targets execute "make help"
+diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
+index 816d8202fa0a..dea1cfa2122b 100644
+--- a/arch/sparc/include/asm/tlbflush_64.h
++++ b/arch/sparc/include/asm/tlbflush_64.h
+@@ -34,6 +34,8 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
+ {
+ }
+
++void flush_tlb_kernel_range(unsigned long start, unsigned long end);
++
+ #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+
+ void flush_tlb_pending(void);
+@@ -48,11 +50,6 @@ void __flush_tlb_kernel_range(unsigned long start, unsigned long end);
+
+ #ifndef CONFIG_SMP
+
+-#define flush_tlb_kernel_range(start,end) \
+-do { flush_tsb_kernel_range(start,end); \
+- __flush_tlb_kernel_range(start,end); \
+-} while (0)
+-
+ static inline void global_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
+ {
+ __flush_tlb_page(CTX_HWBITS(mm->context), vaddr);
+@@ -63,11 +60,6 @@ static inline void global_flush_tlb_page(struct mm_struct *mm, unsigned long vad
+ void smp_flush_tlb_kernel_range(unsigned long start, unsigned long end);
+ void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr);
+
+-#define flush_tlb_kernel_range(start, end) \
+-do { flush_tsb_kernel_range(start,end); \
+- smp_flush_tlb_kernel_range(start, end); \
+-} while (0)
+-
+ #define global_flush_tlb_page(mm, vaddr) \
+ smp_flush_tlb_page(mm, vaddr)
+
+diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
+index e01d75d40329..66dacd56bb10 100644
+--- a/arch/sparc/kernel/ldc.c
++++ b/arch/sparc/kernel/ldc.c
+@@ -1336,7 +1336,7 @@ int ldc_connect(struct ldc_channel *lp)
+ if (!(lp->flags & LDC_FLAG_ALLOCED_QUEUES) ||
+ !(lp->flags & LDC_FLAG_REGISTERED_QUEUES) ||
+ lp->hs_state != LDC_HS_OPEN)
+- err = -EINVAL;
++ err = ((lp->hs_state > LDC_HS_OPEN) ? 0 : -EINVAL);
+ else
+ err = start_handshake(lp);
+
+diff --git a/arch/sparc/math-emu/math_32.c b/arch/sparc/math-emu/math_32.c
+index aa4d55b0bdf0..5ce8f2f64604 100644
+--- a/arch/sparc/math-emu/math_32.c
++++ b/arch/sparc/math-emu/math_32.c
+@@ -499,7 +499,7 @@ static int do_one_mathemu(u32 insn, unsigned long *pfsr, unsigned long *fregs)
+ case 0: fsr = *pfsr;
+ if (IR == -1) IR = 2;
+ /* fcc is always fcc0 */
+- fsr &= ~0xc00; fsr |= (IR << 10); break;
++ fsr &= ~0xc00; fsr |= (IR << 10);
+ *pfsr = fsr;
+ break;
+ case 1: rd->s = IR; break;
+diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
+index 16b58ff11e65..2cfb0f25e0ed 100644
+--- a/arch/sparc/mm/init_64.c
++++ b/arch/sparc/mm/init_64.c
+@@ -351,6 +351,10 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *
+
+ mm = vma->vm_mm;
+
++ /* Don't insert a non-valid PTE into the TSB, we'll deadlock. */
++ if (!pte_accessible(mm, pte))
++ return;
++
+ spin_lock_irqsave(&mm->context.lock, flags);
+
+ #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE)
+@@ -2619,6 +2623,10 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
+
+ pte = pmd_val(entry);
+
++ /* Don't insert a non-valid PMD into the TSB, we'll deadlock. */
++ if (!(pte & _PAGE_VALID))
++ return;
++
+ /* We are fabricating 8MB pages using 4MB real hw pages. */
+ pte |= (addr & (1UL << REAL_HPAGE_SHIFT));
+
+@@ -2699,3 +2707,26 @@ void hugetlb_setup(struct pt_regs *regs)
+ }
+ }
+ #endif
++
++#ifdef CONFIG_SMP
++#define do_flush_tlb_kernel_range smp_flush_tlb_kernel_range
++#else
++#define do_flush_tlb_kernel_range __flush_tlb_kernel_range
++#endif
++
++void flush_tlb_kernel_range(unsigned long start, unsigned long end)
++{
++ if (start < HI_OBP_ADDRESS && end > LOW_OBP_ADDRESS) {
++ if (start < LOW_OBP_ADDRESS) {
++ flush_tsb_kernel_range(start, LOW_OBP_ADDRESS);
++ do_flush_tlb_kernel_range(start, LOW_OBP_ADDRESS);
++ }
++ if (end > HI_OBP_ADDRESS) {
++ flush_tsb_kernel_range(end, HI_OBP_ADDRESS);
++ do_flush_tlb_kernel_range(end, HI_OBP_ADDRESS);
++ }
++ } else {
++ flush_tsb_kernel_range(start, end);
++ do_flush_tlb_kernel_range(start, end);
++ }
++}
+diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
+index 8afa579e7c40..a3dd5dc64f4c 100644
+--- a/drivers/net/ethernet/broadcom/tg3.c
++++ b/drivers/net/ethernet/broadcom/tg3.c
+@@ -7830,17 +7830,18 @@ static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
+
+ static netdev_tx_t tg3_start_xmit(struct sk_buff *, struct net_device *);
+
+-/* Use GSO to workaround a rare TSO bug that may be triggered when the
+- * TSO header is greater than 80 bytes.
++/* Use GSO to workaround all TSO packets that meet HW bug conditions
++ * indicated in tg3_tx_frag_set()
+ */
+-static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
++static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
++ struct netdev_queue *txq, struct sk_buff *skb)
+ {
+ struct sk_buff *segs, *nskb;
+ u32 frag_cnt_est = skb_shinfo(skb)->gso_segs * 3;
+
+ /* Estimate the number of fragments in the worst case */
+- if (unlikely(tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)) {
+- netif_stop_queue(tp->dev);
++ if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
++ netif_tx_stop_queue(txq);
+
+ /* netif_tx_stop_queue() must be done before checking
+ * checking tx index in tg3_tx_avail() below, because in
+@@ -7848,13 +7849,14 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
+ * netif_tx_queue_stopped().
+ */
+ smp_mb();
+- if (tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)
++ if (tg3_tx_avail(tnapi) <= frag_cnt_est)
+ return NETDEV_TX_BUSY;
+
+- netif_wake_queue(tp->dev);
++ netif_tx_wake_queue(txq);
+ }
+
+- segs = skb_gso_segment(skb, tp->dev->features & ~(NETIF_F_TSO | NETIF_F_TSO6));
++ segs = skb_gso_segment(skb, tp->dev->features &
++ ~(NETIF_F_TSO | NETIF_F_TSO6));
+ if (IS_ERR(segs) || !segs)
+ goto tg3_tso_bug_end;
+
+@@ -7930,7 +7932,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ if (!skb_is_gso_v6(skb)) {
+ if (unlikely((ETH_HLEN + hdr_len) > 80) &&
+ tg3_flag(tp, TSO_BUG))
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+
+ ip_csum = iph->check;
+ ip_tot_len = iph->tot_len;
+@@ -8061,7 +8063,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ iph->tot_len = ip_tot_len;
+ }
+ tcph->check = tcp_csum;
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+ }
+
+ /* If the workaround fails due to memory/mapping
+diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
+index 3a77f9ead004..556aab75f490 100644
+--- a/drivers/net/ethernet/brocade/bna/bnad.c
++++ b/drivers/net/ethernet/brocade/bna/bnad.c
+@@ -600,9 +600,9 @@ bnad_cq_process(struct bnad *bnad, struct bna_ccb *ccb, int budget)
+ prefetch(bnad->netdev);
+
+ cq = ccb->sw_q;
+- cmpl = &cq[ccb->producer_index];
+
+ while (packets < budget) {
++ cmpl = &cq[ccb->producer_index];
+ if (!cmpl->valid)
+ break;
+ /* The 'valid' field is set by the adapter, only after writing
+diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
+index 958df383068a..ef8a5c20236a 100644
+--- a/drivers/net/macvlan.c
++++ b/drivers/net/macvlan.c
+@@ -646,6 +646,7 @@ static int macvlan_init(struct net_device *dev)
+ (lowerdev->state & MACVLAN_STATE_MASK);
+ dev->features = lowerdev->features & MACVLAN_FEATURES;
+ dev->features |= ALWAYS_ON_FEATURES;
++ dev->vlan_features = lowerdev->vlan_features & MACVLAN_FEATURES;
+ dev->gso_max_size = lowerdev->gso_max_size;
+ dev->iflink = lowerdev->ifindex;
+ dev->hard_header_len = lowerdev->hard_header_len;
+diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
+index 203651ebccb0..4eaadcfcb0fe 100644
+--- a/drivers/net/phy/mdio_bus.c
++++ b/drivers/net/phy/mdio_bus.c
+@@ -255,7 +255,6 @@ int mdiobus_register(struct mii_bus *bus)
+
+ bus->dev.parent = bus->parent;
+ bus->dev.class = &mdio_bus_class;
+- bus->dev.driver = bus->parent->driver;
+ bus->dev.groups = NULL;
+ dev_set_name(&bus->dev, "%s", bus->id);
+
+diff --git a/drivers/sbus/char/bbc_envctrl.c b/drivers/sbus/char/bbc_envctrl.c
+index 160e7510aca6..0787b9756165 100644
+--- a/drivers/sbus/char/bbc_envctrl.c
++++ b/drivers/sbus/char/bbc_envctrl.c
+@@ -452,6 +452,9 @@ static void attach_one_temp(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!tp)
+ return;
+
++ INIT_LIST_HEAD(&tp->bp_list);
++ INIT_LIST_HEAD(&tp->glob_list);
++
+ tp->client = bbc_i2c_attach(bp, op);
+ if (!tp->client) {
+ kfree(tp);
+@@ -497,6 +500,9 @@ static void attach_one_fan(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!fp)
+ return;
+
++ INIT_LIST_HEAD(&fp->bp_list);
++ INIT_LIST_HEAD(&fp->glob_list);
++
+ fp->client = bbc_i2c_attach(bp, op);
+ if (!fp->client) {
+ kfree(fp);
+diff --git a/drivers/sbus/char/bbc_i2c.c b/drivers/sbus/char/bbc_i2c.c
+index c7763e482eb2..812b5f0361b6 100644
+--- a/drivers/sbus/char/bbc_i2c.c
++++ b/drivers/sbus/char/bbc_i2c.c
+@@ -300,13 +300,18 @@ static struct bbc_i2c_bus * attach_one_i2c(struct platform_device *op, int index
+ if (!bp)
+ return NULL;
+
++ INIT_LIST_HEAD(&bp->temps);
++ INIT_LIST_HEAD(&bp->fans);
++
+ bp->i2c_control_regs = of_ioremap(&op->resource[0], 0, 0x2, "bbc_i2c_regs");
+ if (!bp->i2c_control_regs)
+ goto fail;
+
+- bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
+- if (!bp->i2c_bussel_reg)
+- goto fail;
++ if (op->num_resources == 2) {
++ bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
++ if (!bp->i2c_bussel_reg)
++ goto fail;
++ }
+
+ bp->waiting = 0;
+ init_waitqueue_head(&bp->wq);
+diff --git a/drivers/tty/serial/sunsab.c b/drivers/tty/serial/sunsab.c
+index 2f57df9a71d9..a1e09c0d46f2 100644
+--- a/drivers/tty/serial/sunsab.c
++++ b/drivers/tty/serial/sunsab.c
+@@ -157,6 +157,15 @@ receive_chars(struct uart_sunsab_port *up,
+ (up->port.line == up->port.cons->index))
+ saw_console_brk = 1;
+
++ if (count == 0) {
++ if (unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) {
++ stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR |
++ SAB82532_ISR0_FERR);
++ up->port.icount.brk++;
++ uart_handle_break(&up->port);
++ }
++ }
++
+ for (i = 0; i < count; i++) {
+ unsigned char ch = buf[i], flag;
+
+diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
+index a4daf9eb8562..8dd8cab88b87 100644
+--- a/include/net/ip_tunnels.h
++++ b/include/net/ip_tunnels.h
+@@ -40,6 +40,7 @@ struct ip_tunnel_prl_entry {
+
+ struct ip_tunnel_dst {
+ struct dst_entry __rcu *dst;
++ __be32 saddr;
+ };
+
+ struct ip_tunnel {
+diff --git a/lib/iovec.c b/lib/iovec.c
+index 7a7c2da4cddf..df3abd1eaa4a 100644
+--- a/lib/iovec.c
++++ b/lib/iovec.c
+@@ -85,6 +85,10 @@ EXPORT_SYMBOL(memcpy_toiovecend);
+ int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
+ int offset, int len)
+ {
++ /* No data? Done! */
++ if (len == 0)
++ return 0;
++
+ /* Skip over the finished iovecs */
+ while (offset >= iov->iov_len) {
+ offset -= iov->iov_len;
+diff --git a/net/batman-adv/fragmentation.c b/net/batman-adv/fragmentation.c
+index f14e54a05691..022d18ab27a6 100644
+--- a/net/batman-adv/fragmentation.c
++++ b/net/batman-adv/fragmentation.c
+@@ -128,6 +128,7 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ {
+ struct batadv_frag_table_entry *chain;
+ struct batadv_frag_list_entry *frag_entry_new = NULL, *frag_entry_curr;
++ struct batadv_frag_list_entry *frag_entry_last = NULL;
+ struct batadv_frag_packet *frag_packet;
+ uint8_t bucket;
+ uint16_t seqno, hdr_size = sizeof(struct batadv_frag_packet);
+@@ -180,11 +181,14 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ ret = true;
+ goto out;
+ }
++
++ /* store current entry because it could be the last in list */
++ frag_entry_last = frag_entry_curr;
+ }
+
+- /* Reached the end of the list, so insert after 'frag_entry_curr'. */
+- if (likely(frag_entry_curr)) {
+- hlist_add_after(&frag_entry_curr->list, &frag_entry_new->list);
++ /* Reached the end of the list, so insert after 'frag_entry_last'. */
++ if (likely(frag_entry_last)) {
++ hlist_add_after(&frag_entry_last->list, &frag_entry_new->list);
+ chain->size += skb->len - hdr_size;
+ chain->timestamp = jiffies;
+ ret = true;
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..58ff88edbefd 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -2976,9 +2976,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
+ tail = nskb;
+
+ __copy_skb_header(nskb, head_skb);
+- nskb->mac_len = head_skb->mac_len;
+
+ skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
++ skb_reset_mac_len(nskb);
+
+ skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
+ nskb->data - tnl_hlen,
+diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
+index 6f9de61dce5f..45920d928341 100644
+--- a/net/ipv4/ip_tunnel.c
++++ b/net/ipv4/ip_tunnel.c
+@@ -69,23 +69,25 @@ static unsigned int ip_tunnel_hash(__be32 key, __be32 remote)
+ }
+
+ static void __tunnel_dst_set(struct ip_tunnel_dst *idst,
+- struct dst_entry *dst)
++ struct dst_entry *dst, __be32 saddr)
+ {
+ struct dst_entry *old_dst;
+
+ dst_clone(dst);
+ old_dst = xchg((__force struct dst_entry **)&idst->dst, dst);
+ dst_release(old_dst);
++ idst->saddr = saddr;
+ }
+
+-static void tunnel_dst_set(struct ip_tunnel *t, struct dst_entry *dst)
++static void tunnel_dst_set(struct ip_tunnel *t,
++ struct dst_entry *dst, __be32 saddr)
+ {
+- __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst);
++ __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst, saddr);
+ }
+
+ static void tunnel_dst_reset(struct ip_tunnel *t)
+ {
+- tunnel_dst_set(t, NULL);
++ tunnel_dst_set(t, NULL, 0);
+ }
+
+ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+@@ -93,20 +95,25 @@ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+ int i;
+
+ for_each_possible_cpu(i)
+- __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL);
++ __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL, 0);
+ }
+ EXPORT_SYMBOL(ip_tunnel_dst_reset_all);
+
+-static struct rtable *tunnel_rtable_get(struct ip_tunnel *t, u32 cookie)
++static struct rtable *tunnel_rtable_get(struct ip_tunnel *t,
++ u32 cookie, __be32 *saddr)
+ {
++ struct ip_tunnel_dst *idst;
+ struct dst_entry *dst;
+
+ rcu_read_lock();
+- dst = rcu_dereference(this_cpu_ptr(t->dst_cache)->dst);
++ idst = this_cpu_ptr(t->dst_cache);
++ dst = rcu_dereference(idst->dst);
+ if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+ dst = NULL;
+ if (dst) {
+- if (dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
++ if (!dst->obsolete || dst->ops->check(dst, cookie)) {
++ *saddr = idst->saddr;
++ } else {
+ tunnel_dst_reset(t);
+ dst_release(dst);
+ dst = NULL;
+@@ -367,7 +374,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
+
+ if (!IS_ERR(rt)) {
+ tdev = rt->dst.dev;
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ ip_rt_put(rt);
+ }
+ if (dev->type != ARPHRD_ETHER)
+@@ -610,7 +617,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ init_tunnel_flow(&fl4, protocol, dst, tnl_params->saddr,
+ tunnel->parms.o_key, RT_TOS(tos), tunnel->parms.link);
+
+- rt = connected ? tunnel_rtable_get(tunnel, 0) : NULL;
++ rt = connected ? tunnel_rtable_get(tunnel, 0, &fl4.saddr) : NULL;
+
+ if (!rt) {
+ rt = ip_route_output_key(tunnel->net, &fl4);
+@@ -620,7 +627,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ goto tx_error;
+ }
+ if (connected)
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ }
+
+ if (rt->dst.dev == dev) {
+diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp_vegas.c
+index 9a5e05f27f4f..b40ad897f945 100644
+--- a/net/ipv4/tcp_vegas.c
++++ b/net/ipv4/tcp_vegas.c
+@@ -218,7 +218,8 @@ static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+ * This is:
+ * (actual rate in segments) * baseRTT
+ */
+- target_cwnd = tp->snd_cwnd * vegas->baseRTT / rtt;
++ target_cwnd = (u64)tp->snd_cwnd * vegas->baseRTT;
++ do_div(target_cwnd, rtt);
+
+ /* Calculate the difference between the window we had,
+ * and the window we would like to have. This quantity
+diff --git a/net/ipv4/tcp_veno.c b/net/ipv4/tcp_veno.c
+index 27b9825753d1..8276977d2c85 100644
+--- a/net/ipv4/tcp_veno.c
++++ b/net/ipv4/tcp_veno.c
+@@ -144,7 +144,7 @@ static void tcp_veno_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+
+ rtt = veno->minrtt;
+
+- target_cwnd = (tp->snd_cwnd * veno->basertt);
++ target_cwnd = (u64)tp->snd_cwnd * veno->basertt;
+ target_cwnd <<= V_PARAM_SHIFT;
+ do_div(target_cwnd, rtt);
+
+diff --git a/net/sctp/output.c b/net/sctp/output.c
+index 01ab8e0723f0..407ae2bf97b0 100644
+--- a/net/sctp/output.c
++++ b/net/sctp/output.c
+@@ -599,7 +599,7 @@ out:
+ return err;
+ no_route:
+ kfree_skb(nskb);
+- IP_INC_STATS_BH(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
++ IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
+
+ /* FIXME: Returning the 'err' will effect all the associations
+ * associated with a socket, although only one of the paths of the
++ }
++}
+diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
+index 8afa579e7c40..a3dd5dc64f4c 100644
+--- a/drivers/net/ethernet/broadcom/tg3.c
++++ b/drivers/net/ethernet/broadcom/tg3.c
+@@ -7830,17 +7830,18 @@ static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
+
+ static netdev_tx_t tg3_start_xmit(struct sk_buff *, struct net_device *);
+
+-/* Use GSO to workaround a rare TSO bug that may be triggered when the
+- * TSO header is greater than 80 bytes.
++/* Use GSO to workaround all TSO packets that meet HW bug conditions
++ * indicated in tg3_tx_frag_set()
+ */
+-static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
++static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
++ struct netdev_queue *txq, struct sk_buff *skb)
+ {
+ struct sk_buff *segs, *nskb;
+ u32 frag_cnt_est = skb_shinfo(skb)->gso_segs * 3;
+
+ /* Estimate the number of fragments in the worst case */
+- if (unlikely(tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)) {
+- netif_stop_queue(tp->dev);
++ if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
++ netif_tx_stop_queue(txq);
+
+ /* netif_tx_stop_queue() must be done before checking
+ * checking tx index in tg3_tx_avail() below, because in
+@@ -7848,13 +7849,14 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
+ * netif_tx_queue_stopped().
+ */
+ smp_mb();
+- if (tg3_tx_avail(&tp->napi[0]) <= frag_cnt_est)
++ if (tg3_tx_avail(tnapi) <= frag_cnt_est)
+ return NETDEV_TX_BUSY;
+
+- netif_wake_queue(tp->dev);
++ netif_tx_wake_queue(txq);
+ }
+
+- segs = skb_gso_segment(skb, tp->dev->features & ~(NETIF_F_TSO | NETIF_F_TSO6));
++ segs = skb_gso_segment(skb, tp->dev->features &
++ ~(NETIF_F_TSO | NETIF_F_TSO6));
+ if (IS_ERR(segs) || !segs)
+ goto tg3_tso_bug_end;
+
+@@ -7930,7 +7932,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ if (!skb_is_gso_v6(skb)) {
+ if (unlikely((ETH_HLEN + hdr_len) > 80) &&
+ tg3_flag(tp, TSO_BUG))
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+
+ ip_csum = iph->check;
+ ip_tot_len = iph->tot_len;
+@@ -8061,7 +8063,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ iph->tot_len = ip_tot_len;
+ }
+ tcph->check = tcp_csum;
+- return tg3_tso_bug(tp, skb);
++ return tg3_tso_bug(tp, tnapi, txq, skb);
+ }
+
+ /* If the workaround fails due to memory/mapping
+diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
+index 3a77f9ead004..556aab75f490 100644
+--- a/drivers/net/ethernet/brocade/bna/bnad.c
++++ b/drivers/net/ethernet/brocade/bna/bnad.c
+@@ -600,9 +600,9 @@ bnad_cq_process(struct bnad *bnad, struct bna_ccb *ccb, int budget)
+ prefetch(bnad->netdev);
+
+ cq = ccb->sw_q;
+- cmpl = &cq[ccb->producer_index];
+
+ while (packets < budget) {
++ cmpl = &cq[ccb->producer_index];
+ if (!cmpl->valid)
+ break;
+ /* The 'valid' field is set by the adapter, only after writing
+diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
+index 958df383068a..ef8a5c20236a 100644
+--- a/drivers/net/macvlan.c
++++ b/drivers/net/macvlan.c
+@@ -646,6 +646,7 @@ static int macvlan_init(struct net_device *dev)
+ (lowerdev->state & MACVLAN_STATE_MASK);
+ dev->features = lowerdev->features & MACVLAN_FEATURES;
+ dev->features |= ALWAYS_ON_FEATURES;
++ dev->vlan_features = lowerdev->vlan_features & MACVLAN_FEATURES;
+ dev->gso_max_size = lowerdev->gso_max_size;
+ dev->iflink = lowerdev->ifindex;
+ dev->hard_header_len = lowerdev->hard_header_len;
+diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
+index 203651ebccb0..4eaadcfcb0fe 100644
+--- a/drivers/net/phy/mdio_bus.c
++++ b/drivers/net/phy/mdio_bus.c
+@@ -255,7 +255,6 @@ int mdiobus_register(struct mii_bus *bus)
+
+ bus->dev.parent = bus->parent;
+ bus->dev.class = &mdio_bus_class;
+- bus->dev.driver = bus->parent->driver;
+ bus->dev.groups = NULL;
+ dev_set_name(&bus->dev, "%s", bus->id);
+
+diff --git a/drivers/sbus/char/bbc_envctrl.c b/drivers/sbus/char/bbc_envctrl.c
+index 160e7510aca6..0787b9756165 100644
+--- a/drivers/sbus/char/bbc_envctrl.c
++++ b/drivers/sbus/char/bbc_envctrl.c
+@@ -452,6 +452,9 @@ static void attach_one_temp(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!tp)
+ return;
+
++ INIT_LIST_HEAD(&tp->bp_list);
++ INIT_LIST_HEAD(&tp->glob_list);
++
+ tp->client = bbc_i2c_attach(bp, op);
+ if (!tp->client) {
+ kfree(tp);
+@@ -497,6 +500,9 @@ static void attach_one_fan(struct bbc_i2c_bus *bp, struct platform_device *op,
+ if (!fp)
+ return;
+
++ INIT_LIST_HEAD(&fp->bp_list);
++ INIT_LIST_HEAD(&fp->glob_list);
++
+ fp->client = bbc_i2c_attach(bp, op);
+ if (!fp->client) {
+ kfree(fp);
+diff --git a/drivers/sbus/char/bbc_i2c.c b/drivers/sbus/char/bbc_i2c.c
+index c7763e482eb2..812b5f0361b6 100644
+--- a/drivers/sbus/char/bbc_i2c.c
++++ b/drivers/sbus/char/bbc_i2c.c
+@@ -300,13 +300,18 @@ static struct bbc_i2c_bus * attach_one_i2c(struct platform_device *op, int index
+ if (!bp)
+ return NULL;
+
++ INIT_LIST_HEAD(&bp->temps);
++ INIT_LIST_HEAD(&bp->fans);
++
+ bp->i2c_control_regs = of_ioremap(&op->resource[0], 0, 0x2, "bbc_i2c_regs");
+ if (!bp->i2c_control_regs)
+ goto fail;
+
+- bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
+- if (!bp->i2c_bussel_reg)
+- goto fail;
++ if (op->num_resources == 2) {
++ bp->i2c_bussel_reg = of_ioremap(&op->resource[1], 0, 0x1, "bbc_i2c_bussel");
++ if (!bp->i2c_bussel_reg)
++ goto fail;
++ }
+
+ bp->waiting = 0;
+ init_waitqueue_head(&bp->wq);
+diff --git a/drivers/tty/serial/sunsab.c b/drivers/tty/serial/sunsab.c
+index 2f57df9a71d9..a1e09c0d46f2 100644
+--- a/drivers/tty/serial/sunsab.c
++++ b/drivers/tty/serial/sunsab.c
+@@ -157,6 +157,15 @@ receive_chars(struct uart_sunsab_port *up,
+ (up->port.line == up->port.cons->index))
+ saw_console_brk = 1;
+
++ if (count == 0) {
++ if (unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) {
++ stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR |
++ SAB82532_ISR0_FERR);
++ up->port.icount.brk++;
++ uart_handle_break(&up->port);
++ }
++ }
++
+ for (i = 0; i < count; i++) {
+ unsigned char ch = buf[i], flag;
+
+diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
+index a4daf9eb8562..8dd8cab88b87 100644
+--- a/include/net/ip_tunnels.h
++++ b/include/net/ip_tunnels.h
+@@ -40,6 +40,7 @@ struct ip_tunnel_prl_entry {
+
+ struct ip_tunnel_dst {
+ struct dst_entry __rcu *dst;
++ __be32 saddr;
+ };
+
+ struct ip_tunnel {
+diff --git a/lib/iovec.c b/lib/iovec.c
+index 7a7c2da4cddf..df3abd1eaa4a 100644
+--- a/lib/iovec.c
++++ b/lib/iovec.c
+@@ -85,6 +85,10 @@ EXPORT_SYMBOL(memcpy_toiovecend);
+ int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
+ int offset, int len)
+ {
++ /* No data? Done! */
++ if (len == 0)
++ return 0;
++
+ /* Skip over the finished iovecs */
+ while (offset >= iov->iov_len) {
+ offset -= iov->iov_len;
+diff --git a/net/batman-adv/fragmentation.c b/net/batman-adv/fragmentation.c
+index f14e54a05691..022d18ab27a6 100644
+--- a/net/batman-adv/fragmentation.c
++++ b/net/batman-adv/fragmentation.c
+@@ -128,6 +128,7 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ {
+ struct batadv_frag_table_entry *chain;
+ struct batadv_frag_list_entry *frag_entry_new = NULL, *frag_entry_curr;
++ struct batadv_frag_list_entry *frag_entry_last = NULL;
+ struct batadv_frag_packet *frag_packet;
+ uint8_t bucket;
+ uint16_t seqno, hdr_size = sizeof(struct batadv_frag_packet);
+@@ -180,11 +181,14 @@ static bool batadv_frag_insert_packet(struct batadv_orig_node *orig_node,
+ ret = true;
+ goto out;
+ }
++
++ /* store current entry because it could be the last in list */
++ frag_entry_last = frag_entry_curr;
+ }
+
+- /* Reached the end of the list, so insert after 'frag_entry_curr'. */
+- if (likely(frag_entry_curr)) {
+- hlist_add_after(&frag_entry_curr->list, &frag_entry_new->list);
++ /* Reached the end of the list, so insert after 'frag_entry_last'. */
++ if (likely(frag_entry_last)) {
++ hlist_add_after(&frag_entry_last->list, &frag_entry_new->list);
+ chain->size += skb->len - hdr_size;
+ chain->timestamp = jiffies;
+ ret = true;
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..58ff88edbefd 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -2976,9 +2976,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
+ tail = nskb;
+
+ __copy_skb_header(nskb, head_skb);
+- nskb->mac_len = head_skb->mac_len;
+
+ skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
++ skb_reset_mac_len(nskb);
+
+ skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
+ nskb->data - tnl_hlen,
+diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
+index 6f9de61dce5f..45920d928341 100644
+--- a/net/ipv4/ip_tunnel.c
++++ b/net/ipv4/ip_tunnel.c
+@@ -69,23 +69,25 @@ static unsigned int ip_tunnel_hash(__be32 key, __be32 remote)
+ }
+
+ static void __tunnel_dst_set(struct ip_tunnel_dst *idst,
+- struct dst_entry *dst)
++ struct dst_entry *dst, __be32 saddr)
+ {
+ struct dst_entry *old_dst;
+
+ dst_clone(dst);
+ old_dst = xchg((__force struct dst_entry **)&idst->dst, dst);
+ dst_release(old_dst);
++ idst->saddr = saddr;
+ }
+
+-static void tunnel_dst_set(struct ip_tunnel *t, struct dst_entry *dst)
++static void tunnel_dst_set(struct ip_tunnel *t,
++ struct dst_entry *dst, __be32 saddr)
+ {
+- __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst);
++ __tunnel_dst_set(this_cpu_ptr(t->dst_cache), dst, saddr);
+ }
+
+ static void tunnel_dst_reset(struct ip_tunnel *t)
+ {
+- tunnel_dst_set(t, NULL);
++ tunnel_dst_set(t, NULL, 0);
+ }
+
+ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+@@ -93,20 +95,25 @@ void ip_tunnel_dst_reset_all(struct ip_tunnel *t)
+ int i;
+
+ for_each_possible_cpu(i)
+- __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL);
++ __tunnel_dst_set(per_cpu_ptr(t->dst_cache, i), NULL, 0);
+ }
+ EXPORT_SYMBOL(ip_tunnel_dst_reset_all);
+
+-static struct rtable *tunnel_rtable_get(struct ip_tunnel *t, u32 cookie)
++static struct rtable *tunnel_rtable_get(struct ip_tunnel *t,
++ u32 cookie, __be32 *saddr)
+ {
++ struct ip_tunnel_dst *idst;
+ struct dst_entry *dst;
+
+ rcu_read_lock();
+- dst = rcu_dereference(this_cpu_ptr(t->dst_cache)->dst);
++ idst = this_cpu_ptr(t->dst_cache);
++ dst = rcu_dereference(idst->dst);
+ if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+ dst = NULL;
+ if (dst) {
+- if (dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
++ if (!dst->obsolete || dst->ops->check(dst, cookie)) {
++ *saddr = idst->saddr;
++ } else {
+ tunnel_dst_reset(t);
+ dst_release(dst);
+ dst = NULL;
+@@ -367,7 +374,7 @@ static int ip_tunnel_bind_dev(struct net_device *dev)
+
+ if (!IS_ERR(rt)) {
+ tdev = rt->dst.dev;
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ ip_rt_put(rt);
+ }
+ if (dev->type != ARPHRD_ETHER)
+@@ -610,7 +617,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ init_tunnel_flow(&fl4, protocol, dst, tnl_params->saddr,
+ tunnel->parms.o_key, RT_TOS(tos), tunnel->parms.link);
+
+- rt = connected ? tunnel_rtable_get(tunnel, 0) : NULL;
++ rt = connected ? tunnel_rtable_get(tunnel, 0, &fl4.saddr) : NULL;
+
+ if (!rt) {
+ rt = ip_route_output_key(tunnel->net, &fl4);
+@@ -620,7 +627,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
+ goto tx_error;
+ }
+ if (connected)
+- tunnel_dst_set(tunnel, &rt->dst);
++ tunnel_dst_set(tunnel, &rt->dst, fl4.saddr);
+ }
+
+ if (rt->dst.dev == dev) {
+diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp_vegas.c
+index 9a5e05f27f4f..b40ad897f945 100644
+--- a/net/ipv4/tcp_vegas.c
++++ b/net/ipv4/tcp_vegas.c
+@@ -218,7 +218,8 @@ static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+ * This is:
+ * (actual rate in segments) * baseRTT
+ */
+- target_cwnd = tp->snd_cwnd * vegas->baseRTT / rtt;
++ target_cwnd = (u64)tp->snd_cwnd * vegas->baseRTT;
++ do_div(target_cwnd, rtt);
+
+ /* Calculate the difference between the window we had,
+ * and the window we would like to have. This quantity
+diff --git a/net/ipv4/tcp_veno.c b/net/ipv4/tcp_veno.c
+index 27b9825753d1..8276977d2c85 100644
+--- a/net/ipv4/tcp_veno.c
++++ b/net/ipv4/tcp_veno.c
+@@ -144,7 +144,7 @@ static void tcp_veno_cong_avoid(struct sock *sk, u32 ack, u32 acked)
+
+ rtt = veno->minrtt;
+
+- target_cwnd = (tp->snd_cwnd * veno->basertt);
++ target_cwnd = (u64)tp->snd_cwnd * veno->basertt;
+ target_cwnd <<= V_PARAM_SHIFT;
+ do_div(target_cwnd, rtt);
+
+diff --git a/net/sctp/output.c b/net/sctp/output.c
+index 01ab8e0723f0..407ae2bf97b0 100644
+--- a/net/sctp/output.c
++++ b/net/sctp/output.c
+@@ -599,7 +599,7 @@ out:
+ return err;
+ no_route:
+ kfree_skb(nskb);
+- IP_INC_STATS_BH(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
++ IP_INC_STATS(sock_net(asoc->base.sk), IPSTATS_MIB_OUTNOROUTES);
+
+ /* FIXME: Returning the 'err' will effect all the associations
+ * associated with a socket, although only one of the paths of the
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
2014-08-08 19:48 Mike Pagano
@ 2014-08-19 11:44 ` Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-08-19 11:44 UTC (permalink / raw
To: gentoo-commits
commit: 9df8c18cd85acf5655794c6de5da3a0690675965
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Fri Aug 8 19:48:09 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Fri Aug 8 19:48:09 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=9df8c18c
BFQ patch for 3.16
---
0000_README | 11 +
...-cgroups-kconfig-build-bits-for-v7r5-3.16.patch | 104 +
...ck-introduce-the-v7r5-I-O-sched-for-3.16.patch1 | 6635 ++++++++++++++++++++
...add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch | 1188 ++++
4 files changed, 7938 insertions(+)
diff --git a/0000_README b/0000_README
index da7da0d..a6ec2e6 100644
--- a/0000_README
+++ b/0000_README
@@ -75,3 +75,14 @@ Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
From: https://github.com/graysky2/kernel_gcc_patch/
Desc: Kernel patch enables gcc optimizations for additional CPUs.
+Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 1 for 3.16: Build, cgroups and kconfig bits
+
+Patch: 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
+
+Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+From: http://algo.ing.unimo.it/people/paolo/disk_sched/
+Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
diff --git a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
new file mode 100644
index 0000000..088bd05
--- /dev/null
+++ b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
@@ -0,0 +1,104 @@
+From 6519e5beef1063a86d3fc917cff2592cb599e824 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 22 May 2014 11:59:35 +0200
+Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r5-3.16
+
+Update Kconfig.iosched and do the related Makefile changes to include
+kernel configuration options for BFQ. Also add the bfqio controller
+to the cgroups subsystem.
+
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/Kconfig.iosched | 32 ++++++++++++++++++++++++++++++++
+ block/Makefile | 1 +
+ include/linux/cgroup_subsys.h | 4 ++++
+ 3 files changed, 37 insertions(+)
+
+diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
+index 421bef9..0ee5f0f 100644
+--- a/block/Kconfig.iosched
++++ b/block/Kconfig.iosched
+@@ -39,6 +39,27 @@ config CFQ_GROUP_IOSCHED
+ ---help---
+ Enable group IO scheduling in CFQ.
+
++config IOSCHED_BFQ
++ tristate "BFQ I/O scheduler"
++ default n
++ ---help---
++ The BFQ I/O scheduler tries to distribute bandwidth among
++ all processes according to their weights.
++ It aims at distributing the bandwidth as desired, independently of
++ the disk parameters and with any workload. It also tries to
++ guarantee low latency to interactive and soft real-time
++ applications. If compiled built-in (saying Y here), BFQ can
++ be configured to support hierarchical scheduling.
++
++config CGROUP_BFQIO
++ bool "BFQ hierarchical scheduling support"
++ depends on CGROUPS && IOSCHED_BFQ=y
++ default n
++ ---help---
++ Enable hierarchical scheduling in BFQ, using the cgroups
++ filesystem interface. The name of the subsystem will be
++ bfqio.
++
+ choice
+ prompt "Default I/O scheduler"
+ default DEFAULT_CFQ
+@@ -52,6 +73,16 @@ choice
+ config DEFAULT_CFQ
+ bool "CFQ" if IOSCHED_CFQ=y
+
++ config DEFAULT_BFQ
++ bool "BFQ" if IOSCHED_BFQ=y
++ help
++ Selects BFQ as the default I/O scheduler which will be
++ used by default for all block devices.
++ The BFQ I/O scheduler aims at distributing the bandwidth
++ as desired, independently of the disk parameters and with
++ any workload. It also tries to guarantee low latency to
++ interactive and soft real-time applications.
++
+ config DEFAULT_NOOP
+ bool "No-op"
+
+@@ -61,6 +92,7 @@ config DEFAULT_IOSCHED
+ string
+ default "deadline" if DEFAULT_DEADLINE
+ default "cfq" if DEFAULT_CFQ
++ default "bfq" if DEFAULT_BFQ
+ default "noop" if DEFAULT_NOOP
+
+ endmenu
+diff --git a/block/Makefile b/block/Makefile
+index a2ce6ac..a0fc06a 100644
+--- a/block/Makefile
++++ b/block/Makefile
+@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
+ obj-$(CONFIG_IOSCHED_NOOP) += noop-iosched.o
+ obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
+ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched.o
++obj-$(CONFIG_IOSCHED_BFQ) += bfq-iosched.o
+
+ obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+ obj-$(CONFIG_BLK_DEV_INTEGRITY) += blk-integrity.o
+diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
+index 98c4f9b..13b010d 100644
+--- a/include/linux/cgroup_subsys.h
++++ b/include/linux/cgroup_subsys.h
+@@ -35,6 +35,10 @@ SUBSYS(net_cls)
+ SUBSYS(blkio)
+ #endif
+
++#if IS_ENABLED(CONFIG_CGROUP_BFQIO)
++SUBSYS(bfqio)
++#endif
++
+ #if IS_ENABLED(CONFIG_CGROUP_PERF)
+ SUBSYS(perf_event)
+ #endif
+--
+2.0.3
+
diff --git a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1 b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
new file mode 100644
index 0000000..6f630ba
--- /dev/null
+++ b/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
@@ -0,0 +1,6635 @@
+From c56e6c5db41f7137d3e0b38063ef0c944eec1898 Mon Sep 17 00:00:00 2001
+From: Paolo Valente <paolo.valente@unimore.it>
+Date: Thu, 9 May 2013 19:10:02 +0200
+Subject: [PATCH 2/3] block: introduce the BFQ-v7r5 I/O sched for 3.16
+
+Add the BFQ-v7r5 I/O scheduler to 3.16.
+The general structure is borrowed from CFQ, as much of the code for
+handling I/O contexts. Over time, several useful features have been
+ported from CFQ as well (details in the changelog in README.BFQ). A
+(bfq_)queue is associated to each task doing I/O on a device, and each
+time a scheduling decision has to be made a queue is selected and served
+until it expires.
+
+ - Slices are given in the service domain: tasks are assigned
+ budgets, measured in number of sectors. Once got the disk, a task
+ must however consume its assigned budget within a configurable
+ maximum time (by default, the maximum possible value of the
+ budgets is automatically computed to comply with this timeout).
+ This allows the desired latency vs "throughput boosting" tradeoff
+ to be set.
+
+ - Budgets are scheduled according to a variant of WF2Q+, implemented
+ using an augmented rb-tree to take eligibility into account while
+ preserving an O(log N) overall complexity.
+
+ - A low-latency tunable is provided; if enabled, both interactive
+ and soft real-time applications are guaranteed a very low latency.
+
+ - Latency guarantees are preserved also in the presence of NCQ.
+
+ - Also with flash-based devices, a high throughput is achieved
+ while still preserving latency guarantees.
+
+ - BFQ features Early Queue Merge (EQM), a sort of fusion of the
+ cooperating-queue-merging and the preemption mechanisms present
+ in CFQ. EQM is in fact a unified mechanism that tries to get a
+ sequential read pattern, and hence a high throughput, with any
+ set of processes performing interleaved I/O over a contiguous
+ sequence of sectors.
+
+ - BFQ supports full hierarchical scheduling, exporting a cgroups
+ interface. Since each node has a full scheduler, each group can
+ be assigned its own weight.
+
+ - If the cgroups interface is not used, only I/O priorities can be
+ assigned to processes, with ioprio values mapped to weights
+ with the relation weight = IOPRIO_BE_NR - ioprio.
+
+ - ioprio classes are served in strict priority order, i.e., lower
+ priority queues are not served as long as there are higher
+ priority queues. Among queues in the same class the bandwidth is
+ distributed in proportion to the weight of each queue. A very
+ thin extra bandwidth is however guaranteed to the Idle class, to
+ prevent it from starving.
+
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+---
+ block/bfq-cgroup.c | 930 +++++++++++++
+ block/bfq-ioc.c | 36 +
+ block/bfq-iosched.c | 3617 +++++++++++++++++++++++++++++++++++++++++++++++++++
+ block/bfq-sched.c | 1207 +++++++++++++++++
+ block/bfq.h | 742 +++++++++++
+ 5 files changed, 6532 insertions(+)
+ create mode 100644 block/bfq-cgroup.c
+ create mode 100644 block/bfq-ioc.c
+ create mode 100644 block/bfq-iosched.c
+ create mode 100644 block/bfq-sched.c
+ create mode 100644 block/bfq.h
+
+diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
+new file mode 100644
+index 0000000..f742806
+--- /dev/null
++++ b/block/bfq-cgroup.c
+@@ -0,0 +1,930 @@
++/*
++ * BFQ: CGROUPS support.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++
++static DEFINE_MUTEX(bfqio_mutex);
++
++static bool bfqio_is_removed(struct bfqio_cgroup *bgrp)
++{
++ return bgrp ? !bgrp->online : false;
++}
++
++static struct bfqio_cgroup bfqio_root_cgroup = {
++ .weight = BFQ_DEFAULT_GRP_WEIGHT,
++ .ioprio = BFQ_DEFAULT_GRP_IOPRIO,
++ .ioprio_class = BFQ_DEFAULT_GRP_CLASS,
++};
++
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static struct bfqio_cgroup *css_to_bfqio(struct cgroup_subsys_state *css)
++{
++ return css ? container_of(css, struct bfqio_cgroup, css) : NULL;
++}
++
++/*
++ * Search the bfq_group for bfqd into the hash table (by now only a list)
++ * of bgrp. Must be called under rcu_read_lock().
++ */
++static struct bfq_group *bfqio_lookup_group(struct bfqio_cgroup *bgrp,
++ struct bfq_data *bfqd)
++{
++ struct bfq_group *bfqg;
++ void *key;
++
++ hlist_for_each_entry_rcu(bfqg, &bgrp->group_data, group_node) {
++ key = rcu_dereference(bfqg->bfqd);
++ if (key == bfqd)
++ return bfqg;
++ }
++
++ return NULL;
++}
++
++static inline void bfq_group_init_entity(struct bfqio_cgroup *bgrp,
++ struct bfq_group *bfqg)
++{
++ struct bfq_entity *entity = &bfqg->entity;
++
++ /*
++ * If the weight of the entity has never been set via the sysfs
++ * interface, then bgrp->weight == 0. In this case we initialize
++ * the weight from the current ioprio value. Otherwise, the group
++ * weight, if set, has priority over the ioprio value.
++ */
++ if (bgrp->weight == 0) {
++ entity->new_weight = bfq_ioprio_to_weight(bgrp->ioprio);
++ entity->new_ioprio = bgrp->ioprio;
++ } else {
++ entity->new_weight = bgrp->weight;
++ entity->new_ioprio = bfq_weight_to_ioprio(bgrp->weight);
++ }
++ entity->orig_weight = entity->weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class = bgrp->ioprio_class;
++ entity->my_sched_data = &bfqg->sched_data;
++ bfqg->active_entities = 0;
++}
++
++static inline void bfq_group_set_parent(struct bfq_group *bfqg,
++ struct bfq_group *parent)
++{
++ struct bfq_entity *entity;
++
++ BUG_ON(parent == NULL);
++ BUG_ON(bfqg == NULL);
++
++ entity = &bfqg->entity;
++ entity->parent = parent->my_entity;
++ entity->sched_data = &parent->sched_data;
++}
++
++/**
++ * bfq_group_chain_alloc - allocate a chain of groups.
++ * @bfqd: queue descriptor.
++ * @css: the leaf cgroup_subsys_state this chain starts from.
++ *
++ * Allocate a chain of groups starting from the one belonging to
++ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
++ * to the root has already an allocated group on @bfqd.
++ */
++static struct bfq_group *bfq_group_chain_alloc(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *prev = NULL, *leaf = NULL;
++
++ for (; css != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL) {
++ /*
++ * All the cgroups in the path from there to the
++ * root must have a bfq_group for bfqd, so we don't
++ * need any more allocations.
++ */
++ break;
++ }
++
++ bfqg = kzalloc(sizeof(*bfqg), GFP_ATOMIC);
++ if (bfqg == NULL)
++ goto cleanup;
++
++ bfq_group_init_entity(bgrp, bfqg);
++ bfqg->my_entity = &bfqg->entity;
++
++ if (leaf == NULL) {
++ leaf = bfqg;
++ prev = leaf;
++ } else {
++ bfq_group_set_parent(prev, bfqg);
++ /*
++ * Build a list of allocated nodes using the bfqd
++ * filed, that is still unused and will be
++ * initialized only after the node will be
++ * connected.
++ */
++ prev->bfqd = bfqg;
++ prev = bfqg;
++ }
++ }
++
++ return leaf;
++
++cleanup:
++ while (leaf != NULL) {
++ prev = leaf;
++ leaf = leaf->bfqd;
++ kfree(prev);
++ }
++
++ return NULL;
++}
++
++/**
++ * bfq_group_chain_link - link an allocated group chain to a cgroup
++ * hierarchy.
++ * @bfqd: the queue descriptor.
++ * @css: the leaf cgroup_subsys_state to start from.
++ * @leaf: the leaf group (to be associated to @cgroup).
++ *
++ * Try to link a chain of groups to a cgroup hierarchy, connecting the
++ * nodes bottom-up, so we can be sure that when we find a cgroup in the
++ * hierarchy that already as a group associated to @bfqd all the nodes
++ * in the path to the root cgroup have one too.
++ *
++ * On locking: the queue lock protects the hierarchy (there is a hierarchy
++ * per device) while the bfqio_cgroup lock protects the list of groups
++ * belonging to the same cgroup.
++ */
++static void bfq_group_chain_link(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css,
++ struct bfq_group *leaf)
++{
++ struct bfqio_cgroup *bgrp;
++ struct bfq_group *bfqg, *next, *prev = NULL;
++ unsigned long flags;
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++
++ for (; css != NULL && leaf != NULL; css = css->parent) {
++ bgrp = css_to_bfqio(css);
++ next = leaf->bfqd;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ BUG_ON(bfqg != NULL);
++
++ spin_lock_irqsave(&bgrp->lock, flags);
++
++ rcu_assign_pointer(leaf->bfqd, bfqd);
++ hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
++ hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
++
++ spin_unlock_irqrestore(&bgrp->lock, flags);
++
++ prev = leaf;
++ leaf = next;
++ }
++
++ BUG_ON(css == NULL && leaf != NULL);
++ if (css != NULL && prev != NULL) {
++ bgrp = css_to_bfqio(css);
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ bfq_group_set_parent(prev, bfqg);
++ }
++}
++
++/**
++ * bfq_find_alloc_group - return the group associated to @bfqd in @cgroup.
++ * @bfqd: queue descriptor.
++ * @cgroup: cgroup being searched for.
++ *
++ * Return a group associated to @bfqd in @cgroup, allocating one if
++ * necessary. When a group is returned all the cgroups in the path
++ * to the root have a group associated to @bfqd.
++ *
++ * If the allocation fails, return the root group: this breaks guarantees
++ * but is a safe fallback. If this loss becomes a problem it can be
++ * mitigated using the equivalent weight (given by the product of the
++ * weights of the groups in the path from @group to the root) in the
++ * root scheduler.
++ *
++ * We allocate all the missing nodes in the path from the leaf cgroup
++ * to the root and we connect the nodes only after all the allocations
++ * have been successful.
++ */
++static struct bfq_group *bfq_find_alloc_group(struct bfq_data *bfqd,
++ struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct bfq_group *bfqg;
++
++ bfqg = bfqio_lookup_group(bgrp, bfqd);
++ if (bfqg != NULL)
++ return bfqg;
++
++ bfqg = bfq_group_chain_alloc(bfqd, css);
++ if (bfqg != NULL)
++ bfq_group_chain_link(bfqd, css, bfqg);
++ else
++ bfqg = bfqd->root_group;
++
++ return bfqg;
++}
++
++/**
++ * bfq_bfqq_move - migrate @bfqq to @bfqg.
++ * @bfqd: queue descriptor.
++ * @bfqq: the queue to move.
++ * @entity: @bfqq's entity.
++ * @bfqg: the group to move to.
++ *
++ * Move @bfqq to @bfqg, deactivating it from its old group and reactivating
++ * it on the new one. Avoid putting the entity on the old group idle tree.
++ *
++ * Must be called under the queue lock; the cgroup owning @bfqg must
++ * not disappear (by now this just means that we are called under
++ * rcu_read_lock()).
++ */
++static void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct bfq_entity *entity, struct bfq_group *bfqg)
++{
++ int busy, resume;
++
++ busy = bfq_bfqq_busy(bfqq);
++ resume = !RB_EMPTY_ROOT(&bfqq->sort_list);
++
++ BUG_ON(resume && !entity->on_st);
++ BUG_ON(busy && !resume && entity->on_st &&
++ bfqq != bfqd->in_service_queue);
++
++ if (busy) {
++ BUG_ON(atomic_read(&bfqq->ref) < 2);
++
++ if (!resume)
++ bfq_del_bfqq_busy(bfqd, bfqq, 0);
++ else
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++ } else if (entity->on_st)
++ bfq_put_idle_entity(bfq_entity_service_tree(entity), entity);
++
++ /*
++ * Here we use a reference to bfqg. We don't need a refcounter
++ * as the cgroup reference will not be dropped, so that its
++ * destroy() callback will not be invoked.
++ */
++ entity->parent = bfqg->my_entity;
++ entity->sched_data = &bfqg->sched_data;
++
++ if (busy && resume)
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ if (bfqd->in_service_queue == NULL && !bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++}
++
++/**
++ * __bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bfqd: the queue descriptor.
++ * @bic: the bic to move.
++ * @cgroup: the cgroup to move to.
++ *
++ * Move bic to cgroup, assuming that bfqd->queue is locked; the caller
++ * has to make sure that the reference to cgroup is valid across the call.
++ *
++ * NOTE: an alternative approach might have been to store the current
++ * cgroup in bfqq and to get a reference to it, reducing the lookup
++ * time here, at the price of slightly more complex code.
++ */
++static struct bfq_group *__bfq_bic_change_cgroup(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_queue *async_bfqq = bic_to_bfqq(bic, 0);
++ struct bfq_queue *sync_bfqq = bic_to_bfqq(bic, 1);
++ struct bfq_entity *entity;
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++
++ bgrp = css_to_bfqio(css);
++
++ bfqg = bfq_find_alloc_group(bfqd, css);
++ if (async_bfqq != NULL) {
++ entity = &async_bfqq->entity;
++
++ if (entity->sched_data != &bfqg->sched_data) {
++ bic_set_bfqq(bic, NULL, 0);
++ bfq_log_bfqq(bfqd, async_bfqq,
++ "bic_change_group: %p %d",
++ async_bfqq, atomic_read(&async_bfqq->ref));
++ bfq_put_queue(async_bfqq);
++ }
++ }
++
++ if (sync_bfqq != NULL) {
++ entity = &sync_bfqq->entity;
++ if (entity->sched_data != &bfqg->sched_data)
++ bfq_bfqq_move(bfqd, sync_bfqq, entity, bfqg);
++ }
++
++ return bfqg;
++}
++
++/**
++ * bfq_bic_change_cgroup - move @bic to @cgroup.
++ * @bic: the bic being migrated.
++ * @cgroup: the destination cgroup.
++ *
++ * When the task owning @bic is moved to @cgroup, @bic is immediately
++ * moved into its new parent group.
++ */
++static void bfq_bic_change_cgroup(struct bfq_io_cq *bic,
++ struct cgroup_subsys_state *css)
++{
++ struct bfq_data *bfqd;
++ unsigned long uninitialized_var(flags);
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ if (bfqd != NULL) {
++ __bfq_bic_change_cgroup(bfqd, bic, css);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++}
++
++/**
++ * bfq_bic_update_cgroup - update the cgroup of @bic.
++ * @bic: the @bic to update.
++ *
++ * Make sure that @bic is enqueued in the cgroup of the current task.
++ * We need this in addition to moving bics during the cgroup attach
++ * phase because the task owning @bic could be at its first disk
++ * access or we may end up in the root cgroup as the result of a
++ * memory allocation failure and here we try to move to the right
++ * group.
++ *
++ * Must be called under the queue lock. It is safe to use the returned
++ * value even after the rcu_read_unlock() as the migration/destruction
++ * paths act under the queue lock too. IOW it is impossible to race with
++ * group migration/destruction and end up with an invalid group as:
++ * a) here cgroup has not yet been destroyed, nor its destroy callback
++ * has started execution, as current holds a reference to it,
++ * b) if it is destroyed after rcu_read_unlock() [after current is
++ * migrated to a different cgroup] its attach() callback will have
++ * taken care of removing all the references to the old cgroup data.
++ */
++static struct bfq_group *bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ struct bfq_group *bfqg;
++ struct cgroup_subsys_state *css;
++
++ BUG_ON(bfqd == NULL);
++
++ rcu_read_lock();
++ css = task_css(current, bfqio_cgrp_id);
++ bfqg = __bfq_bic_change_cgroup(bfqd, bic, css);
++ rcu_read_unlock();
++
++ return bfqg;
++}
++
++/**
++ * bfq_flush_idle_tree - deactivate any entity on the idle tree of @st.
++ * @st: the service tree being flushed.
++ */
++static inline void bfq_flush_idle_tree(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entity = st->first_idle;
++
++ for (; entity != NULL; entity = st->first_idle)
++ __bfq_deactivate_entity(entity, 0);
++}
++
++/**
++ * bfq_reparent_leaf_entity - move leaf entity to the root_group.
++ * @bfqd: the device data structure with the root group.
++ * @entity: the entity to move.
++ */
++static inline void bfq_reparent_leaf_entity(struct bfq_data *bfqd,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(bfqq == NULL);
++ bfq_bfqq_move(bfqd, bfqq, entity, bfqd->root_group);
++ return;
++}
++
++/**
++ * bfq_reparent_active_entities - move to the root group all active
++ * entities.
++ * @bfqd: the device data structure with the root group.
++ * @bfqg: the group to move from.
++ * @st: the service tree with the entities.
++ *
++ * Needs queue_lock to be taken and reference to be valid over the call.
++ */
++static inline void bfq_reparent_active_entities(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ struct bfq_service_tree *st)
++{
++ struct rb_root *active = &st->active;
++ struct bfq_entity *entity = NULL;
++
++ if (!RB_EMPTY_ROOT(&st->active))
++ entity = bfq_entity_of(rb_first(active));
++
++ for (; entity != NULL; entity = bfq_entity_of(rb_first(active)))
++ bfq_reparent_leaf_entity(bfqd, entity);
++
++ if (bfqg->sched_data.in_service_entity != NULL)
++ bfq_reparent_leaf_entity(bfqd,
++ bfqg->sched_data.in_service_entity);
++
++ return;
++}
++
++/**
++ * bfq_destroy_group - destroy @bfqg.
++ * @bgrp: the bfqio_cgroup containing @bfqg.
++ * @bfqg: the group being destroyed.
++ *
++ * Destroy @bfqg, making sure that it is not referenced from its parent.
++ */
++static void bfq_destroy_group(struct bfqio_cgroup *bgrp, struct bfq_group *bfqg)
++{
++ struct bfq_data *bfqd;
++ struct bfq_service_tree *st;
++ struct bfq_entity *entity = bfqg->my_entity;
++ unsigned long uninitialized_var(flags);
++ int i;
++
++ hlist_del(&bfqg->group_node);
++
++ /*
++ * Empty all service_trees belonging to this group before
++ * deactivating the group itself.
++ */
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++) {
++ st = bfqg->sched_data.service_tree + i;
++
++ /*
++ * The idle tree may still contain bfq_queues belonging
++ * to exited tasks because they never migrated to a different
++ * cgroup from the one being destroyed now. No one else
++ * can access them so it's safe to act without any lock.
++ */
++ bfq_flush_idle_tree(st);
++
++ /*
++ * It may happen that some queues are still active
++ * (busy) upon group destruction (if the corresponding
++ * processes have been forced to terminate). We move
++ * all the leaf entities corresponding to these queues
++ * to the root_group.
++ * Also, it may happen that the group has an entity
++ * in service, which is disconnected from the active
++ * tree: it must be moved, too.
++ * There is no need to put the sync queues, as the
++ * scheduler has taken no reference.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ bfq_reparent_active_entities(bfqd, bfqg, st);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(!RB_EMPTY_ROOT(&st->active));
++ BUG_ON(!RB_EMPTY_ROOT(&st->idle));
++ }
++ BUG_ON(bfqg->sched_data.next_in_service != NULL);
++ BUG_ON(bfqg->sched_data.in_service_entity != NULL);
++
++ /*
++ * We may race with device destruction, take extra care when
++ * dereferencing bfqg->bfqd.
++ */
++ bfqd = bfq_get_bfqd_locked(&bfqg->bfqd, &flags);
++ if (bfqd != NULL) {
++ hlist_del(&bfqg->bfqd_node);
++ __bfq_deactivate_entity(entity, 0);
++ bfq_put_async_queues(bfqd, bfqg);
++ bfq_put_bfqd_unlock(bfqd, &flags);
++ }
++ BUG_ON(entity->tree != NULL);
++
++ /*
++ * No need to defer the kfree() to the end of the RCU grace
++ * period: we are called from the destroy() callback of our
++ * cgroup, so we can be sure that no one is a) still using
++ * this cgroup or b) doing lookups in it.
++ */
++ kfree(bfqg);
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node)
++ bfq_end_wr_async_queues(bfqd, bfqg);
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++/**
++ * bfq_disconnect_groups - disconnect @bfqd from all its groups.
++ * @bfqd: the device descriptor being exited.
++ *
++ * When the device exits we just make sure that no lookup can return
++ * the now unused group structures. They will be deallocated on cgroup
++ * destruction.
++ */
++static void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ bfq_log(bfqd, "disconnect_groups beginning");
++ hlist_for_each_entry_safe(bfqg, tmp, &bfqd->group_list, bfqd_node) {
++ hlist_del(&bfqg->bfqd_node);
++
++ __bfq_deactivate_entity(bfqg->my_entity, 0);
++
++ /*
++ * Don't remove from the group hash, just set an
++ * invalid key. No lookups can race with the
++ * assignment as bfqd is being destroyed; this
++ * implies also that new elements cannot be added
++ * to the list.
++ */
++ rcu_assign_pointer(bfqg->bfqd, NULL);
++
++ bfq_log(bfqd, "disconnect_groups: put async for group %p",
++ bfqg);
++ bfq_put_async_queues(bfqd, bfqg);
++ }
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ struct bfqio_cgroup *bgrp = &bfqio_root_cgroup;
++ struct bfq_group *bfqg = bfqd->root_group;
++
++ bfq_put_async_queues(bfqd, bfqg);
++
++ spin_lock_irq(&bgrp->lock);
++ hlist_del_rcu(&bfqg->group_node);
++ spin_unlock_irq(&bgrp->lock);
++
++ /*
++ * No need to synchronize_rcu() here: since the device is gone
++ * there cannot be any read-side access to its root_group.
++ */
++ kfree(bfqg);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ struct bfqio_cgroup *bgrp;
++ int i;
++
++ bfqg = kzalloc_node(sizeof(*bfqg), GFP_KERNEL, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ bfqg->entity.parent = NULL;
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ bgrp = &bfqio_root_cgroup;
++ spin_lock_irq(&bgrp->lock);
++ rcu_assign_pointer(bfqg->bfqd, bfqd);
++ hlist_add_head_rcu(&bfqg->group_node, &bgrp->group_data);
++ spin_unlock_irq(&bgrp->lock);
++
++ return bfqg;
++}
++
++#define SHOW_FUNCTION(__VAR) \
++static u64 bfqio_cgroup_##__VAR##_read(struct cgroup_subsys_state *css, \
++ struct cftype *cftype) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ u64 ret = -ENODEV; \
++ \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ ret = bgrp->__VAR; \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++SHOW_FUNCTION(weight);
++SHOW_FUNCTION(ioprio);
++SHOW_FUNCTION(ioprio_class);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__VAR, __MIN, __MAX) \
++static int bfqio_cgroup_##__VAR##_write(struct cgroup_subsys_state *css,\
++ struct cftype *cftype, \
++ u64 val) \
++{ \
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css); \
++ struct bfq_group *bfqg; \
++ int ret = -EINVAL; \
++ \
++ if (val < (__MIN) || val > (__MAX)) \
++ return ret; \
++ \
++ ret = -ENODEV; \
++ mutex_lock(&bfqio_mutex); \
++ if (bfqio_is_removed(bgrp)) \
++ goto out_unlock; \
++ ret = 0; \
++ \
++ spin_lock_irq(&bgrp->lock); \
++ bgrp->__VAR = (unsigned short)val; \
++ hlist_for_each_entry(bfqg, &bgrp->group_data, group_node) { \
++ /* \
++ * Setting the ioprio_changed flag of the entity \
++ * to 1 with new_##__VAR == ##__VAR would re-set \
++ * the value of the weight to its ioprio mapping. \
++ * Set the flag only if necessary. \
++ */ \
++ if ((unsigned short)val != bfqg->entity.new_##__VAR) { \
++ bfqg->entity.new_##__VAR = (unsigned short)val; \
++ /* \
++ * Make sure that the above new value has been \
++ * stored in bfqg->entity.new_##__VAR before \
++ * setting the ioprio_changed flag. In fact, \
++ * this flag may be read asynchronously (in \
++ * critical sections protected by a different \
++ * lock than that held here), and finding this \
++ * flag set may cause the execution of the code \
++ * for updating parameters whose value may \
++ * depend also on bfqg->entity.new_##__VAR (in \
++ * __bfq_entity_update_weight_prio). \
++ * This barrier makes sure that the new value \
++ * of bfqg->entity.new_##__VAR is correctly \
++ * seen in that code. \
++ */ \
++ smp_wmb(); \
++ bfqg->entity.ioprio_changed = 1; \
++ } \
++ } \
++ spin_unlock_irq(&bgrp->lock); \
++ \
++out_unlock: \
++ mutex_unlock(&bfqio_mutex); \
++ return ret; \
++}
++
++STORE_FUNCTION(weight, BFQ_MIN_WEIGHT, BFQ_MAX_WEIGHT);
++STORE_FUNCTION(ioprio, 0, IOPRIO_BE_NR - 1);
++STORE_FUNCTION(ioprio_class, IOPRIO_CLASS_RT, IOPRIO_CLASS_IDLE);
++#undef STORE_FUNCTION
++
++static struct cftype bfqio_files[] = {
++ {
++ .name = "weight",
++ .read_u64 = bfqio_cgroup_weight_read,
++ .write_u64 = bfqio_cgroup_weight_write,
++ },
++ {
++ .name = "ioprio",
++ .read_u64 = bfqio_cgroup_ioprio_read,
++ .write_u64 = bfqio_cgroup_ioprio_write,
++ },
++ {
++ .name = "ioprio_class",
++ .read_u64 = bfqio_cgroup_ioprio_class_read,
++ .write_u64 = bfqio_cgroup_ioprio_class_write,
++ },
++ { }, /* terminate */
++};
++
++static struct cgroup_subsys_state *bfqio_create(struct cgroup_subsys_state
++ *parent_css)
++{
++ struct bfqio_cgroup *bgrp;
++
++ if (parent_css != NULL) {
++ bgrp = kzalloc(sizeof(*bgrp), GFP_KERNEL);
++ if (bgrp == NULL)
++ return ERR_PTR(-ENOMEM);
++ } else
++ bgrp = &bfqio_root_cgroup;
++
++ spin_lock_init(&bgrp->lock);
++ INIT_HLIST_HEAD(&bgrp->group_data);
++ bgrp->ioprio = BFQ_DEFAULT_GRP_IOPRIO;
++ bgrp->ioprio_class = BFQ_DEFAULT_GRP_CLASS;
++
++ return &bgrp->css;
++}
++
++/*
++ * We cannot support shared io contexts, as we have no means to support
++ * two tasks with the same ioc in two different groups without major rework
++ * of the main bic/bfqq data structures. By now we allow a task to change
++ * its cgroup only if it's the only owner of its ioc; the drawback of this
++ * behavior is that a group containing a task that forked using CLONE_IO
++ * will not be destroyed until the tasks sharing the ioc die.
++ */
++static int bfqio_can_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ int ret = 0;
++
++ cgroup_taskset_for_each(task, tset) {
++ /*
++ * task_lock() is needed to avoid races with
++ * exit_io_context()
++ */
++ task_lock(task);
++ ioc = task->io_context;
++ if (ioc != NULL && atomic_read(&ioc->nr_tasks) > 1)
++ /*
++ * ioc == NULL means that the task is either too
++ * young or exiting: if it still has no ioc, the
++ * ioc can't be shared; if the task is exiting, the
++ * attach will fail anyway, no matter what we
++ * return here.
++ */
++ ret = -EINVAL;
++ task_unlock(task);
++ if (ret)
++ break;
++ }
++
++ return ret;
++}
++
++static void bfqio_attach(struct cgroup_subsys_state *css,
++ struct cgroup_taskset *tset)
++{
++ struct task_struct *task;
++ struct io_context *ioc;
++ struct io_cq *icq;
++
++ /*
++ * IMPORTANT NOTE: The move of more than one process at a time to a
++ * new group has not yet been tested.
++ */
++ cgroup_taskset_for_each(task, tset) {
++ ioc = get_task_io_context(task, GFP_ATOMIC, NUMA_NO_NODE);
++ if (ioc) {
++ /*
++ * Handle cgroup change here.
++ */
++ rcu_read_lock();
++ hlist_for_each_entry_rcu(icq, &ioc->icq_list, ioc_node)
++ if (!strncmp(
++ icq->q->elevator->type->elevator_name,
++ "bfq", ELV_NAME_MAX))
++ bfq_bic_change_cgroup(icq_to_bic(icq),
++ css);
++ rcu_read_unlock();
++ put_io_context(ioc);
++ }
++ }
++}
++
++static void bfqio_destroy(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++ struct hlist_node *tmp;
++ struct bfq_group *bfqg;
++
++ /*
++ * Since we are destroying the cgroup, there are no more tasks
++ * referencing it, and all the RCU grace periods that may have
++ * referenced it are ended (as the destruction of the parent
++ * cgroup is RCU-safe); bgrp->group_data will not be accessed by
++ * anything else and we don't need any synchronization.
++ */
++ hlist_for_each_entry_safe(bfqg, tmp, &bgrp->group_data, group_node)
++ bfq_destroy_group(bgrp, bfqg);
++
++ BUG_ON(!hlist_empty(&bgrp->group_data));
++
++ kfree(bgrp);
++}
++
++static int bfqio_css_online(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = true;
++ mutex_unlock(&bfqio_mutex);
++
++ return 0;
++}
++
++static void bfqio_css_offline(struct cgroup_subsys_state *css)
++{
++ struct bfqio_cgroup *bgrp = css_to_bfqio(css);
++
++ mutex_lock(&bfqio_mutex);
++ bgrp->online = false;
++ mutex_unlock(&bfqio_mutex);
++}
++
++struct cgroup_subsys bfqio_cgrp_subsys = {
++ .css_alloc = bfqio_create,
++ .css_online = bfqio_css_online,
++ .css_offline = bfqio_css_offline,
++ .can_attach = bfqio_can_attach,
++ .attach = bfqio_attach,
++ .css_free = bfqio_destroy,
++ .base_cftypes = bfqio_files,
++};
++#else
++static inline void bfq_init_entity(struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++ entity->weight = entity->new_weight;
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio = entity->new_ioprio;
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->sched_data = &bfqg->sched_data;
++}
++
++static inline struct bfq_group *
++bfq_bic_update_cgroup(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++ return bfqd->root_group;
++}
++
++static inline void bfq_bfqq_move(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_entity *entity,
++ struct bfq_group *bfqg)
++{
++}
++
++static void bfq_end_wr_async(struct bfq_data *bfqd)
++{
++ bfq_end_wr_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_disconnect_groups(struct bfq_data *bfqd)
++{
++ bfq_put_async_queues(bfqd, bfqd->root_group);
++}
++
++static inline void bfq_free_root_group(struct bfq_data *bfqd)
++{
++ kfree(bfqd->root_group);
++}
++
++static struct bfq_group *bfq_alloc_root_group(struct bfq_data *bfqd, int node)
++{
++ struct bfq_group *bfqg;
++ int i;
++
++ bfqg = kmalloc_node(sizeof(*bfqg), GFP_KERNEL | __GFP_ZERO, node);
++ if (bfqg == NULL)
++ return NULL;
++
++ for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)
++ bfqg->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;
++
++ return bfqg;
++}
++#endif
+diff --git a/block/bfq-ioc.c b/block/bfq-ioc.c
+new file mode 100644
+index 0000000..7f6b000
+--- /dev/null
++++ b/block/bfq-ioc.c
+@@ -0,0 +1,36 @@
++/*
++ * BFQ: I/O context handling.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++/**
++ * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
++ * @icq: the iocontext queue.
++ */
++static inline struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
++{
++ /* bic->icq is the first member, %NULL will convert to %NULL */
++ return container_of(icq, struct bfq_io_cq, icq);
++}
++
++/**
++ * bfq_bic_lookup - search into @ioc a bic associated to @bfqd.
++ * @bfqd: the lookup key.
++ * @ioc: the io_context of the process doing I/O.
++ *
++ * Queue lock must be held.
++ */
++static inline struct bfq_io_cq *bfq_bic_lookup(struct bfq_data *bfqd,
++ struct io_context *ioc)
++{
++ if (ioc)
++ return icq_to_bic(ioc_lookup_icq(ioc, bfqd->queue));
++ return NULL;
++}
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+new file mode 100644
+index 0000000..0a0891b
+--- /dev/null
++++ b/block/bfq-iosched.c
+@@ -0,0 +1,3617 @@
++/*
++ * Budget Fair Queueing (BFQ) disk scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Licensed under the GPL-2 as detailed in the accompanying COPYING.BFQ
++ * file.
++ *
++ * BFQ is a proportional-share storage-I/O scheduling algorithm based on
++ * the slice-by-slice service scheme of CFQ. But BFQ assigns budgets,
++ * measured in number of sectors, to processes instead of time slices. The
++ * device is not granted to the in-service process for a given time slice,
++ * but until it has exhausted its assigned budget. This change from the time
++ * to the service domain allows BFQ to distribute the device throughput
++ * among processes as desired, without any distortion due to ZBR, workload
++ * fluctuations or other factors. BFQ uses an ad hoc internal scheduler,
++ * called B-WF2Q+, to schedule processes according to their budgets. More
++ * precisely, BFQ schedules queues associated to processes. Thanks to the
++ * accurate policy of B-WF2Q+, BFQ can afford to assign high budgets to
++ * I/O-bound processes issuing sequential requests (to boost the
++ * throughput), and yet guarantee a low latency to interactive and soft
++ * real-time applications.
++ *
++ * BFQ is described in [1], where a reference to the initial, more
++ * theoretical paper on BFQ can also be found. The interested reader can find
++ * in the latter paper full details on the main algorithm, as well as
++ * formulas of the guarantees and formal proofs of all the properties.
++ * With respect to the version of BFQ presented in these papers, this
++ * implementation adds a few more heuristics, such as the one that
++ * guarantees a low latency to soft real-time applications, and a
++ * hierarchical extension based on H-WF2Q+.
++ *
++ * B-WF2Q+ is based on WF2Q+, that is described in [2], together with
++ * H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
++ * complexity derives from the one introduced with EEVDF in [3].
++ *
++ * [1] P. Valente and M. Andreolini, ``Improving Application Responsiveness
++ * with the BFQ Disk I/O Scheduler'',
++ * Proceedings of the 5th Annual International Systems and Storage
++ * Conference (SYSTOR '12), June 2012.
++ *
++ * http://algogroup.unimo.it/people/paolo/disk_sched/bf1-v1-suite-results.pdf
++ *
++ * [2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
++ * Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
++ * Oct 1997.
++ *
++ * http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
++ *
++ * [3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
++ * First: A Flexible and Accurate Mechanism for Proportional Share
++ * Resource Allocation,'' technical report.
++ *
++ * http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
++ */
++#include <linux/module.h>
++#include <linux/slab.h>
++#include <linux/blkdev.h>
++#include <linux/cgroup.h>
++#include <linux/elevator.h>
++#include <linux/jiffies.h>
++#include <linux/rbtree.h>
++#include <linux/ioprio.h>
++#include "bfq.h"
++#include "blk.h"
++
++/* Max number of dispatches in one round of service. */
++static const int bfq_quantum = 4;
++
++/* Expiration time of sync (0) and async (1) requests, in jiffies. */
++static const int bfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
++
++/* Maximum backwards seek, in KiB. */
++static const int bfq_back_max = 16 * 1024;
++
++/* Penalty of a backwards seek, in number of sectors. */
++static const int bfq_back_penalty = 2;
++
++/* Idling period duration, in jiffies. */
++static int bfq_slice_idle = HZ / 125;
++
++/* Default maximum budget values, in sectors and number of requests. */
++static const int bfq_default_max_budget = 16 * 1024;
++static const int bfq_max_budget_async_rq = 4;
++
++/*
++ * Async to sync throughput distribution is controlled as follows:
++ * when an async request is served, the entity is charged the number
++ * of sectors of the request, multiplied by the factor below
++ */
++static const int bfq_async_charge_factor = 10;
++
++/* Default timeout values, in jiffies, approximating CFQ defaults. */
++static const int bfq_timeout_sync = HZ / 8;
++static int bfq_timeout_async = HZ / 25;
++
++struct kmem_cache *bfq_pool;
++
++/* Below this threshold (in ms), we consider thinktime immediate. */
++#define BFQ_MIN_TT 2
++
++/* hw_tag detection: parallel requests threshold and min samples needed. */
++#define BFQ_HW_QUEUE_THRESHOLD 4
++#define BFQ_HW_QUEUE_SAMPLES 32
++
++#define BFQQ_SEEK_THR (sector_t)(8 * 1024)
++#define BFQQ_SEEKY(bfqq) ((bfqq)->seek_mean > BFQQ_SEEK_THR)
++
++/* Min samples used for peak rate estimation (for autotuning). */
++#define BFQ_PEAK_RATE_SAMPLES 32
++
++/* Shift used for peak rate fixed precision calculations. */
++#define BFQ_RATE_SHIFT 16
++
++/*
++ * By default, BFQ computes the duration of the weight raising for
++ * interactive applications automatically, using the following formula:
++ * duration = (R / r) * T, where r is the peak rate of the device, and
++ * R and T are two reference parameters.
++ * In particular, R is the peak rate of the reference device (see below),
++ * and T is a reference time: given the systems that are likely to be
++ * installed on the reference device according to its speed class, T is
++ * about the maximum time needed, under BFQ and while reading two files in
++ * parallel, to load typical large applications on these systems.
++ * In practice, the slower/faster the device at hand is, the more/less it
++ * takes to load applications with respect to the reference device.
++ * Accordingly, the longer/shorter BFQ grants weight raising to interactive
++ * applications.
++ *
++ * BFQ uses four different reference pairs (R, T), depending on:
++ * . whether the device is rotational or non-rotational;
++ * . whether the device is slow, such as old or portable HDDs, as well as
++ * SD cards, or fast, such as newer HDDs and SSDs.
++ *
++ * The device's speed class is dynamically (re)detected in
++ * bfq_update_peak_rate() every time the estimated peak rate is updated.
++ *
++ * In the following definitions, R_slow[0]/R_fast[0] and T_slow[0]/T_fast[0]
++ * are the reference values for a slow/fast rotational device, whereas
++ * R_slow[1]/R_fast[1] and T_slow[1]/T_fast[1] are the reference values for
++ * a slow/fast non-rotational device. Finally, device_speed_thresh are the
++ * thresholds used to switch between speed classes.
++ * Both the reference peak rates and the thresholds are measured in
++ * sectors/usec, left-shifted by BFQ_RATE_SHIFT.
++ */
++static int R_slow[2] = {1536, 10752};
++static int R_fast[2] = {17415, 34791};
++/*
++ * To improve readability, a conversion function is used to initialize the
++ * following arrays, which entails that they can be initialized only in a
++ * function.
++ */
++static int T_slow[2];
++static int T_fast[2];
++static int device_speed_thresh[2];
++
++#define BFQ_SERVICE_TREE_INIT ((struct bfq_service_tree) \
++ { RB_ROOT, RB_ROOT, NULL, NULL, 0, 0 })
++
++#define RQ_BIC(rq) ((struct bfq_io_cq *) (rq)->elv.priv[0])
++#define RQ_BFQQ(rq) ((rq)->elv.priv[1])
++
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd);
++
++#include "bfq-ioc.c"
++#include "bfq-sched.c"
++#include "bfq-cgroup.c"
++
++#define bfq_class_idle(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_IDLE)
++#define bfq_class_rt(bfqq) ((bfqq)->entity.ioprio_class ==\
++ IOPRIO_CLASS_RT)
++
++#define bfq_sample_valid(samples) ((samples) > 80)
++
++/*
++ * We regard a request as SYNC if either it is a read or has the SYNC bit
++ * set (in which case it could also be a direct WRITE).
++ */
++static inline int bfq_bio_sync(struct bio *bio)
++{
++ if (bio_data_dir(bio) == READ || (bio->bi_rw & REQ_SYNC))
++ return 1;
++
++ return 0;
++}
++
++/*
++ * Scheduler run of queue, if there are requests pending and no one in the
++ * driver that will restart queueing.
++ */
++static inline void bfq_schedule_dispatch(struct bfq_data *bfqd)
++{
++ if (bfqd->queued != 0) {
++ bfq_log(bfqd, "schedule dispatch");
++ kblockd_schedule_work(&bfqd->unplug_work);
++ }
++}
++
++/*
++ * Lifted from AS - choose which of rq1 and rq2 is best served now.
++ * We choose the request that is closest to the head right now. Distance
++ * behind the head is penalized and only allowed to a certain extent.
++ */
++static struct request *bfq_choose_req(struct bfq_data *bfqd,
++ struct request *rq1,
++ struct request *rq2,
++ sector_t last)
++{
++ sector_t s1, s2, d1 = 0, d2 = 0;
++ unsigned long back_max;
++#define BFQ_RQ1_WRAP 0x01 /* request 1 wraps */
++#define BFQ_RQ2_WRAP 0x02 /* request 2 wraps */
++ unsigned wrap = 0; /* bit mask: requests behind the disk head? */
++
++ if (rq1 == NULL || rq1 == rq2)
++ return rq2;
++ if (rq2 == NULL)
++ return rq1;
++
++ if (rq_is_sync(rq1) && !rq_is_sync(rq2))
++ return rq1;
++ else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
++ return rq2;
++ if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
++ return rq1;
++ else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
++ return rq2;
++
++ s1 = blk_rq_pos(rq1);
++ s2 = blk_rq_pos(rq2);
++
++ /*
++ * By definition, 1KiB is 2 sectors.
++ */
++ back_max = bfqd->bfq_back_max * 2;
++
++ /*
++ * Strict one way elevator _except_ in the case where we allow
++ * short backward seeks which are biased as twice the cost of a
++ * similar forward seek.
++ */
++ if (s1 >= last)
++ d1 = s1 - last;
++ else if (s1 + back_max >= last)
++ d1 = (last - s1) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ1_WRAP;
++
++ if (s2 >= last)
++ d2 = s2 - last;
++ else if (s2 + back_max >= last)
++ d2 = (last - s2) * bfqd->bfq_back_penalty;
++ else
++ wrap |= BFQ_RQ2_WRAP;
++
++ /* Found required data */
++
++ /*
++ * By doing switch() on the bit mask "wrap" we avoid having to
++ * check two variables for all permutations: --> faster!
++ */
++ switch (wrap) {
++ case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
++ if (d1 < d2)
++ return rq1;
++ else if (d2 < d1)
++ return rq2;
++ else {
++ if (s1 >= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++
++ case BFQ_RQ2_WRAP:
++ return rq1;
++ case BFQ_RQ1_WRAP:
++ return rq2;
++ case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */
++ default:
++ /*
++ * Since both rqs are wrapped,
++ * start with the one that's further behind head
++ * (--> only *one* back seek required),
++ * since back seek takes more time than forward.
++ */
++ if (s1 <= s2)
++ return rq1;
++ else
++ return rq2;
++ }
++}
++
++static struct bfq_queue *
++bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
++ sector_t sector, struct rb_node **ret_parent,
++ struct rb_node ***rb_link)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *bfqq = NULL;
++
++ parent = NULL;
++ p = &root->rb_node;
++ while (*p) {
++ struct rb_node **n;
++
++ parent = *p;
++ bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++
++ /*
++ * Sort strictly based on sector. Smallest to the left,
++ * largest to the right.
++ */
++ if (sector > blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_right;
++ else if (sector < blk_rq_pos(bfqq->next_rq))
++ n = &(*p)->rb_left;
++ else
++ break;
++ p = n;
++ bfqq = NULL;
++ }
++
++ *ret_parent = parent;
++ if (rb_link)
++ *rb_link = p;
++
++ bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
++ (long long unsigned)sector,
++ bfqq != NULL ? bfqq->pid : 0);
++
++ return bfqq;
++}
++
++static void bfq_rq_pos_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct rb_node **p, *parent;
++ struct bfq_queue *__bfqq;
++
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++
++ if (bfq_class_idle(bfqq))
++ return;
++ if (!bfqq->next_rq)
++ return;
++
++ bfqq->pos_root = &bfqd->rq_pos_tree;
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
++ blk_rq_pos(bfqq->next_rq), &parent, &p);
++ if (__bfqq == NULL) {
++ rb_link_node(&bfqq->pos_node, parent, p);
++ rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
++ } else
++ bfqq->pos_root = NULL;
++}
++
++/*
++ * Tell whether there are active queues or groups with differentiated weights.
++ */
++static inline bool bfq_differentiated_weights(struct bfq_data *bfqd)
++{
++ BUG_ON(!bfqd->hw_tag);
++ /*
++ * For weights to differ, at least one of the trees must contain
++ * at least two nodes.
++ */
++ return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
++ (bfqd->queue_weights_tree.rb_node->rb_left ||
++ bfqd->queue_weights_tree.rb_node->rb_right)
++#ifdef CONFIG_CGROUP_BFQIO
++ ) ||
++ (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
++ (bfqd->group_weights_tree.rb_node->rb_left ||
++ bfqd->group_weights_tree.rb_node->rb_right)
++#endif
++ );
++}
++
++/*
++ * If the weight-counter tree passed as input contains no counter for
++ * the weight of the input entity, then add that counter; otherwise just
++ * increment the existing counter.
++ *
++ * Note that weight-counter trees contain few nodes in mostly symmetric
++ * scenarios. For example, if all queues have the same weight, then the
++ * weight-counter tree for the queues may contain at most one node.
++ * This holds even if low_latency is on, because weight-raised queues
++ * are not inserted in the tree.
++ * In most scenarios, the rate at which nodes are created/destroyed
++ * should be low too.
++ */
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ struct rb_node **new = &(root->rb_node), *parent = NULL;
++
++ /*
++ * Do not insert if:
++ * - the device does not support queueing;
++ * - the entity is already associated with a counter, which happens if:
++ * 1) the entity is associated with a queue, 2) a request arrival
++ * has caused the queue to become both non-weight-raised, and hence
++ * change its weight, and backlogged; in this respect, each
++ * of the two events causes an invocation of this function,
++ * 3) this is the invocation of this function caused by the second
++ * event. This second invocation is actually useless, and we handle
++ * this fact by exiting immediately. More efficient or clearer
++ * solutions might possibly be adopted.
++ */
++ if (!bfqd->hw_tag || entity->weight_counter)
++ return;
++
++ while (*new) {
++ struct bfq_weight_counter *__counter = container_of(*new,
++ struct bfq_weight_counter,
++ weights_node);
++ parent = *new;
++
++ if (entity->weight == __counter->weight) {
++ entity->weight_counter = __counter;
++ goto inc_counter;
++ }
++ if (entity->weight < __counter->weight)
++ new = &((*new)->rb_left);
++ else
++ new = &((*new)->rb_right);
++ }
++
++	entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
++					 GFP_ATOMIC);
++	if (entity->weight_counter == NULL)
++		return;
++	entity->weight_counter->weight = entity->weight;
++ rb_link_node(&entity->weight_counter->weights_node, parent, new);
++ rb_insert_color(&entity->weight_counter->weights_node, root);
++
++inc_counter:
++ entity->weight_counter->num_active++;
++}
++
++/*
++ * Decrement the weight counter associated with the entity, and, if the
++ * counter reaches 0, remove the counter from the tree.
++ * See the comments to the function bfq_weights_tree_add() for considerations
++ * about overhead.
++ */
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root)
++{
++ /*
++ * Check whether the entity is actually associated with a counter.
++ * In fact, the device may not be considered NCQ-capable for a while,
++ * which implies that no insertion in the weight trees is performed,
++ * after which the device may start to be deemed NCQ-capable, and hence
++ * this function may start to be invoked. This may cause the function
++ * to be invoked for entities that are not associated with any counter.
++ */
++ if (!entity->weight_counter)
++ return;
++
++ BUG_ON(RB_EMPTY_ROOT(root));
++ BUG_ON(entity->weight_counter->weight != entity->weight);
++
++ BUG_ON(!entity->weight_counter->num_active);
++ entity->weight_counter->num_active--;
++ if (entity->weight_counter->num_active > 0)
++ goto reset_entity_pointer;
++
++ rb_erase(&entity->weight_counter->weights_node, root);
++ kfree(entity->weight_counter);
++
++reset_entity_pointer:
++ entity->weight_counter = NULL;
++}
++
++static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *last)
++{
++ struct rb_node *rbnext = rb_next(&last->rb_node);
++ struct rb_node *rbprev = rb_prev(&last->rb_node);
++ struct request *next = NULL, *prev = NULL;
++
++ BUG_ON(RB_EMPTY_NODE(&last->rb_node));
++
++ if (rbprev != NULL)
++ prev = rb_entry_rq(rbprev);
++
++ if (rbnext != NULL)
++ next = rb_entry_rq(rbnext);
++ else {
++ rbnext = rb_first(&bfqq->sort_list);
++ if (rbnext && rbnext != &last->rb_node)
++ next = rb_entry_rq(rbnext);
++ }
++
++ return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
++}
++
++/* see the definition of bfq_async_charge_factor for details */
++static inline unsigned long bfq_serv_to_charge(struct request *rq,
++ struct bfq_queue *bfqq)
++{
++ return blk_rq_sectors(rq) *
++ (1 + ((!bfq_bfqq_sync(bfqq)) * (bfqq->wr_coeff == 1) *
++ bfq_async_charge_factor));
++}
++
++/**
++ * bfq_updated_next_req - update the queue after a new next_rq selection.
++ * @bfqd: the device data the queue belongs to.
++ * @bfqq: the queue to update.
++ *
++ * If the first request of a queue changes we make sure that the queue
++ * has enough budget to serve at least its first request (if the
++ * request has grown). We do this because, if the queue does not have
++ * enough budget for its first request, it has to go through two dispatch
++ * rounds to actually get it dispatched.
++ */
++static void bfq_updated_next_req(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ struct request *next_rq = bfqq->next_rq;
++ unsigned long new_budget;
++
++ if (next_rq == NULL)
++ return;
++
++ if (bfqq == bfqd->in_service_queue)
++ /*
++ * In order not to break guarantees, budgets cannot be
++ * changed after an entity has been selected.
++ */
++ return;
++
++ BUG_ON(entity->tree != &st->active);
++ BUG_ON(entity == entity->sched_data->in_service_entity);
++
++ new_budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ if (entity->budget != new_budget) {
++ entity->budget = new_budget;
++ bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
++ new_budget);
++ bfq_activate_bfqq(bfqd, bfqq);
++ }
++}
++
++static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
++{
++ u64 dur;
++
++ if (bfqd->bfq_wr_max_time > 0)
++ return bfqd->bfq_wr_max_time;
++
++ dur = bfqd->RT_prod;
++ do_div(dur, bfqd->peak_rate);
++
++ return dur;
++}
++
++static void bfq_add_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *next_rq, *prev;
++ unsigned long old_wr_coeff = bfqq->wr_coeff;
++ int idle_for_long_time = 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
++ bfqq->queued[rq_is_sync(rq)]++;
++ bfqd->queued++;
++
++ elv_rb_add(&bfqq->sort_list, rq);
++
++ /*
++ * Check if this request is a better next-serve candidate.
++ */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++
++ /*
++ * Adjust priority tree position, if next_rq changes.
++ */
++ if (prev != bfqq->next_rq)
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++
++ if (!bfq_bfqq_busy(bfqq)) {
++ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ time_is_before_jiffies(bfqq->soft_rt_next_start);
++ idle_for_long_time = time_is_before_jiffies(
++ bfqq->budget_timeout +
++ bfqd->bfq_wr_min_idle_time);
++ entity->budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++
++ if (!bfq_bfqq_IO_bound(bfqq)) {
++ if (time_before(jiffies,
++ RQ_BIC(rq)->ttime.last_end_request +
++ bfqd->bfq_slice_idle)) {
++ bfqq->requests_within_timer++;
++ if (bfqq->requests_within_timer >=
++ bfqd->bfq_requests_within_timer)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ } else
++ bfqq->requests_within_timer = 0;
++ }
++
++ if (!bfqd->low_latency)
++ goto add_bfqq_busy;
++
++ /*
++ * If the queue is not being boosted and has been idle
++ * for enough time, start a weight-raising period
++ */
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ } else if (old_wr_coeff > 1) {
++ if (idle_for_long_time)
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++ else if (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt) {
++ bfqq->wr_coeff = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->
++ wr_cur_max_time));
++ } else if (time_before(
++ bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time,
++ jiffies +
++ bfqd->bfq_wr_rt_max_time) &&
++ soft_rt) {
++ /*
++ *
++ * The remaining weight-raising time is lower
++ * than bfqd->bfq_wr_rt_max_time, which
++ * means that the application is enjoying
++ * weight raising either because deemed soft-
++ * rt in the near past, or because deemed
++ * interactive long ago. In both cases,
++ * resetting now the current remaining weight-
++ * raising time for the application to the
++ * weight-raising duration for soft rt
++ * applications would not cause any latency
++ * increase for the application (as the new
++ * duration would be higher than the remaining
++ * time).
++ *
++ * In addition, the application is now meeting
++ * the requirements for being deemed soft rt.
++ * In the end we can correctly and safely
++ * (re)charge the weight-raising duration for
++ * the application with the weight-raising
++ * duration for soft rt applications.
++ *
++ * In particular, doing this recharge now, i.e.,
++ * before the weight-raising period for the
++ * application finishes, reduces the probability
++ * of the following negative scenario:
++ * 1) the weight of a soft rt application is
++ * raised at startup (as for any newly
++ * created application),
++ * 2) since the application is not interactive,
++ * at a certain time weight-raising is
++ * stopped for the application,
++ * 3) at that time the application happens to
++ * still have pending requests, and hence
++ * is destined to not have a chance to be
++ * deemed soft rt before these requests are
++ * completed (see the comments to the
++ * function bfq_bfqq_softrt_next_start()
++ * for details on soft rt detection),
++ * 4) these pending requests experience a high
++ * latency because the application is not
++ * weight-raised while they are pending.
++ */
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->wr_cur_max_time =
++ bfqd->bfq_wr_rt_max_time;
++ }
++ }
++ if (old_wr_coeff != bfqq->wr_coeff)
++ entity->ioprio_changed = 1;
++add_bfqq_busy:
++ bfqq->last_idle_bklogged = jiffies;
++ bfqq->service_from_backlogged = 0;
++ bfq_clear_bfqq_softrt_update(bfqq);
++ bfq_add_bfqq_busy(bfqd, bfqq);
++ } else {
++ if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&
++ time_is_before_jiffies(
++ bfqq->last_wr_start_finish +
++ bfqd->bfq_wr_min_inter_arr_async)) {
++ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
++
++ bfqd->wr_busy_queues++;
++ entity->ioprio_changed = 1;
++ bfq_log_bfqq(bfqd, bfqq,
++ "non-idle wrais starting at %lu, rais_max_time %u",
++ jiffies,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++ if (prev != bfqq->next_rq)
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ if (bfqd->low_latency &&
++ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
++ idle_for_long_time))
++ bfqq->last_wr_start_finish = jiffies;
++}
++
++static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,
++ struct bio *bio)
++{
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return NULL;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ if (bfqq != NULL)
++ return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));
++
++ return NULL;
++}
++
++static void bfq_activate_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ bfqd->rq_in_driver++;
++ bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
++ bfq_log(bfqd, "activate_request: new bfqd->last_position %llu",
++ (long long unsigned)bfqd->last_position);
++}
++
++static inline void bfq_deactivate_request(struct request_queue *q,
++ struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++
++ BUG_ON(bfqd->rq_in_driver == 0);
++ bfqd->rq_in_driver--;
++}
++
++static void bfq_remove_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ const int sync = rq_is_sync(rq);
++
++ if (bfqq->next_rq == rq) {
++ bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);
++ bfq_updated_next_req(bfqd, bfqq);
++ }
++
++ list_del_init(&rq->queuelist);
++ BUG_ON(bfqq->queued[sync] == 0);
++ bfqq->queued[sync]--;
++ bfqd->queued--;
++ elv_rb_del(&bfqq->sort_list, rq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue)
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ /*
++ * Remove queue from request-position tree as it is empty.
++ */
++ if (bfqq->pos_root != NULL) {
++ rb_erase(&bfqq->pos_node, bfqq->pos_root);
++ bfqq->pos_root = NULL;
++ }
++ }
++
++ if (rq->cmd_flags & REQ_META) {
++ BUG_ON(bfqq->meta_pending == 0);
++ bfqq->meta_pending--;
++ }
++}
++
++static int bfq_merge(struct request_queue *q, struct request **req,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct request *__rq;
++
++ __rq = bfq_find_rq_fmerge(bfqd, bio);
++ if (__rq != NULL && elv_rq_merge_ok(__rq, bio)) {
++ *req = __rq;
++ return ELEVATOR_FRONT_MERGE;
++ }
++
++ return ELEVATOR_NO_MERGE;
++}
++
++static void bfq_merged_request(struct request_queue *q, struct request *req,
++ int type)
++{
++ if (type == ELEVATOR_FRONT_MERGE &&
++ rb_prev(&req->rb_node) &&
++ blk_rq_pos(req) <
++ blk_rq_pos(container_of(rb_prev(&req->rb_node),
++ struct request, rb_node))) {
++ struct bfq_queue *bfqq = RQ_BFQQ(req);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ struct request *prev, *next_rq;
++
++ /* Reposition request in its sort_list */
++ elv_rb_del(&bfqq->sort_list, req);
++ elv_rb_add(&bfqq->sort_list, req);
++ /* Choose next request to be served for bfqq */
++ prev = bfqq->next_rq;
++ next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,
++ bfqd->last_position);
++ BUG_ON(next_rq == NULL);
++ bfqq->next_rq = next_rq;
++ /*
++ * If next_rq changes, update both the queue's budget to
++ * fit the new request and the queue's position in its
++ * rq_pos_tree.
++ */
++ if (prev != bfqq->next_rq) {
++ bfq_updated_next_req(bfqd, bfqq);
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++ }
++}
++
++static void bfq_merged_requests(struct request_queue *q, struct request *rq,
++ struct request *next)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * Reposition in fifo if next is older than rq.
++ */
++ if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
++ time_before(next->fifo_time, rq->fifo_time)) {
++ list_move(&rq->queuelist, &next->queuelist);
++ rq->fifo_time = next->fifo_time;
++ }
++
++ if (bfqq->next_rq == next)
++ bfqq->next_rq = rq;
++
++ bfq_remove_request(next);
++}
++
++/* Must be called with bfqq != NULL */
++static inline void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq == NULL);
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues--;
++ bfqq->wr_coeff = 1;
++ bfqq->wr_cur_max_time = 0;
++ /* Trigger a weight change on the next activation of the queue */
++ bfqq->entity.ioprio_changed = 1;
++}
++
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ if (bfqg->async_bfqq[i][j] != NULL)
++ bfq_bfqq_end_wr(bfqg->async_bfqq[i][j]);
++ if (bfqg->async_idle_bfqq != NULL)
++ bfq_bfqq_end_wr(bfqg->async_idle_bfqq);
++}
++
++static void bfq_end_wr(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq;
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)
++ bfq_bfqq_end_wr(bfqq);
++ bfq_end_wr_async(bfqd);
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (!bfqq)
++ bfqq = bfq_get_next_queue(bfqd);
++ else
++ bfq_get_next_queue_forced(bfqd, bfqq);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
++static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
++ struct request *rq)
++{
++ if (blk_rq_pos(rq) >= bfqd->last_position)
++ return blk_rq_pos(rq) - bfqd->last_position;
++ else
++ return bfqd->last_position - blk_rq_pos(rq);
++}
++
++/*
++ * Return true if rq is close enough to bfqd->last_position, i.e., lies
++ * within BFQQ_SEEK_THR sectors of it.
++ */
++static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++{
++ return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++}
++
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++{
++ struct rb_root *root = &bfqd->rq_pos_tree;
++ struct rb_node *parent, *node;
++ struct bfq_queue *__bfqq;
++ sector_t sector = bfqd->last_position;
++
++ if (RB_EMPTY_ROOT(root))
++ return NULL;
++
++ /*
++ * First, if we find a request starting at the end of the last
++ * request, choose it.
++ */
++ __bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);
++ if (__bfqq != NULL)
++ return __bfqq;
++
++ /*
++ * If the exact sector wasn't found, the parent of the NULL leaf
++ * will contain the closest sector (rq_pos_tree sorted by
++ * next_request position).
++ */
++ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ if (blk_rq_pos(__bfqq->next_rq) < sector)
++ node = rb_next(&__bfqq->pos_node);
++ else
++ node = rb_prev(&__bfqq->pos_node);
++ if (node == NULL)
++ return NULL;
++
++ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
++ if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ return __bfqq;
++
++ return NULL;
++}
++
++/*
++ * bfqd - obvious
++ * cur_bfqq - passed in so that we don't decide that the current queue
++ * is closely cooperating with itself.
++ *
++ * We are assuming that cur_bfqq has dispatched at least one request,
++ * and that bfqd->last_position reflects a position on the disk associated
++ * with the I/O issued by cur_bfqq.
++ */
++static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
++ struct bfq_queue *cur_bfqq)
++{
++ struct bfq_queue *bfqq;
++
++ if (bfq_class_idle(cur_bfqq))
++ return NULL;
++ if (!bfq_bfqq_sync(cur_bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(cur_bfqq))
++ return NULL;
++
++ /* If device has only one backlogged bfq_queue, don't search. */
++ if (bfqd->busy_queues == 1)
++ return NULL;
++
++ /*
++ * We should notice if some of the queues are cooperating, e.g.
++ * working closely on the same area of the disk. In that case,
++ * we can group them together and don't waste time idling.
++ */
++ bfqq = bfqq_close(bfqd);
++ if (bfqq == NULL || bfqq == cur_bfqq)
++ return NULL;
++
++ /*
++ * Do not merge queues from different bfq_groups.
++ */
++ if (bfqq->entity.parent != cur_bfqq->entity.parent)
++ return NULL;
++
++ /*
++ * It only makes sense to merge sync queues.
++ */
++ if (!bfq_bfqq_sync(bfqq))
++ return NULL;
++ if (BFQQ_SEEKY(bfqq))
++ return NULL;
++
++ /*
++ * Do not merge queues of different priority classes.
++ */
++ if (bfq_class_rt(bfqq) != bfq_class_rt(cur_bfqq))
++ return NULL;
++
++ return bfqq;
++}
++
++/*
++ * If enough samples have been computed, return the current max budget
++ * stored in bfqd, which is dynamically updated according to the
++ * estimated disk peak rate; otherwise return the default max budget
++ */
++static inline unsigned long bfq_max_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget;
++ else
++ return bfqd->bfq_max_budget;
++}
++
++/*
++ * Return min budget, which is a fraction of the current or default
++ * max budget (trying with 1/32)
++ */
++static inline unsigned long bfq_min_budget(struct bfq_data *bfqd)
++{
++ if (bfqd->budgets_assigned < 194)
++ return bfq_default_max_budget / 32;
++ else
++ return bfqd->bfq_max_budget / 32;
++}
++
++static void bfq_arm_slice_timer(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ struct bfq_io_cq *bic;
++ unsigned long sl;
++
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Processes have exited, don't wait. */
++ bic = bfqd->in_service_bic;
++ if (bic == NULL || atomic_read(&bic->icq.ioc->active_ref) == 0)
++ return;
++
++ bfq_mark_bfqq_wait_request(bfqq);
++
++ /*
++ * We don't want to idle for seeks, but we do want to allow
++ * fair distribution of slice time for a process doing back-to-back
++	 * seeks. So allow a little bit of time for it to submit a new rq.
++ *
++ * To prevent processes with (partly) seeky workloads from
++ * being too ill-treated, grant them a small fraction of the
++ * assigned budget before reducing the waiting time to
++ * BFQ_MIN_TT. This happened to help reduce latency.
++ */
++ sl = bfqd->bfq_slice_idle;
++ /*
++ * Unless the queue is being weight-raised, grant only minimum idle
++ * time if the queue either has been seeky for long enough or has
++ * already proved to be constantly seeky.
++ */
++ if (bfq_sample_valid(bfqq->seek_samples) &&
++ ((BFQQ_SEEKY(bfqq) && bfqq->entity.service >
++ bfq_max_budget(bfqq->bfqd) / 8) ||
++ bfq_bfqq_constantly_seeky(bfqq)) && bfqq->wr_coeff == 1)
++ sl = min(sl, msecs_to_jiffies(BFQ_MIN_TT));
++ else if (bfqq->wr_coeff > 1)
++ sl = sl * 3;
++ bfqd->last_idling_start = ktime_get();
++ mod_timer(&bfqd->idle_slice_timer, jiffies + sl);
++ bfq_log(bfqd, "arm idle: %u/%u ms",
++ jiffies_to_msecs(sl), jiffies_to_msecs(bfqd->bfq_slice_idle));
++}
++
++/*
++ * Set the maximum time for the in-service queue to consume its
++ * budget. This prevents seeky processes from lowering the disk
++ * throughput (always guaranteed with a time slice scheme as in CFQ).
++ */
++static void bfq_set_budget_timeout(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfqd->in_service_queue;
++ unsigned int timeout_coeff;
++ if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)
++ timeout_coeff = 1;
++ else
++ timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;
++
++ bfqd->last_budget_start = ktime_get();
++
++ bfq_clear_bfqq_budget_new(bfqq);
++ bfqq->budget_timeout = jiffies +
++ bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] * timeout_coeff;
++
++ bfq_log_bfqq(bfqd, bfqq, "set budget_timeout %u",
++ jiffies_to_msecs(bfqd->bfq_timeout[bfq_bfqq_sync(bfqq)] *
++ timeout_coeff));
++}
++
++/*
++ * Move request from internal lists to the request queue dispatch list.
++ */
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ /*
++ * For consistency, the next instruction should have been executed
++ * after removing the request from the queue and dispatching it.
++ * We execute instead this instruction before bfq_remove_request()
++ * (and hence introduce a temporary inconsistency), for efficiency.
++ * In fact, in a forced_dispatch, this prevents two counters related
++	 * to bfqq->dispatched from being uselessly decremented if bfqq
++	 * is not in service, and then incremented again after
++ * incrementing bfqq->dispatched.
++ */
++ bfqq->dispatched++;
++ bfq_remove_request(rq);
++ elv_dispatch_sort(q, rq);
++
++ if (bfq_bfqq_sync(bfqq))
++ bfqd->sync_flight++;
++}
++
++/*
++ * Return expired entry, or NULL to just start from scratch in rbtree.
++ */
++static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
++{
++ struct request *rq = NULL;
++
++ if (bfq_bfqq_fifo_expire(bfqq))
++ return NULL;
++
++ bfq_mark_bfqq_fifo_expire(bfqq);
++
++ if (list_empty(&bfqq->fifo))
++ return NULL;
++
++ rq = rq_entry_fifo(bfqq->fifo.next);
++
++ if (time_before(jiffies, rq->fifo_time))
++ return NULL;
++
++ return rq;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
++static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return;
++
++ /*
++ * Merge in the direction of the lesser amount of work.
++ */
++ if (new_process_refs >= process_refs) {
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ } else {
++ new_bfqq->new_bfqq = bfqq;
++ atomic_add(new_process_refs, &bfqq->ref);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++}
++
++static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ return entity->budget - entity->service;
++}
++
++static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ /*
++ * If this bfqq is shared between multiple processes, check
++ * to make sure that those processes are still issuing I/Os
++ * within the mean seek distance. If not, it may be time to
++ * break the queues apart again.
++ */
++ if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
++ bfq_mark_bfqq_split_coop(bfqq);
++
++ if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * Overloading budget_timeout field to store the time
++ * at which the queue remains with no backlog; used by
++ * the weight-raising mechanism.
++ */
++ bfqq->budget_timeout = jiffies;
++ bfq_del_bfqq_busy(bfqd, bfqq, 1);
++ } else {
++ bfq_activate_bfqq(bfqd, bfqq);
++ /*
++ * Resort priority tree of potential close cooperators.
++ */
++ bfq_rq_pos_tree_add(bfqd, bfqq);
++ }
++}
++
++/**
++ * __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.
++ * @bfqd: device data.
++ * @bfqq: queue to update.
++ * @reason: reason for expiration.
++ *
++ * Handle the feedback on @bfqq budget. See the body for detailed
++ * comments.
++ */
++static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ enum bfqq_expiration reason)
++{
++ struct request *next_rq;
++ unsigned long budget, min_budget;
++
++ budget = bfqq->max_budget;
++ min_budget = bfq_min_budget(bfqd);
++
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %lu, budg left %lu",
++ bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %lu, min budg %lu",
++ budget, bfq_min_budget(bfqd));
++ bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",
++ bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));
++
++ if (bfq_bfqq_sync(bfqq)) {
++ switch (reason) {
++ /*
++ * Caveat: in all the following cases we trade latency
++ * for throughput.
++ */
++ case BFQ_BFQQ_TOO_IDLE:
++ /*
++ * This is the only case where we may reduce
++ * the budget: if there is no request of the
++ * process still waiting for completion, then
++ * we assume (tentatively) that the timer has
++ * expired because the batch of requests of
++ * the process could have been served with a
++	 * smaller budget. Hence, betting that the
++	 * process will behave in the same way when it
++ * becomes backlogged again, we reduce its
++ * next budget. As long as we guess right,
++ * this budget cut reduces the latency
++ * experienced by the process.
++ *
++ * However, if there are still outstanding
++ * requests, then the process may have not yet
++ * issued its next request just because it is
++ * still waiting for the completion of some of
++ * the still outstanding ones. So in this
++ * subcase we do not reduce its budget, on the
++ * contrary we increase it to possibly boost
++ * the throughput, as discussed in the
++ * comments to the BUDGET_TIMEOUT case.
++ */
++ if (bfqq->dispatched > 0) /* still outstanding reqs */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ else {
++ if (budget > 5 * min_budget)
++ budget -= 4 * min_budget;
++ else
++ budget = min_budget;
++ }
++ break;
++ case BFQ_BFQQ_BUDGET_TIMEOUT:
++ /*
++ * We double the budget here because: 1) it
++ * gives the chance to boost the throughput if
++ * this is not a seeky process (which may have
++ * bumped into this timeout because of, e.g.,
++ * ZBR), 2) together with charge_full_budget
++ * it helps give seeky processes higher
++ * timestamps, and hence be served less
++ * frequently.
++ */
++ budget = min(budget * 2, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_BUDGET_EXHAUSTED:
++ /*
++ * The process still has backlog, and did not
++ * let either the budget timeout or the disk
++ * idling timeout expire. Hence it is not
++ * seeky, has a short thinktime and may be
++ * happy with a higher budget too. So
++ * definitely increase the budget of this good
++ * candidate to boost the disk throughput.
++ */
++ budget = min(budget * 4, bfqd->bfq_max_budget);
++ break;
++ case BFQ_BFQQ_NO_MORE_REQUESTS:
++ /*
++ * Leave the budget unchanged.
++ */
++ default:
++ return;
++ }
++ } else /* async queue */
++ /* async queues always get the maximum possible budget
++ * (their ability to dispatch is limited by
++ * @bfqd->bfq_max_budget_async_rq).
++ */
++ budget = bfqd->bfq_max_budget;
++
++ bfqq->max_budget = budget;
++
++ if (bfqd->budgets_assigned >= 194 && bfqd->bfq_user_max_budget == 0 &&
++ bfqq->max_budget > bfqd->bfq_max_budget)
++ bfqq->max_budget = bfqd->bfq_max_budget;
++
++ /*
++ * Make sure that we have enough budget for the next request.
++ * Since the finish time of the bfqq must be kept in sync with
++ * the budget, be sure to call __bfq_bfqq_expire() after the
++ * update.
++ */
++ next_rq = bfqq->next_rq;
++ if (next_rq != NULL)
++ bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,
++ bfq_serv_to_charge(next_rq, bfqq));
++ else
++ bfqq->entity.budget = bfqq->max_budget;
++
++ bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %lu",
++ next_rq != NULL ? blk_rq_sectors(next_rq) : 0,
++ bfqq->entity.budget);
++}
++
++static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
++{
++ unsigned long max_budget;
++
++ /*
++ * The max_budget calculated when autotuning is equal to the
++ * amount of sectors transferred in timeout_sync at the
++ * estimated peak rate.
++ */
++ max_budget = (unsigned long)(peak_rate * 1000 *
++ timeout >> BFQ_RATE_SHIFT);
++
++ return max_budget;
++}
++
++/*
++ * In addition to updating the peak rate, checks whether the process
++ * is "slow", and returns 1 if so. This slow flag is used, in addition
++ * to the budget timeout, to reduce the amount of service provided to
++ * seeky processes, and hence reduce their chances to lower the
++ * throughput. See the code for more details.
++ */
++static int bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int compensate, enum bfqq_expiration reason)
++{
++ u64 bw, usecs, expected, timeout;
++ ktime_t delta;
++ int update = 0;
++
++ if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
++ return 0;
++
++ if (compensate)
++ delta = bfqd->last_idling_start;
++ else
++ delta = ktime_get();
++ delta = ktime_sub(delta, bfqd->last_budget_start);
++ usecs = ktime_to_us(delta);
++
++ /* Don't trust short/unrealistic values. */
++ if (usecs < 100 || usecs >= LONG_MAX)
++ return 0;
++
++ /*
++ * Calculate the bandwidth for the last slice. We use a 64 bit
++ * value to store the peak rate, in sectors per usec in fixed
++ * point math. We do so to have enough precision in the estimate
++ * and to avoid overflows.
++ */
++ bw = (u64)bfqq->entity.service << BFQ_RATE_SHIFT;
++ do_div(bw, (unsigned long)usecs);
++
++ timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ /*
++ * Use only long (> 20ms) intervals to filter out spikes for
++ * the peak rate estimation.
++ */
++ if (usecs > 20000) {
++ if (bw > bfqd->peak_rate ||
++ (!BFQQ_SEEKY(bfqq) &&
++ reason == BFQ_BFQQ_BUDGET_TIMEOUT)) {
++ bfq_log(bfqd, "measured bw =%llu", bw);
++ /*
++ * To smooth oscillations use a low-pass filter with
++ * alpha=7/8, i.e.,
++ * new_rate = (7/8) * old_rate + (1/8) * bw
++ */
++ do_div(bw, 8);
++ if (bw == 0)
++ return 0;
++ bfqd->peak_rate *= 7;
++ do_div(bfqd->peak_rate, 8);
++ bfqd->peak_rate += bw;
++ update = 1;
++ bfq_log(bfqd, "new peak_rate=%llu", bfqd->peak_rate);
++ }
++
++ update |= bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES - 1;
++
++ if (bfqd->peak_rate_samples < BFQ_PEAK_RATE_SAMPLES)
++ bfqd->peak_rate_samples++;
++
++ if (bfqd->peak_rate_samples == BFQ_PEAK_RATE_SAMPLES &&
++ update) {
++ int dev_type = blk_queue_nonrot(bfqd->queue);
++ if (bfqd->bfq_user_max_budget == 0) {
++ bfqd->bfq_max_budget =
++ bfq_calc_max_budget(bfqd->peak_rate,
++ timeout);
++ bfq_log(bfqd, "new max_budget=%lu",
++ bfqd->bfq_max_budget);
++ }
++ if (bfqd->device_speed == BFQ_BFQD_FAST &&
++ bfqd->peak_rate < device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_SLOW;
++ bfqd->RT_prod = R_slow[dev_type] *
++ T_slow[dev_type];
++ } else if (bfqd->device_speed == BFQ_BFQD_SLOW &&
++ bfqd->peak_rate > device_speed_thresh[dev_type]) {
++ bfqd->device_speed = BFQ_BFQD_FAST;
++ bfqd->RT_prod = R_fast[dev_type] *
++ T_fast[dev_type];
++ }
++ }
++ }
++
++ /*
++ * If the process has been served for too short a time
++ * interval to let its possible sequential accesses prevail over
++ * the initial seek time needed to move the disk head to the
++ * first sector it requested, then give the process a chance
++ * and for the moment return false.
++ */
++ if (bfqq->entity.budget <= bfq_max_budget(bfqd) / 8)
++ return 0;
++
++ /*
++ * A process is considered ``slow'' (i.e., seeky, so that we
++ * cannot treat it fairly in the service domain, as it would
++ * slow down the other processes too much) if, when a slice
++ * ends for whatever reason, it has received service at a
++ * rate that would not be high enough to complete the budget
++ * before the budget timeout expiration.
++ */
++ expected = bw * 1000 * timeout >> BFQ_RATE_SHIFT;
++
++ /*
++ * Caveat: processes doing IO in the slower disk zones will
++ * tend to be slow(er) even if not seeky. And the estimated
++ * peak rate will actually be an average over the disk
++ * surface. Hence, to not be too harsh with unlucky processes,
++ * we keep a budget/3 margin of safety before declaring a
++ * process slow.
++ */
++ return expected > (4 * bfqq->entity.budget) / 3;
++}
++
++/*
++ * To be deemed as soft real-time, an application must meet two
++ * requirements. First, the application must not require an average
++ * bandwidth higher than the approximate bandwidth required to playback or
++ * record a compressed high-definition video.
++ * The next function is invoked on the completion of the last request of a
++ * batch, to compute the next-start time instant, soft_rt_next_start, such
++ * that, if the next request of the application does not arrive before
++ * soft_rt_next_start, then the above requirement on the bandwidth is met.
++ *
++ * The second requirement is that the request pattern of the application is
++ * isochronous, i.e., that, after issuing a request or a batch of requests,
++ * the application stops issuing new requests until all its pending requests
++ * have been completed. After that, the application may issue a new batch,
++ * and so on.
++ * For this reason the next function is invoked to compute
++ * soft_rt_next_start only for applications that meet this requirement,
++ * whereas soft_rt_next_start is set to infinity for applications that do
++ * not.
++ *
++ * Unfortunately, even a greedy application may happen to behave in an
++ * isochronous way if the CPU load is high. In fact, the application may
++ * stop issuing requests while the CPUs are busy serving other processes,
++ * then restart, then stop again for a while, and so on. In addition, if
++ * the disk achieves a low enough throughput with the request pattern
++ * issued by the application (e.g., because the request pattern is random
++ * and/or the device is slow), then the application may meet the above
++ * bandwidth requirement too. To prevent such a greedy application from
++ * being deemed as soft real-time, a further rule is used in the computation of
++ * soft_rt_next_start: soft_rt_next_start must be higher than the current
++ * time plus the maximum time for which the arrival of a request is waited
++ * for when a sync queue becomes idle, namely bfqd->bfq_slice_idle.
++ * This filters out greedy applications, as the latter issue instead their
++ * next request as soon as possible after the last one has been completed
++ * (in contrast, when a batch of requests is completed, a soft real-time
++ * application spends some time processing data).
++ *
++ * Unfortunately, the last filter may easily generate false positives if
++ * only bfqd->bfq_slice_idle is used as a reference time interval and one
++ * or both the following cases occur:
++ * 1) HZ is so low that the duration of a jiffy is comparable to or higher
++ * than bfqd->bfq_slice_idle. This happens, e.g., on slow devices with
++ * HZ=100.
++ * 2) jiffies, instead of increasing at a constant rate, may stop increasing
++ * for a while, then suddenly 'jump' by several units to recover the lost
++ * increments. This seems to happen, e.g., inside virtual machines.
++ * To address this issue, we do not use as a reference time interval just
++ * bfqd->bfq_slice_idle, but bfqd->bfq_slice_idle plus a few jiffies. In
++ * particular we add the minimum number of jiffies for which the filter
++ * seems to be quite precise also in embedded systems and KVM/QEMU virtual
++ * machines.
++ */
++static inline unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ return max(bfqq->last_idle_bklogged +
++ HZ * bfqq->service_from_backlogged /
++ bfqd->bfq_wr_max_softrt_rate,
++ jiffies + bfqq->bfqd->bfq_slice_idle + 4);
++}
++
++/*
++ * Return the largest-possible time instant such that, for as long as possible,
++ * the current time will be lower than this time instant according to the macro
++ * time_is_before_jiffies().
++ */
++static inline unsigned long bfq_infinity_from_now(unsigned long now)
++{
++ return now + ULONG_MAX / 2;
++}
++
++/**
++ * bfq_bfqq_expire - expire a queue.
++ * @bfqd: device owning the queue.
++ * @bfqq: the queue to expire.
++ * @compensate: if true, compensate for the time spent idling.
++ * @reason: the reason causing the expiration.
++ *
++ * If the process associated to the queue is slow (i.e., seeky), or in
++ * case of budget timeout, or, finally, if it is async, we
++ * artificially charge it an entire budget (independently of the
++ * actual service it received). As a consequence, the queue will get
++ * higher timestamps than the correct ones upon reactivation, and
++ * hence it will be rescheduled as if it had received more service
++ * than what it actually received. In the end, this class of processes
++ * will receive less service in proportion to how slowly they consume
++ * their budgets (and hence how seriously they tend to lower the
++ * throughput).
++ *
++ * In contrast, when a queue expires because it has been idling for
++ * too long or because it exhausted its budget, we do not touch the
++ * amount of service it has received. Hence when the queue will be
++ * reactivated and its timestamps updated, the latter will be in sync
++ * with the actual service received by the queue until expiration.
++ *
++ * Charging a full budget to the first type of queues and the exact
++ * service to the others has the effect of using the WF2Q+ policy to
++ * schedule the former on a timeslice basis, without violating the
++ * service domain guarantees of the latter.
++ */
++static void bfq_bfqq_expire(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ int compensate,
++ enum bfqq_expiration reason)
++{
++ int slow;
++ BUG_ON(bfqq != bfqd->in_service_queue);
++
++ /* Update disk peak rate for autotuning and check whether the
++ * process is slow (see bfq_update_peak_rate).
++ */
++ slow = bfq_update_peak_rate(bfqd, bfqq, compensate, reason);
++
++ /*
++ * As explained above, 'punish' slow (i.e., seeky), timed-out
++ * and async queues, to favor sequential sync workloads.
++ *
++ * Processes doing I/O in the slower disk zones will tend to be
++ * slow(er) even if not seeky. Hence, since the estimated peak
++ * rate is actually an average over the disk surface, these
++ * processes may timeout just for bad luck. To avoid punishing
++ * them we do not charge a full budget to a process that
++ * succeeded in consuming at least 2/3 of its budget.
++ */
++ if (slow || (reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3))
++ bfq_bfqq_charge_full_budget(bfqq);
++
++ bfqq->service_from_backlogged += bfqq->entity.service;
++
++ if (BFQQ_SEEKY(bfqq) && reason == BFQ_BFQQ_BUDGET_TIMEOUT &&
++ !bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_mark_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++
++ if (reason == BFQ_BFQQ_TOO_IDLE &&
++ bfqq->entity.service <= 2 * bfqq->entity.budget / 10)
++ bfq_clear_bfqq_IO_bound(bfqq);
++
++ if (bfqd->low_latency && bfqq->wr_coeff == 1)
++ bfqq->last_wr_start_finish = jiffies;
++
++ if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list)) {
++ /*
++ * If we get here, and there are no outstanding requests,
++ * then the request pattern is isochronous (see the comments
++ * to the function bfq_bfqq_softrt_next_start()). Hence we
++ * can compute soft_rt_next_start. If, instead, the queue
++ * still has outstanding requests, then we have to wait
++ * for the completion of all the outstanding requests to
++ * discover whether the request pattern is actually
++ * isochronous.
++ */
++ if (bfqq->dispatched == 0)
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++ else {
++ /*
++ * The application is still waiting for the
++ * completion of one or more requests:
++ * prevent it from possibly being incorrectly
++ * deemed as soft real-time by setting its
++ * soft_rt_next_start to infinity. In fact,
++ * without this assignment, the application
++ * would be incorrectly deemed as soft
++ * real-time if:
++ * 1) it issued a new request before the
++ * completion of all its in-flight
++ * requests, and
++ * 2) at that time, its soft_rt_next_start
++ * happened to be in the past.
++ */
++ bfqq->soft_rt_next_start =
++ bfq_infinity_from_now(jiffies);
++ /*
++ * Schedule an update of soft_rt_next_start to when
++ * the task may be discovered to be isochronous.
++ */
++ bfq_mark_bfqq_softrt_update(bfqq);
++ }
++ }
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "expire (%d, slow %d, num_disp %d, idle_win %d)", reason,
++ slow, bfqq->dispatched, bfq_bfqq_idle_window(bfqq));
++
++ /*
++ * Increase, decrease or leave budget unchanged according to
++ * reason.
++ */
++ __bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
++ __bfq_bfqq_expire(bfqd, bfqq);
++}
++
++/*
++ * Budget timeout is not implemented through a dedicated timer, but
++ * just checked on request arrivals and completions, as well as on
++ * idle timer expirations.
++ */
++static int bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_budget_new(bfqq) ||
++ time_before(jiffies, bfqq->budget_timeout))
++ return 0;
++ return 1;
++}
++
++/*
++ * If we expire a queue that is waiting for the arrival of a new
++ * request, we may prevent the fictitious timestamp back-shifting that
++ * allows the guarantees of the queue to be preserved (see [1] for
++ * this tricky aspect). Hence we return true only if this condition
++ * does not hold, or if the queue is slow enough to deserve only to be
++ * kicked off for preserving a high throughput.
++ */
++static inline int bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "may_budget_timeout: wait_request %d left %d timeout %d",
++ bfq_bfqq_wait_request(bfqq),
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3,
++ bfq_bfqq_budget_timeout(bfqq));
++
++ return (!bfq_bfqq_wait_request(bfqq) ||
++ bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)
++ &&
++ bfq_bfqq_budget_timeout(bfqq);
++}
++
++/*
++ * Device idling is allowed only for the queues for which this function
++ * returns true. For this reason, the return value of this function plays a
++ * critical role for both throughput boosting and service guarantees. The
++ * return value is computed through a logical expression. In this rather
++ * long comment, we try to briefly describe all the details and motivations
++ * behind the components of this logical expression.
++ *
++ * First, the expression may be true only for sync queues. Besides, if
++ * bfqq is also being weight-raised, then the expression always evaluates
++ * to true, as device idling is instrumental for preserving low-latency
++ * guarantees (see [1]). Otherwise, the expression evaluates to true only
++ * if bfqq has a non-null idle window and at least one of the following
++ * two conditions holds. The first condition is that the device is not
++ * performing NCQ, because idling the device most certainly boosts the
++ * throughput if this condition holds and bfqq has been granted a non-null
++ * idle window. The second compound condition is made of the logical AND of
++ * two components.
++ *
++ * The first component is true only if there is no weight-raised busy
++ * queue. This guarantees that the device is not idled for a sync non-
++ * weight-raised queue when there are busy weight-raised queues. The former
++ * is then expired immediately if empty. Combined with the timestamping
++ * rules of BFQ (see [1] for details), this causes sync non-weight-raised
++ * queues to get a lower number of requests served, and hence to ask for a
++ * lower number of requests from the request pool, before the busy weight-
++ * raised queues get served again.
++ *
++ * This is beneficial for the processes associated with weight-raised
++ * queues, when the request pool is saturated (e.g., in the presence of
++ * write hogs). In fact, if the processes associated with the other queues
++ * ask for requests at a lower rate, then weight-raised processes have a
++ * higher probability to get a request from the pool immediately (or at
++ * least soon) when they need one. Hence they have a higher probability to
++ * actually get a fraction of the disk throughput proportional to their
++ * high weight. This is especially true with NCQ-capable drives, which
++ * enqueue several requests in advance and further reorder internally-
++ * queued requests.
++ *
++ * In the end, mistreating non-weight-raised queues when there are busy
++ * weight-raised queues seems to mitigate starvation problems in the
++ * presence of heavy write workloads and NCQ, and hence to guarantee a
++ * higher application and system responsiveness in these hostile scenarios.
++ *
++ * If the first component of the compound condition is instead true, i.e.,
++ * there is no weight-raised busy queue, then the second component of the
++ * compound condition takes into account service-guarantee and throughput
++ * issues related to NCQ (recall that the compound condition is evaluated
++ * only if the device is detected as supporting NCQ).
++ *
++ * As for service guarantees, allowing the drive to enqueue more than one
++ * request at a time, and hence delegating de facto final scheduling
++ * decisions to the drive's internal scheduler, causes loss of control on
++ * the actual request service order. In this respect, when the drive is
++ * allowed to enqueue more than one request at a time, the service
++ * distribution enforced by the drive's internal scheduler is likely to
++ * coincide with the desired device-throughput distribution only in the
++ * following, perfectly symmetric, scenario:
++ * 1) all active queues have the same weight,
++ * 2) all active groups at the same level in the groups tree have the same
++ * weight,
++ * 3) all active groups at the same level in the groups tree have the same
++ * number of children.
++ *
++ * Even in such a scenario, sequential I/O may still receive a preferential
++ * treatment, but this is not likely to be a big issue with flash-based
++ * devices, because of their non-dramatic loss of throughput with random
++ * I/O. Things do differ with HDDs, for which additional care is taken, as
++ * explained after completing the discussion for flash-based devices.
++ *
++ * Unfortunately, keeping the necessary state for evaluating exactly the
++ * above symmetry conditions would be quite complex and time-consuming.
++ * Therefore BFQ evaluates instead the following stronger sub-conditions,
++ * for which it is much easier to maintain the needed state:
++ * 1) all active queues have the same weight,
++ * 2) all active groups have the same weight,
++ * 3) all active groups have at most one active child each.
++ * In particular, the last two conditions are always true if hierarchical
++ * support and the cgroups interface are not enabled, hence no state needs
++ * to be maintained in this case.
++ *
++ * According to the above considerations, the second component of the
++ * compound condition evaluates to true if any of the above symmetry
++ * sub-conditions does not hold, or the device is not flash-based. Therefore,
++ * if also the first component is true, then idling is allowed for a sync
++ * queue. These are the only sub-conditions considered if the device is
++ * flash-based, as, for such a device, it is sensible to force idling only
++ * for service-guarantee issues. In fact, as for throughput, idling
++ * NCQ-capable flash-based devices would not boost the throughput even
++ * with sequential I/O; rather it would lower the throughput in proportion
++ * to how fast the device is. In the end, (only) if all the three
++ * sub-conditions hold and the device is flash-based, the compound
++ * condition evaluates to false and therefore no idling is performed.
++ *
++ * As already said, things change with a rotational device, where idling
++ * boosts the throughput with sequential I/O (even with NCQ). Hence, for
++ * such a device the second component of the compound condition evaluates
++ * to true also if the following additional sub-condition does not hold:
++ * the queue is constantly seeky. Unfortunately, this different behavior
++ * with respect to flash-based devices causes an additional asymmetry: if
++ * some sync queues enjoy idling and some other sync queues do not, then
++ * the latter get a low share of the device throughput, simply because the
++ * former get many requests served after being set as in service, whereas
++ * the latter do not. As a consequence, to guarantee the desired throughput
++ * distribution, on HDDs the compound expression evaluates to true (and
++ * hence device idling is performed) also if the following last symmetry
++ * condition does not hold: no other queue is benefiting from idling. Also
++ * this last condition is actually replaced with a simpler-to-maintain and
++ * stronger condition: there is no busy queue which is not constantly seeky
++ * (and hence may also benefit from idling).
++ *
++ * To sum up, when all the required symmetry and throughput-boosting
++ * sub-conditions hold, the second component of the compound condition
++ * evaluates to false, and hence no idling is performed. This helps to
++ * keep the drives' internal queues full on NCQ-capable devices, and hence
++ * to boost the throughput, without causing 'almost' any loss of service
++ * guarantees. The 'almost' follows from the fact that, if the internal
++ * queue of one such device is filled while all the sub-conditions hold,
++ * but at some point in time some sub-condition ceases to hold, then it may
++ * become impossible to let requests be served in the new desired order
++ * until all the requests already queued in the device have been served.
++ */
++static inline bool bfq_bfqq_must_not_expire(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++#define symmetric_scenario (!bfqd->active_numerous_groups && \
++ !bfq_differentiated_weights(bfqd))
++#else
++#define symmetric_scenario (!bfq_differentiated_weights(bfqd))
++#endif
++#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
++ bfqd->busy_in_flight_queues == \
++ bfqd->const_seeky_busy_in_flight_queues)
++/*
++ * Condition for expiring a non-weight-raised queue (and hence not idling
++ * the device).
++ */
++#define cond_for_expiring_non_wr (bfqd->hw_tag && \
++ (bfqd->wr_busy_queues > 0 || \
++ (symmetric_scenario && \
++ (blk_queue_nonrot(bfqd->queue) || \
++ cond_for_seeky_on_ncq_hdd))))
++
++ return bfq_bfqq_sync(bfqq) &&
++ (bfq_bfqq_IO_bound(bfqq) || bfqq->wr_coeff > 1) &&
++ (bfqq->wr_coeff > 1 ||
++ (bfq_bfqq_idle_window(bfqq) &&
++ !cond_for_expiring_non_wr)
++ );
++}
++
++/*
++ * If the in-service queue is empty but sync, and the function
++ * bfq_bfqq_must_not_expire returns true, then:
++ * 1) the queue must remain in service and cannot be expired, and
++ * 2) the disk must be idled to wait for the possible arrival of a new
++ * request for the queue.
++ * See the comments to the function bfq_bfqq_must_not_expire for the reasons
++ * why performing device idling is the best choice to boost the throughput
++ * and preserve service guarantees when bfq_bfqq_must_not_expire itself
++ * returns true.
++ */
++static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ return RB_EMPTY_ROOT(&bfqq->sort_list) && bfqd->bfq_slice_idle != 0 &&
++ bfq_bfqq_must_not_expire(bfqq);
++}
++
++/*
++ * Select a queue for service. If we have a current queue in service,
++ * check whether to continue servicing it, or retrieve and set a new one.
++ */
++static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct request *next_rq;
++ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq == NULL)
++ goto new_queue;
++
++ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
++
++ /*
++ * If another queue has a request waiting within our mean seek
++ * distance, let it run. The expire code will check for close
++ * cooperators and put the close queue at the front of the
++ * service tree. If possible, merge the expiring queue with the
++ * new bfqq.
++ */
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq);
++ if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
++ bfq_setup_merge(bfqq, new_bfqq);
++
++ if (bfq_may_expire_for_budg_timeout(bfqq) &&
++ !timer_pending(&bfqd->idle_slice_timer) &&
++ !bfq_bfqq_must_idle(bfqq))
++ goto expire;
++
++ next_rq = bfqq->next_rq;
++ /*
++ * If bfqq has requests queued and it has enough budget left to
++ * serve them, keep the queue, otherwise expire it.
++ */
++ if (next_rq != NULL) {
++ if (bfq_serv_to_charge(next_rq, bfqq) >
++ bfq_bfqq_budget_left(bfqq)) {
++ reason = BFQ_BFQQ_BUDGET_EXHAUSTED;
++ goto expire;
++ } else {
++ /*
++ * The idle timer may be pending because we may
++ * not disable disk idling even when a new request
++ * arrives.
++ */
++ if (timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * If we get here: 1) at least a new request
++ * has arrived but we have not disabled the
++ * timer because the request was too small,
++ * 2) then the block layer has unplugged
++ * the device, causing the dispatch to be
++ * invoked.
++ *
++ * Since the device is unplugged, now the
++ * requests are probably large enough to
++ * provide a reasonable throughput.
++ * So we disable idling.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++ if (new_bfqq == NULL)
++ goto keep_queue;
++ else
++ goto expire;
++ }
++ }
++
++ /*
++ * No requests pending. If the in-service queue still has requests
++ * in flight (possibly waiting for a completion) or is idling for a
++ * new request, then keep it.
++ */
++ if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ bfqq = NULL;
++ goto keep_queue;
++ } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
++ /*
++ * Expiring the queue because there is a close cooperator,
++ * cancel timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++ }
++
++ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
++new_queue:
++ bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfq_log(bfqd, "select_queue: new queue %d returned",
++ bfqq != NULL ? bfqq->pid : 0);
++keep_queue:
++ return bfqq;
++}
++
++static void bfq_update_wr_data(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq->wr_coeff > 1) { /* queue is being boosted */
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time),
++ bfqq->wr_coeff,
++ bfqq->entity.weight, bfqq->entity.orig_weight);
++
++ BUG_ON(bfqq != bfqd->in_service_queue && entity->weight !=
++ entity->orig_weight * bfqq->wr_coeff);
++ if (entity->ioprio_changed)
++ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++ /*
++ * If too much time has elapsed from the beginning
++ * of this weight-raising, stop it.
++ */
++ if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ bfqq->wr_cur_max_time)) {
++ bfqq->last_wr_start_finish = jiffies;
++ bfq_log_bfqq(bfqd, bfqq,
++ "wrais ending at %lu, rais_max_time %u",
++ bfqq->last_wr_start_finish,
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ bfq_bfqq_end_wr(bfqq);
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
++ }
++ }
++}
++
++/*
++ * Dispatch one request from bfqq, moving it to the request queue
++ * dispatch list.
++ */
++static int bfq_dispatch_request(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++ struct request *rq;
++ unsigned long service_to_charge;
++
++ BUG_ON(RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ /* Follow expired path, else get first next available. */
++ rq = bfq_check_fifo(bfqq);
++ if (rq == NULL)
++ rq = bfqq->next_rq;
++ service_to_charge = bfq_serv_to_charge(rq, bfqq);
++
++ if (service_to_charge > bfq_bfqq_budget_left(bfqq)) {
++ /*
++ * This may happen if the next rq is chosen in fifo order
++ * instead of sector order. The budget is properly
++ * dimensioned to be always sufficient to serve the next
++ * request only if it is chosen in sector order. The reason
++ * is that it would be quite inefficient and of little use
++ * to always make sure that the budget is large enough to
++ * serve even the possible next rq in fifo order.
++ * In fact, requests are seldom served in fifo order.
++ *
++ * Expire the queue for budget exhaustion, and make sure
++ * that the next act_budget is enough to serve the next
++ * request, even if it comes from the fifo expired path.
++ */
++ bfqq->next_rq = rq;
++ /*
++ * Since this dispatch failed, make sure that
++ * a new one will be performed.
++ */
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++ goto expire;
++ }
++
++ /* Finally, insert request into driver dispatch list. */
++ bfq_bfqq_served(bfqq, service_to_charge);
++ bfq_dispatch_insert(bfqd->queue, rq);
++
++ bfq_update_wr_data(bfqd, bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "dispatched %u sec req (%llu), budg left %lu",
++ blk_rq_sectors(rq),
++ (unsigned long long)blk_rq_pos(rq),
++ bfq_bfqq_budget_left(bfqq));
++
++ dispatched++;
++
++ if (bfqd->in_service_bic == NULL) {
++ atomic_long_inc(&RQ_BIC(rq)->icq.ioc->refcount);
++ bfqd->in_service_bic = RQ_BIC(rq);
++ }
++
++ if (bfqd->busy_queues > 1 && ((!bfq_bfqq_sync(bfqq) &&
++ dispatched >= bfqd->bfq_max_budget_async_rq) ||
++ bfq_class_idle(bfqq)))
++ goto expire;
++
++ return dispatched;
++
++expire:
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_EXHAUSTED);
++ return dispatched;
++}
++
++static int __bfq_forced_dispatch_bfqq(struct bfq_queue *bfqq)
++{
++ int dispatched = 0;
++
++ while (bfqq->next_rq != NULL) {
++ bfq_dispatch_insert(bfqq->bfqd->queue, bfqq->next_rq);
++ dispatched++;
++ }
++
++ BUG_ON(!list_empty(&bfqq->fifo));
++ return dispatched;
++}
++
++/*
++ * Drain our current requests.
++ * Used for barriers and when switching io schedulers on-the-fly.
++ */
++static int bfq_forced_dispatch(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq, *n;
++ struct bfq_service_tree *st;
++ int dispatched = 0;
++
++ bfqq = bfqd->in_service_queue;
++ if (bfqq != NULL)
++ __bfq_bfqq_expire(bfqd, bfqq);
++
++ /*
++ * Loop through classes, and be careful to leave the scheduler
++ * in a consistent state, as feedback mechanisms and vtime
++ * updates cannot be disabled during the process.
++ */
++ list_for_each_entry_safe(bfqq, n, &bfqd->active_list, bfqq_list) {
++ st = bfq_entity_service_tree(&bfqq->entity);
++
++ dispatched += __bfq_forced_dispatch_bfqq(bfqq);
++ bfqq->max_budget = bfq_max_budget(bfqd);
++
++ bfq_forget_idle(st);
++ }
++
++ BUG_ON(bfqd->busy_queues != 0);
++
++ return dispatched;
++}
++
++static int bfq_dispatch_requests(struct request_queue *q, int force)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq;
++ int max_dispatch;
++
++ bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
++ if (bfqd->busy_queues == 0)
++ return 0;
++
++ if (unlikely(force))
++ return bfq_forced_dispatch(bfqd);
++
++ bfqq = bfq_select_queue(bfqd);
++ if (bfqq == NULL)
++ return 0;
++
++ max_dispatch = bfqd->bfq_quantum;
++ if (bfq_class_idle(bfqq))
++ max_dispatch = 1;
++
++ if (!bfq_bfqq_sync(bfqq))
++ max_dispatch = bfqd->bfq_max_budget_async_rq;
++
++ if (bfqq->dispatched >= max_dispatch) {
++ if (bfqd->busy_queues > 1)
++ return 0;
++ if (bfqq->dispatched >= 4 * max_dispatch)
++ return 0;
++ }
++
++ if (bfqd->sync_flight != 0 && !bfq_bfqq_sync(bfqq))
++ return 0;
++
++ bfq_clear_bfqq_wait_request(bfqq);
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ if (!bfq_dispatch_request(bfqd, bfqq))
++ return 0;
++
++ bfq_log_bfqq(bfqd, bfqq, "dispatched one request of %d (max_disp %d)",
++ bfqq->pid, max_dispatch);
++
++ return 1;
++}
++
++/*
++ * Task holds one reference to the queue, dropped when task exits. Each rq
++ * in-flight on this queue also holds a reference, dropped when rq is freed.
++ *
++ * Queue lock must be held here.
++ */
++static void bfq_put_queue(struct bfq_queue *bfqq)
++{
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ BUG_ON(atomic_read(&bfqq->ref) <= 0);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p %d", bfqq,
++ atomic_read(&bfqq->ref));
++ if (!atomic_dec_and_test(&bfqq->ref))
++ return;
++
++ BUG_ON(rb_first(&bfqq->sort_list) != NULL);
++ BUG_ON(bfqq->allocated[READ] + bfqq->allocated[WRITE] != 0);
++ BUG_ON(bfqq->entity.tree != NULL);
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqd->in_service_queue == bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
++
++ kmem_cache_free(bfq_pool, bfqq);
++}
++
++static void bfq_put_cooperator(struct bfq_queue *bfqq)
++{
++ struct bfq_queue *__bfqq, *next;
++
++ /*
++ * If this queue was scheduled to merge with another queue, be
++ * sure to drop the reference taken on that queue (and others in
++ * the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.
++ */
++ __bfqq = bfqq->new_bfqq;
++ while (__bfqq) {
++ if (__bfqq == bfqq)
++ break;
++ next = __bfqq->new_bfqq;
++ bfq_put_queue(__bfqq);
++ __bfqq = next;
++ }
++}
++
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ if (bfqq == bfqd->in_service_queue) {
++ __bfq_bfqq_expire(bfqd, bfqq);
++ bfq_schedule_dispatch(bfqd);
++ }
++
++ bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_init_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++
++ bic->ttime.last_end_request = jiffies;
++}
++
++static void bfq_exit_icq(struct io_cq *icq)
++{
++ struct bfq_io_cq *bic = icq_to_bic(icq);
++ struct bfq_data *bfqd = bic_to_bfqd(bic);
++
++ if (bic->bfqq[BLK_RW_ASYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_ASYNC]);
++ bic->bfqq[BLK_RW_ASYNC] = NULL;
++ }
++
++ if (bic->bfqq[BLK_RW_SYNC]) {
++ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
++ bic->bfqq[BLK_RW_SYNC] = NULL;
++ }
++}
++
++/*
++ * Update the entity prio values; note that the new values will not
++ * be used until the next (re)activation.
++ */
++static void bfq_init_prio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ struct task_struct *tsk = current;
++ int ioprio_class;
++
++ if (!bfq_bfqq_prio_changed(bfqq))
++ return;
++
++ ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ switch (ioprio_class) {
++ default:
++ dev_err(bfqq->bfqd->queue->backing_dev_info.dev,
++ "bfq: bad prio %x\n", ioprio_class);
++ /* fall through */
++ case IOPRIO_CLASS_NONE:
++ /*
++ * No prio set, inherit CPU scheduling settings.
++ */
++ bfqq->entity.new_ioprio = task_nice_ioprio(tsk);
++ bfqq->entity.new_ioprio_class = task_nice_ioclass(tsk);
++ break;
++ case IOPRIO_CLASS_RT:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_RT;
++ break;
++ case IOPRIO_CLASS_BE:
++ bfqq->entity.new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_BE;
++ break;
++ case IOPRIO_CLASS_IDLE:
++ bfqq->entity.new_ioprio_class = IOPRIO_CLASS_IDLE;
++ bfqq->entity.new_ioprio = 7;
++ bfq_clear_bfqq_idle_window(bfqq);
++ break;
++ }
++
++ bfqq->entity.ioprio_changed = 1;
++
++ bfq_clear_bfqq_prio_changed(bfqq);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic)
++{
++ struct bfq_data *bfqd;
++ struct bfq_queue *bfqq, *new_bfqq;
++ struct bfq_group *bfqg;
++ unsigned long uninitialized_var(flags);
++ int ioprio = bic->icq.ioc->ioprio;
++
++ bfqd = bfq_get_bfqd_locked(&(bic->icq.q->elevator->elevator_data),
++ &flags);
++ /*
++ * This condition may trigger on a newly created bic, be sure to
++ * drop the lock before returning.
++ */
++ if (unlikely(bfqd == NULL) || likely(bic->ioprio == ioprio))
++ goto out;
++
++ bfqq = bic->bfqq[BLK_RW_ASYNC];
++ if (bfqq != NULL) {
++ bfqg = container_of(bfqq->entity.sched_data, struct bfq_group,
++ sched_data);
++ new_bfqq = bfq_get_queue(bfqd, bfqg, BLK_RW_ASYNC, bic,
++ GFP_ATOMIC);
++ if (new_bfqq != NULL) {
++ bic->bfqq[BLK_RW_ASYNC] = new_bfqq;
++ bfq_log_bfqq(bfqd, bfqq,
++ "changed_ioprio: bfqq %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++ }
++
++ bfqq = bic->bfqq[BLK_RW_SYNC];
++ if (bfqq != NULL)
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ bic->ioprio = ioprio;
++
++out:
++ bfq_put_bfqd_unlock(bfqd, &flags);
++}
++
++static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ pid_t pid, int is_sync)
++{
++ RB_CLEAR_NODE(&bfqq->entity.rb_node);
++ INIT_LIST_HEAD(&bfqq->fifo);
++
++ atomic_set(&bfqq->ref, 0);
++ bfqq->bfqd = bfqd;
++
++ bfq_mark_bfqq_prio_changed(bfqq);
++
++ if (is_sync) {
++ if (!bfq_class_idle(bfqq))
++ bfq_mark_bfqq_idle_window(bfqq);
++ bfq_mark_bfqq_sync(bfqq);
++ }
++ bfq_mark_bfqq_IO_bound(bfqq);
++
++ /* Tentative initial value to trade off between thr and lat */
++ bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;
++ bfqq->pid = pid;
++
++ bfqq->wr_coeff = 1;
++ bfqq->last_wr_start_finish = 0;
++ /*
++ * Set to the value for which bfqq will not be deemed as
++ * soft rt when it becomes backlogged.
++ */
++ bfqq->soft_rt_next_start = bfq_infinity_from_now(jiffies);
++}
++
++static struct bfq_queue *bfq_find_alloc_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int is_sync,
++ struct bfq_io_cq *bic,
++ gfp_t gfp_mask)
++{
++ struct bfq_queue *bfqq, *new_bfqq = NULL;
++
++retry:
++ /* bic always exists here */
++ bfqq = bic_to_bfqq(bic, is_sync);
++
++ /*
++ * If we originally fell back to the OOM bfqq, always try
++ * a new alloc, since the OOM condition should be just a
++ * temporary situation.
++ */
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = NULL;
++ if (new_bfqq != NULL) {
++ bfqq = new_bfqq;
++ new_bfqq = NULL;
++ } else if (gfp_mask & __GFP_WAIT) {
++ spin_unlock_irq(bfqd->queue->queue_lock);
++ new_bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ spin_lock_irq(bfqd->queue->queue_lock);
++ if (new_bfqq != NULL)
++ goto retry;
++ } else {
++ bfqq = kmem_cache_alloc_node(bfq_pool,
++ gfp_mask | __GFP_ZERO,
++ bfqd->queue->node);
++ }
++
++ if (bfqq != NULL) {
++ bfq_init_bfqq(bfqd, bfqq, current->pid, is_sync);
++ bfq_log_bfqq(bfqd, bfqq, "allocated");
++ } else {
++ bfqq = &bfqd->oom_bfqq;
++ bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");
++ }
++
++ bfq_init_prio_data(bfqq, bic);
++ bfq_init_entity(&bfqq->entity, bfqg);
++ }
++
++ if (new_bfqq != NULL)
++ kmem_cache_free(bfq_pool, new_bfqq);
++
++ return bfqq;
++}
++
++static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
++ struct bfq_group *bfqg,
++ int ioprio_class, int ioprio)
++{
++ switch (ioprio_class) {
++ case IOPRIO_CLASS_RT:
++ return &bfqg->async_bfqq[0][ioprio];
++ case IOPRIO_CLASS_NONE:
++ ioprio = IOPRIO_NORM;
++ /* fall through */
++ case IOPRIO_CLASS_BE:
++ return &bfqg->async_bfqq[1][ioprio];
++ case IOPRIO_CLASS_IDLE:
++ return &bfqg->async_idle_bfqq;
++ default:
++ BUG();
++ }
++}
++
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask)
++{
++ const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
++ const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
++ struct bfq_queue **async_bfqq = NULL;
++ struct bfq_queue *bfqq = NULL;
++
++ if (!is_sync) {
++ async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,
++ ioprio);
++ bfqq = *async_bfqq;
++ }
++
++ if (bfqq == NULL)
++ bfqq = bfq_find_alloc_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++
++ /*
++ * Pin the queue now that it's allocated, scheduler exit will
++ * prune it.
++ */
++ if (!is_sync && *async_bfqq == NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ *async_bfqq = bfqq;
++ }
++
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "get_queue, at end: %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++ return bfqq;
++}
++
++static void bfq_update_io_thinktime(struct bfq_data *bfqd,
++ struct bfq_io_cq *bic)
++{
++ unsigned long elapsed = jiffies - bic->ttime.last_end_request;
++ unsigned long ttime = min(elapsed, 2UL * bfqd->bfq_slice_idle);
++
++ bic->ttime.ttime_samples = (7*bic->ttime.ttime_samples + 256) / 8;
++ bic->ttime.ttime_total = (7*bic->ttime.ttime_total + 256*ttime) / 8;
++ bic->ttime.ttime_mean = (bic->ttime.ttime_total + 128) /
++ bic->ttime.ttime_samples;
++}
++
++static void bfq_update_io_seektime(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ sector_t sdist;
++ u64 total;
++
++ if (bfqq->last_request_pos < blk_rq_pos(rq))
++ sdist = blk_rq_pos(rq) - bfqq->last_request_pos;
++ else
++ sdist = bfqq->last_request_pos - blk_rq_pos(rq);
++
++ /*
++ * Don't allow the seek distance to get too large from the
++ * odd fragment, pagein, etc.
++ */
++ if (bfqq->seek_samples == 0) /* first request, not really a seek */
++ sdist = 0;
++ else if (bfqq->seek_samples <= 60) /* second & third seek */
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*1024);
++ else
++ sdist = min(sdist, (bfqq->seek_mean * 4) + 2*1024*64);
++
++ bfqq->seek_samples = (7*bfqq->seek_samples + 256) / 8;
++ bfqq->seek_total = (7*bfqq->seek_total + (u64)256*sdist) / 8;
++ total = bfqq->seek_total + (bfqq->seek_samples/2);
++ do_div(total, bfqq->seek_samples);
++ bfqq->seek_mean = (sector_t)total;
++
++ bfq_log_bfqq(bfqd, bfqq, "dist=%llu mean=%llu", (u64)sdist,
++ (u64)bfqq->seek_mean);
++}
++
++/*
++ * Disable idle window if the process thinks too long or seeks so much that
++ * it doesn't matter.
++ */
++static void bfq_update_idle_window(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq,
++ struct bfq_io_cq *bic)
++{
++ int enable_idle;
++
++ /* Don't idle for async or idle io prio class. */
++ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
++ return;
++
++ enable_idle = bfq_bfqq_idle_window(bfqq);
++
++ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
++ bfqd->bfq_slice_idle == 0 ||
++ (bfqd->hw_tag && BFQQ_SEEKY(bfqq) &&
++ bfqq->wr_coeff == 1))
++ enable_idle = 0;
++ else if (bfq_sample_valid(bic->ttime.ttime_samples)) {
++ if (bic->ttime.ttime_mean > bfqd->bfq_slice_idle &&
++ bfqq->wr_coeff == 1)
++ enable_idle = 0;
++ else
++ enable_idle = 1;
++ }
++ bfq_log_bfqq(bfqd, bfqq, "update_idle_window: enable_idle %d",
++ enable_idle);
++
++ if (enable_idle)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++}
++
++/*
++ * Called when a new fs request (rq) is added to bfqq. Check if there's
++ * something we should do about it.
++ */
++static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ struct request *rq)
++{
++ struct bfq_io_cq *bic = RQ_BIC(rq);
++
++ if (rq->cmd_flags & REQ_META)
++ bfqq->meta_pending++;
++
++ bfq_update_io_thinktime(bfqd, bic);
++ bfq_update_io_seektime(bfqd, bfqq, rq);
++ if (!BFQQ_SEEKY(bfqq) && bfq_bfqq_constantly_seeky(bfqq)) {
++ bfq_clear_bfqq_constantly_seeky(bfqq);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
++ !BFQQ_SEEKY(bfqq))
++ bfq_update_idle_window(bfqd, bfqq, bic);
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
++ bfq_bfqq_idle_window(bfqq), BFQQ_SEEKY(bfqq),
++ (unsigned long long)bfqq->seek_mean);
++
++ bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
++
++ if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
++ int small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&
++ blk_rq_sectors(rq) < 32;
++ int budget_timeout = bfq_bfqq_budget_timeout(bfqq);
++
++ /*
++ * There is just this request queued: if the request
++ * is small and the queue is not to be expired, then
++ * just exit.
++ *
++ * In this way, if the disk is being idled to wait for
++ * a new request from the in-service queue, we avoid
++ * unplugging the device and committing the disk to serve
++ * just a small request. On the contrary, we wait for
++ * the block layer to decide when to unplug the device:
++ * hopefully, new requests will be merged to this one
++ * quickly, then the device will be unplugged and
++ * larger requests will be dispatched.
++ */
++ if (small_req && !budget_timeout)
++ return;
++
++ /*
++ * A large enough request arrived, or the queue is to
++ * be expired: in both cases disk idling is to be
++ * stopped, so clear wait_request flag and reset
++ * timer.
++ */
++ bfq_clear_bfqq_wait_request(bfqq);
++ del_timer(&bfqd->idle_slice_timer);
++
++ /*
++ * The queue is not empty, because a new request just
++ * arrived. Hence we can safely expire the queue, in
++ * case of budget timeout, without risking that the
++ * timestamps of the queue are not updated correctly.
++ * See [1] for more details.
++ */
++ if (budget_timeout)
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++
++ /*
++ * Let the request rip immediately, or let a new queue be
++ * selected if bfqq has just been expired.
++ */
++ __blk_run_queue(bfqd->queue);
++ }
++}
++
++static void bfq_insert_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ assert_spin_locked(bfqd->queue->queue_lock);
++ bfq_init_prio_data(bfqq, RQ_BIC(rq));
++
++ bfq_add_request(rq);
++
++ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
++ list_add_tail(&rq->queuelist, &bfqq->fifo);
++
++ bfq_rq_enqueued(bfqd, bfqq, rq);
++}
++
++static void bfq_update_hw_tag(struct bfq_data *bfqd)
++{
++ bfqd->max_rq_in_driver = max(bfqd->max_rq_in_driver,
++ bfqd->rq_in_driver);
++
++ if (bfqd->hw_tag == 1)
++ return;
++
++ /*
++ * This sample is valid if the number of outstanding requests
++ * is large enough to allow a queueing behavior. Note that the
++ * sum is not exact, as it's not taking into account deactivated
++ * requests.
++ */
++ if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
++ return;
++
++ if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
++ return;
++
++ bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
++ bfqd->max_rq_in_driver = 0;
++ bfqd->hw_tag_samples = 0;
++}
++
++static void bfq_completed_request(struct request_queue *q, struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_data *bfqd = bfqq->bfqd;
++ bool sync = bfq_bfqq_sync(bfqq);
++
++ bfq_log_bfqq(bfqd, bfqq, "completed one req with %u sects left (%d)",
++ blk_rq_sectors(rq), sync);
++
++ bfq_update_hw_tag(bfqd);
++
++ BUG_ON(!bfqd->rq_in_driver);
++ BUG_ON(!bfqq->dispatched);
++ bfqd->rq_in_driver--;
++ bfqq->dispatched--;
++
++ if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++
++ if (sync) {
++ bfqd->sync_flight--;
++ RQ_BIC(rq)->ttime.last_end_request = jiffies;
++ }
++
++ /*
++ * If we are waiting to discover whether the request pattern of the
++ * task associated with the queue is actually isochronous, and
++ * both requisites for this condition to hold are satisfied, then
++ * compute soft_rt_next_start (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ */
++ if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
++ RB_EMPTY_ROOT(&bfqq->sort_list))
++ bfqq->soft_rt_next_start =
++ bfq_bfqq_softrt_next_start(bfqd, bfqq);
++
++ /*
++ * If this is the in-service queue, check if it needs to be expired,
++ * or if we want to idle in case it has no pending requests.
++ */
++ if (bfqd->in_service_queue == bfqq) {
++ if (bfq_bfqq_budget_new(bfqq))
++ bfq_set_budget_timeout(bfqd);
++
++ if (bfq_bfqq_must_idle(bfqq)) {
++ bfq_arm_slice_timer(bfqd);
++ goto out;
++ } else if (bfq_may_expire_for_budg_timeout(bfqq))
++ bfq_bfqq_expire(bfqd, bfqq, 0, BFQ_BFQQ_BUDGET_TIMEOUT);
++ else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&
++ (bfqq->dispatched == 0 ||
++ !bfq_bfqq_must_not_expire(bfqq)))
++ bfq_bfqq_expire(bfqd, bfqq, 0,
++ BFQ_BFQQ_NO_MORE_REQUESTS);
++ }
++
++ if (!bfqd->rq_in_driver)
++ bfq_schedule_dispatch(bfqd);
++
++out:
++ return;
++}
++
++static inline int __bfq_may_queue(struct bfq_queue *bfqq)
++{
++ if (bfq_bfqq_wait_request(bfqq) && bfq_bfqq_must_alloc(bfqq)) {
++ bfq_clear_bfqq_must_alloc(bfqq);
++ return ELV_MQUEUE_MUST;
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++static int bfq_may_queue(struct request_queue *q, int rw)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct task_struct *tsk = current;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq;
++
++ /*
++ * Don't force setup of a queue from here, as a call to may_queue
++ * does not necessarily imply that a request actually will be
++ * queued. So just lookup a possibly existing queue, or return
++ * 'may queue' if that fails.
++ */
++ bic = bfq_bic_lookup(bfqd, tsk->io_context);
++ if (bic == NULL)
++ return ELV_MQUEUE_MAY;
++
++ bfqq = bic_to_bfqq(bic, rw_is_sync(rw));
++ if (bfqq != NULL) {
++ bfq_init_prio_data(bfqq, bic);
++
++ return __bfq_may_queue(bfqq);
++ }
++
++ return ELV_MQUEUE_MAY;
++}
++
++/*
++ * Queue lock held here.
++ */
++static void bfq_put_request(struct request *rq)
++{
++ struct bfq_queue *bfqq = RQ_BFQQ(rq);
++
++ if (bfqq != NULL) {
++ const int rw = rq_data_dir(rq);
++
++ BUG_ON(!bfqq->allocated[rw]);
++ bfqq->allocated[rw]--;
++
++ rq->elv.priv[0] = NULL;
++ rq->elv.priv[1] = NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "put_request %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++static struct bfq_queue *
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (unsigned long)bfqq->new_bfqq->pid);
++ bic_set_bfqq(bic, bfqq->new_bfqq, 1);
++ bfq_mark_bfqq_coop(bfqq->new_bfqq);
++ bfq_put_queue(bfqq);
++ return bic_to_bfqq(bic, 1);
++}
++
++/*
++ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
++ * was the last process referring to said bfqq.
++ */
++static struct bfq_queue *
++bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
++{
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->pid = current->pid;
++ bfq_clear_bfqq_coop(bfqq);
++ bfq_clear_bfqq_split_coop(bfqq);
++ return bfqq;
++ }
++
++ bic_set_bfqq(bic, NULL, 1);
++
++ bfq_put_cooperator(bfqq);
++
++ bfq_put_queue(bfqq);
++ return NULL;
++}
++
++/*
++ * Allocate bfq data structures associated with this request.
++ */
++static int bfq_set_request(struct request_queue *q, struct request *rq,
++ struct bio *bio, gfp_t gfp_mask)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
++ const int rw = rq_data_dir(rq);
++ const int is_sync = rq_is_sync(rq);
++ struct bfq_queue *bfqq;
++ struct bfq_group *bfqg;
++ unsigned long flags;
++
++ might_sleep_if(gfp_mask & __GFP_WAIT);
++
++ /*
++ * Check bic before dereferencing it in bfq_changed_ioprio(),
++ * and take the queue lock before jumping to queue_fail, which
++ * unlocks it.
++ */
++ if (bic == NULL) {
++ spin_lock_irqsave(q->queue_lock, flags);
++ goto queue_fail;
++ }
++
++ bfq_changed_ioprio(bic);
++
++ spin_lock_irqsave(q->queue_lock, flags);
++
++ bfqg = bfq_bic_update_cgroup(bic);
++
++new_queue:
++ bfqq = bic_to_bfqq(bic, is_sync);
++ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
++ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
++ bic_set_bfqq(bic, bfqq, is_sync);
++ } else {
++ /*
++ * If the queue was seeky for too long, break it apart.
++ */
++ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
++ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
++ bfqq = bfq_split_bfqq(bic, bfqq);
++ if (!bfqq)
++ goto new_queue;
++ }
++
++ /*
++ * Check to see if this queue is scheduled to merge with
++ * another closely cooperating queue. The merging of queues
++ * happens here as it must be done in process context.
++ * The reference on new_bfqq was taken in merge_bfqqs.
++ */
++ if (bfqq->new_bfqq != NULL)
++ bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
++ }
++
++ bfqq->allocated[rw]++;
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqd, bfqq, "set_request: bfqq %p, %d", bfqq,
++ atomic_read(&bfqq->ref));
++
++ rq->elv.priv[0] = bic;
++ rq->elv.priv[1] = bfqq;
++
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 0;
++
++queue_fail:
++ bfq_schedule_dispatch(bfqd);
++ spin_unlock_irqrestore(q->queue_lock, flags);
++
++ return 1;
++}
++
++static void bfq_kick_queue(struct work_struct *work)
++{
++ struct bfq_data *bfqd =
++ container_of(work, struct bfq_data, unplug_work);
++ struct request_queue *q = bfqd->queue;
++
++ spin_lock_irq(q->queue_lock);
++ __blk_run_queue(q);
++ spin_unlock_irq(q->queue_lock);
++}
++
++/*
++ * Handler of the expiration of the timer running if the in-service queue
++ * is idling inside its time slice.
++ */
++static void bfq_idle_slice_timer(unsigned long data)
++{
++ struct bfq_data *bfqd = (struct bfq_data *)data;
++ struct bfq_queue *bfqq;
++ unsigned long flags;
++ enum bfqq_expiration reason;
++
++ spin_lock_irqsave(bfqd->queue->queue_lock, flags);
++
++ bfqq = bfqd->in_service_queue;
++ /*
++ * Theoretical race here: the in-service queue can be NULL or
++ * different from the queue that was idling if the timer handler
++ * spins on the queue_lock and a new request arrives for the
++ * current queue and there is a full dispatch cycle that changes
++ * the in-service queue. This can hardly happen, but in the worst
++ * case we just expire a queue too early.
++ */
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqd, bfqq, "slice_timer expired");
++ if (bfq_bfqq_budget_timeout(bfqq))
++ /*
++ * Also here the queue can be safely expired
++ * for budget timeout without wasting
++ * guarantees
++ */
++ reason = BFQ_BFQQ_BUDGET_TIMEOUT;
++ else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
++ /*
++ * The queue may not be empty upon timer expiration,
++ * because we may not disable the timer when the
++ * first request of the in-service queue arrives
++ * during disk idling.
++ */
++ reason = BFQ_BFQQ_TOO_IDLE;
++ else
++ goto schedule_dispatch;
++
++ bfq_bfqq_expire(bfqd, bfqq, 1, reason);
++ }
++
++schedule_dispatch:
++ bfq_schedule_dispatch(bfqd);
++
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, flags);
++}
++
++static void bfq_shutdown_timer_wq(struct bfq_data *bfqd)
++{
++ del_timer_sync(&bfqd->idle_slice_timer);
++ cancel_work_sync(&bfqd->unplug_work);
++}
++
++static inline void __bfq_put_async_bfqq(struct bfq_data *bfqd,
++ struct bfq_queue **bfqq_ptr)
++{
++ struct bfq_group *root_group = bfqd->root_group;
++ struct bfq_queue *bfqq = *bfqq_ptr;
++
++ bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
++ if (bfqq != NULL) {
++ bfq_bfqq_move(bfqd, bfqq, &bfqq->entity, root_group);
++ bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ *bfqq_ptr = NULL;
++ }
++}
++
++/*
++ * Release all the bfqg references to its async queues. If we are
++ * deallocating the group these queues may still contain requests, so
++ * we reparent them to the root cgroup (i.e., the only one that will
++ * exist for sure until all the requests on a device are gone).
++ */
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
++{
++ int i, j;
++
++ for (i = 0; i < 2; i++)
++ for (j = 0; j < IOPRIO_BE_NR; j++)
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j]);
++
++ __bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq);
++}
++
++static void bfq_exit_queue(struct elevator_queue *e)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ struct request_queue *q = bfqd->queue;
++ struct bfq_queue *bfqq, *n;
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ spin_lock_irq(q->queue_lock);
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++ list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)
++ bfq_deactivate_bfqq(bfqd, bfqq, 0);
++
++ bfq_disconnect_groups(bfqd);
++ spin_unlock_irq(q->queue_lock);
++
++ bfq_shutdown_timer_wq(bfqd);
++
++ synchronize_rcu();
++
++ BUG_ON(timer_pending(&bfqd->idle_slice_timer));
++
++ bfq_free_root_group(bfqd);
++ kfree(bfqd);
++}
++
++static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
++{
++ struct bfq_group *bfqg;
++ struct bfq_data *bfqd;
++ struct elevator_queue *eq;
++
++ eq = elevator_alloc(q, e);
++ if (eq == NULL)
++ return -ENOMEM;
++
++ bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);
++ if (bfqd == NULL) {
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++ eq->elevator_data = bfqd;
++
++ /*
++ * Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.
++ * Grab a permanent reference to it, so that the normal code flow
++ * will not attempt to free it.
++ */
++ bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, 1, 0);
++ atomic_inc(&bfqd->oom_bfqq.ref);
++
++ bfqd->queue = q;
++
++ spin_lock_irq(q->queue_lock);
++ q->elevator = eq;
++ spin_unlock_irq(q->queue_lock);
++
++ bfqg = bfq_alloc_root_group(bfqd, q->node);
++ if (bfqg == NULL) {
++ kfree(bfqd);
++ kobject_put(&eq->kobj);
++ return -ENOMEM;
++ }
++
++ bfqd->root_group = bfqg;
++#ifdef CONFIG_CGROUP_BFQIO
++ bfqd->active_numerous_groups = 0;
++#endif
++
++ init_timer(&bfqd->idle_slice_timer);
++ bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
++ bfqd->idle_slice_timer.data = (unsigned long)bfqd;
++
++ bfqd->rq_pos_tree = RB_ROOT;
++ bfqd->queue_weights_tree = RB_ROOT;
++ bfqd->group_weights_tree = RB_ROOT;
++
++ INIT_WORK(&bfqd->unplug_work, bfq_kick_queue);
++
++ INIT_LIST_HEAD(&bfqd->active_list);
++ INIT_LIST_HEAD(&bfqd->idle_list);
++
++ bfqd->hw_tag = -1;
++
++ bfqd->bfq_max_budget = bfq_default_max_budget;
++
++ bfqd->bfq_quantum = bfq_quantum;
++ bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];
++ bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];
++ bfqd->bfq_back_max = bfq_back_max;
++ bfqd->bfq_back_penalty = bfq_back_penalty;
++ bfqd->bfq_slice_idle = bfq_slice_idle;
++ bfqd->bfq_class_idle_last_service = 0;
++ bfqd->bfq_max_budget_async_rq = bfq_max_budget_async_rq;
++ bfqd->bfq_timeout[BLK_RW_ASYNC] = bfq_timeout_async;
++ bfqd->bfq_timeout[BLK_RW_SYNC] = bfq_timeout_sync;
++
++ bfqd->bfq_coop_thresh = 2;
++ bfqd->bfq_failed_cooperations = 7000;
++ bfqd->bfq_requests_within_timer = 120;
++
++ bfqd->low_latency = true;
++
++ bfqd->bfq_wr_coeff = 20;
++ bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);
++ bfqd->bfq_wr_max_time = 0;
++ bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);
++ bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);
++ bfqd->bfq_wr_max_softrt_rate = 7000; /*
++ * Approximate rate required
++ * to playback or record a
++ * high-definition compressed
++ * video.
++ */
++ bfqd->wr_busy_queues = 0;
++ bfqd->busy_in_flight_queues = 0;
++ bfqd->const_seeky_busy_in_flight_queues = 0;
++
++ /*
++ * Begin by assuming, optimistically, that the device peak rate is
++ * equal to the highest reference rate.
++ */
++ bfqd->RT_prod = R_fast[blk_queue_nonrot(bfqd->queue)] *
++ T_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->peak_rate = R_fast[blk_queue_nonrot(bfqd->queue)];
++ bfqd->device_speed = BFQ_BFQD_FAST;
++
++ return 0;
++}
++
++static void bfq_slab_kill(void)
++{
++ if (bfq_pool != NULL)
++ kmem_cache_destroy(bfq_pool);
++}
++
++static int __init bfq_slab_setup(void)
++{
++ bfq_pool = KMEM_CACHE(bfq_queue, 0);
++ if (bfq_pool == NULL)
++ return -ENOMEM;
++ return 0;
++}
++
++static ssize_t bfq_var_show(unsigned int var, char *page)
++{
++ return sprintf(page, "%d\n", var);
++}
++
++static ssize_t bfq_var_store(unsigned long *var, const char *page,
++ size_t count)
++{
++ unsigned long new_val;
++ int ret = kstrtoul(page, 10, &new_val);
++
++ if (ret == 0)
++ *var = new_val;
++
++ return count;
++}
++
++static ssize_t bfq_wr_max_time_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ return sprintf(page, "%d\n", bfqd->bfq_wr_max_time > 0 ?
++ jiffies_to_msecs(bfqd->bfq_wr_max_time) :
++ jiffies_to_msecs(bfq_wr_duration(bfqd)));
++}
++
++static ssize_t bfq_weights_show(struct elevator_queue *e, char *page)
++{
++ struct bfq_queue *bfqq;
++ struct bfq_data *bfqd = e->elevator_data;
++ ssize_t num_char = 0;
++
++ num_char += sprintf(page + num_char, "Tot reqs queued %d\n\n",
++ bfqd->queued);
++
++ spin_lock_irq(bfqd->queue->queue_lock);
++
++ num_char += sprintf(page + num_char, "Active:\n");
++ list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, nr_queued %d %d, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ bfqq->queued[0],
++ bfqq->queued[1],
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ num_char += sprintf(page + num_char, "Idle:\n");
++ list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list) {
++ num_char += sprintf(page + num_char,
++ "pid%d: weight %hu, dur %d/%u\n",
++ bfqq->pid,
++ bfqq->entity.weight,
++ jiffies_to_msecs(jiffies -
++ bfqq->last_wr_start_finish),
++ jiffies_to_msecs(bfqq->wr_cur_max_time));
++ }
++
++ spin_unlock_irq(bfqd->queue->queue_lock);
++
++ return num_char;
++}
++
++#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \
++static ssize_t __FUNC(struct elevator_queue *e, char *page) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned int __data = __VAR; \
++ if (__CONV) \
++ __data = jiffies_to_msecs(__data); \
++ return bfq_var_show(__data, (page)); \
++}
++SHOW_FUNCTION(bfq_quantum_show, bfqd->bfq_quantum, 0);
++SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 1);
++SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 1);
++SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);
++SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);
++SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 1);
++SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);
++SHOW_FUNCTION(bfq_max_budget_async_rq_show,
++ bfqd->bfq_max_budget_async_rq, 0);
++SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout[BLK_RW_SYNC], 1);
++SHOW_FUNCTION(bfq_timeout_async_show, bfqd->bfq_timeout[BLK_RW_ASYNC], 1);
++SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);
++SHOW_FUNCTION(bfq_wr_coeff_show, bfqd->bfq_wr_coeff, 0);
++SHOW_FUNCTION(bfq_wr_rt_max_time_show, bfqd->bfq_wr_rt_max_time, 1);
++SHOW_FUNCTION(bfq_wr_min_idle_time_show, bfqd->bfq_wr_min_idle_time, 1);
++SHOW_FUNCTION(bfq_wr_min_inter_arr_async_show, bfqd->bfq_wr_min_inter_arr_async,
++ 1);
++SHOW_FUNCTION(bfq_wr_max_softrt_rate_show, bfqd->bfq_wr_max_softrt_rate, 0);
++#undef SHOW_FUNCTION
++
++#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \
++static ssize_t \
++__FUNC(struct elevator_queue *e, const char *page, size_t count) \
++{ \
++ struct bfq_data *bfqd = e->elevator_data; \
++ unsigned long uninitialized_var(__data); \
++ int ret = bfq_var_store(&__data, (page), count); \
++ if (__data < (MIN)) \
++ __data = (MIN); \
++ else if (__data > (MAX)) \
++ __data = (MAX); \
++ if (__CONV) \
++ *(__PTR) = msecs_to_jiffies(__data); \
++ else \
++ *(__PTR) = __data; \
++ return ret; \
++}
++STORE_FUNCTION(bfq_quantum_store, &bfqd->bfq_quantum, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);
++STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,
++ INT_MAX, 0);
++STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_max_budget_async_rq_store, &bfqd->bfq_max_budget_async_rq,
++ 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_timeout_async_store, &bfqd->bfq_timeout[BLK_RW_ASYNC], 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_coeff_store, &bfqd->bfq_wr_coeff, 1, INT_MAX, 0);
++STORE_FUNCTION(bfq_wr_max_time_store, &bfqd->bfq_wr_max_time, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_rt_max_time_store, &bfqd->bfq_wr_rt_max_time, 0, INT_MAX,
++ 1);
++STORE_FUNCTION(bfq_wr_min_idle_time_store, &bfqd->bfq_wr_min_idle_time, 0,
++ INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_min_inter_arr_async_store,
++ &bfqd->bfq_wr_min_inter_arr_async, 0, INT_MAX, 1);
++STORE_FUNCTION(bfq_wr_max_softrt_rate_store, &bfqd->bfq_wr_max_softrt_rate, 0,
++ INT_MAX, 0);
++#undef STORE_FUNCTION
++
++/* do nothing for the moment */
++static ssize_t bfq_weights_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ return count;
++}
++
++static inline unsigned long bfq_estimated_max_budget(struct bfq_data *bfqd)
++{
++ u64 timeout = jiffies_to_msecs(bfqd->bfq_timeout[BLK_RW_SYNC]);
++
++ if (bfqd->peak_rate_samples >= BFQ_PEAK_RATE_SAMPLES)
++ return bfq_calc_max_budget(bfqd->peak_rate, timeout);
++ else
++ return bfq_default_max_budget;
++}
++
++static ssize_t bfq_max_budget_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++ else {
++ if (__data > INT_MAX)
++ __data = INT_MAX;
++ bfqd->bfq_max_budget = __data;
++ }
++
++ bfqd->bfq_user_max_budget = __data;
++
++ return ret;
++}
++
++static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data < 1)
++ __data = 1;
++ else if (__data > INT_MAX)
++ __data = INT_MAX;
++
++ bfqd->bfq_timeout[BLK_RW_SYNC] = msecs_to_jiffies(__data);
++ if (bfqd->bfq_user_max_budget == 0)
++ bfqd->bfq_max_budget = bfq_estimated_max_budget(bfqd);
++
++ return ret;
++}
++
++static ssize_t bfq_low_latency_store(struct elevator_queue *e,
++ const char *page, size_t count)
++{
++ struct bfq_data *bfqd = e->elevator_data;
++ unsigned long uninitialized_var(__data);
++ int ret = bfq_var_store(&__data, (page), count);
++
++ if (__data > 1)
++ __data = 1;
++ if (__data == 0 && bfqd->low_latency != 0)
++ bfq_end_wr(bfqd);
++ bfqd->low_latency = __data;
++
++ return ret;
++}
++
++#define BFQ_ATTR(name) \
++ __ATTR(name, S_IRUGO|S_IWUSR, bfq_##name##_show, bfq_##name##_store)
++
++static struct elv_fs_entry bfq_attrs[] = {
++ BFQ_ATTR(quantum),
++ BFQ_ATTR(fifo_expire_sync),
++ BFQ_ATTR(fifo_expire_async),
++ BFQ_ATTR(back_seek_max),
++ BFQ_ATTR(back_seek_penalty),
++ BFQ_ATTR(slice_idle),
++ BFQ_ATTR(max_budget),
++ BFQ_ATTR(max_budget_async_rq),
++ BFQ_ATTR(timeout_sync),
++ BFQ_ATTR(timeout_async),
++ BFQ_ATTR(low_latency),
++ BFQ_ATTR(wr_coeff),
++ BFQ_ATTR(wr_max_time),
++ BFQ_ATTR(wr_rt_max_time),
++ BFQ_ATTR(wr_min_idle_time),
++ BFQ_ATTR(wr_min_inter_arr_async),
++ BFQ_ATTR(wr_max_softrt_rate),
++ BFQ_ATTR(weights),
++ __ATTR_NULL
++};
++
++static struct elevator_type iosched_bfq = {
++ .ops = {
++ .elevator_merge_fn = bfq_merge,
++ .elevator_merged_fn = bfq_merged_request,
++ .elevator_merge_req_fn = bfq_merged_requests,
++ .elevator_allow_merge_fn = bfq_allow_merge,
++ .elevator_dispatch_fn = bfq_dispatch_requests,
++ .elevator_add_req_fn = bfq_insert_request,
++ .elevator_activate_req_fn = bfq_activate_request,
++ .elevator_deactivate_req_fn = bfq_deactivate_request,
++ .elevator_completed_req_fn = bfq_completed_request,
++ .elevator_former_req_fn = elv_rb_former_request,
++ .elevator_latter_req_fn = elv_rb_latter_request,
++ .elevator_init_icq_fn = bfq_init_icq,
++ .elevator_exit_icq_fn = bfq_exit_icq,
++ .elevator_set_req_fn = bfq_set_request,
++ .elevator_put_req_fn = bfq_put_request,
++ .elevator_may_queue_fn = bfq_may_queue,
++ .elevator_init_fn = bfq_init_queue,
++ .elevator_exit_fn = bfq_exit_queue,
++ },
++ .icq_size = sizeof(struct bfq_io_cq),
++ .icq_align = __alignof__(struct bfq_io_cq),
++ .elevator_attrs = bfq_attrs,
++ .elevator_name = "bfq",
++ .elevator_owner = THIS_MODULE,
++};
++
++static int __init bfq_init(void)
++{
++ /*
++ * Can be 0 on HZ < 1000 setups.
++ */
++ if (bfq_slice_idle == 0)
++ bfq_slice_idle = 1;
++
++ if (bfq_timeout_async == 0)
++ bfq_timeout_async = 1;
++
++ if (bfq_slab_setup())
++ return -ENOMEM;
++
++ /*
++ * Times to load large popular applications for the typical systems
++ * installed on the reference devices (see the comments before the
++ * definitions of the two arrays).
++ */
++ T_slow[0] = msecs_to_jiffies(2600);
++ T_slow[1] = msecs_to_jiffies(1000);
++ T_fast[0] = msecs_to_jiffies(5500);
++ T_fast[1] = msecs_to_jiffies(2000);
++
++ /*
++ * Thresholds that determine the switch between speed classes (see
++ * the comments before the definition of the array).
++ */
++ device_speed_thresh[0] = (R_fast[0] + R_slow[0]) / 2;
++ device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
++
++ elv_register(&iosched_bfq);
++ pr_info("BFQ I/O-scheduler version: v7r5");
++
++ return 0;
++}
++
++static void __exit bfq_exit(void)
++{
++ elv_unregister(&iosched_bfq);
++ bfq_slab_kill();
++}
++
++module_init(bfq_init);
++module_exit(bfq_exit);
++
++MODULE_AUTHOR("Fabio Checconi, Paolo Valente");
++MODULE_LICENSE("GPL");
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+new file mode 100644
+index 0000000..c4831b7
+--- /dev/null
++++ b/block/bfq-sched.c
+@@ -0,0 +1,1207 @@
++/*
++ * BFQ: Hierarchical B-WF2Q+ scheduler.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifdef CONFIG_CGROUP_BFQIO
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = entity->parent)
++
++#define for_each_entity_safe(entity, parent) \
++ for (; entity && ({ parent = entity->parent; 1; }); entity = parent)
++
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd);
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++ struct bfq_entity *bfqg_entity;
++ struct bfq_group *bfqg;
++ struct bfq_sched_data *group_sd;
++
++ BUG_ON(next_in_service == NULL);
++
++ group_sd = next_in_service->sched_data;
++
++ bfqg = container_of(group_sd, struct bfq_group, sched_data);
++ /*
++ * bfq_group's my_entity field is not NULL only if the group
++ * is not the root group. We must not touch the root entity
++ * as it must never become an in-service entity.
++ */
++ bfqg_entity = bfqg->my_entity;
++ if (bfqg_entity != NULL)
++ bfqg_entity->budget = next_in_service->budget;
++}
++
++static int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ struct bfq_entity *next_in_service;
++
++ if (sd->in_service_entity != NULL)
++ /* will update/requeue at the end of service */
++ return 0;
++
++ /*
++ * NOTE: this can be improved in many ways, such as returning
++ * 1 (and thus propagating upwards the update) only when the
++ * budget changes, or caching the bfqq that will be scheduled
++ * next from this subtree. For now we worry more about
++ * correctness than about performance...
++ */
++ next_in_service = bfq_lookup_next_entity(sd, 0, NULL);
++ sd->next_in_service = next_in_service;
++
++ if (next_in_service != NULL)
++ bfq_update_budget(next_in_service);
++
++ return 1;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++ BUG_ON(sd->next_in_service != entity);
++}
++#else
++#define for_each_entity(entity) \
++ for (; entity != NULL; entity = NULL)
++
++#define for_each_entity_safe(entity, parent) \
++ for (parent = NULL; entity != NULL; entity = parent)
++
++static inline int bfq_update_next_in_service(struct bfq_sched_data *sd)
++{
++ return 0;
++}
++
++static inline void bfq_check_next_in_service(struct bfq_sched_data *sd,
++ struct bfq_entity *entity)
++{
++}
++
++static inline void bfq_update_budget(struct bfq_entity *next_in_service)
++{
++}
++#endif
++
++/*
++ * Shift for timestamp calculations. This actually limits the maximum
++ * service allowed in one timestamp delta (small shift values increase it),
++ * the maximum total weight that can be used for the queues in the system
++ * (big shift values increase it), and the period of virtual time
++ * wraparounds.
++ */
++#define WFQ_SERVICE_SHIFT 22
++
++/**
++ * bfq_gt - compare two timestamps.
++ * @a: first ts.
++ * @b: second ts.
++ *
++ * Return @a > @b, dealing with wrapping correctly.
++ */
++static inline int bfq_gt(u64 a, u64 b)
++{
++ return (s64)(a - b) > 0;
++}
++
++static inline struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = NULL;
++
++ BUG_ON(entity == NULL);
++
++ if (entity->my_sched_data == NULL)
++ bfqq = container_of(entity, struct bfq_queue, entity);
++
++ return bfqq;
++}
++
++
++/**
++ * bfq_delta - map service into the virtual time domain.
++ * @service: amount of service.
++ * @weight: scale factor (weight of an entity or weight sum).
++ */
++static inline u64 bfq_delta(unsigned long service,
++ unsigned long weight)
++{
++ u64 d = (u64)service << WFQ_SERVICE_SHIFT;
++
++ do_div(d, weight);
++ return d;
++}
++
++/**
++ * bfq_calc_finish - assign the finish time to an entity.
++ * @entity: the entity to act upon.
++ * @service: the service to be charged to the entity.
++ */
++static inline void bfq_calc_finish(struct bfq_entity *entity,
++ unsigned long service)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ BUG_ON(entity->weight == 0);
++
++ entity->finish = entity->start +
++ bfq_delta(service, entity->weight);
++
++ if (bfqq != NULL) {
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: serv %lu, w %d",
++ service, entity->weight);
++ bfq_log_bfqq(bfqq->bfqd, bfqq,
++ "calc_finish: start %llu, finish %llu, delta %llu",
++ entity->start, entity->finish,
++ bfq_delta(service, entity->weight));
++ }
++}
++
++/**
++ * bfq_entity_of - get an entity from a node.
++ * @node: the node field of the entity.
++ *
++ * Convert a node pointer to the relative entity. This is used only
++ * to simplify the logic of some functions and not as the generic
++ * conversion mechanism because, e.g., in the tree walking functions,
++ * the check for a %NULL value would be redundant.
++ */
++static inline struct bfq_entity *bfq_entity_of(struct rb_node *node)
++{
++ struct bfq_entity *entity = NULL;
++
++ if (node != NULL)
++ entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ return entity;
++}
++
++/**
++ * bfq_extract - remove an entity from a tree.
++ * @root: the tree root.
++ * @entity: the entity to remove.
++ */
++static inline void bfq_extract(struct rb_root *root,
++ struct bfq_entity *entity)
++{
++ BUG_ON(entity->tree != root);
++
++ entity->tree = NULL;
++ rb_erase(&entity->rb_node, root);
++}
++
++/**
++ * bfq_idle_extract - extract an entity from the idle tree.
++ * @st: the service tree of the owning @entity.
++ * @entity: the entity being removed.
++ */
++static void bfq_idle_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *next;
++
++ BUG_ON(entity->tree != &st->idle);
++
++ if (entity == st->first_idle) {
++ next = rb_next(&entity->rb_node);
++ st->first_idle = bfq_entity_of(next);
++ }
++
++ if (entity == st->last_idle) {
++ next = rb_prev(&entity->rb_node);
++ st->last_idle = bfq_entity_of(next);
++ }
++
++ bfq_extract(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++}
++
++/**
++ * bfq_insert - generic tree insertion.
++ * @root: tree root.
++ * @entity: entity to insert.
++ *
++ * This is used for the idle and the active tree, since they are both
++ * ordered by finish time.
++ */
++static void bfq_insert(struct rb_root *root, struct bfq_entity *entity)
++{
++ struct bfq_entity *entry;
++ struct rb_node **node = &root->rb_node;
++ struct rb_node *parent = NULL;
++
++ BUG_ON(entity->tree != NULL);
++
++ while (*node != NULL) {
++ parent = *node;
++ entry = rb_entry(parent, struct bfq_entity, rb_node);
++
++ if (bfq_gt(entry->finish, entity->finish))
++ node = &parent->rb_left;
++ else
++ node = &parent->rb_right;
++ }
++
++ rb_link_node(&entity->rb_node, parent, node);
++ rb_insert_color(&entity->rb_node, root);
++
++ entity->tree = root;
++}
++
++/**
++ * bfq_update_min - update the min_start field of an entity.
++ * @entity: the entity to update.
++ * @node: one of its children.
++ *
++ * This function is called when @entity may store an invalid value for
++ * min_start due to updates to the active tree. The function assumes
++ * that the subtree rooted at @node (which may be its left or its right
++ * child) has a valid min_start value.
++ */
++static inline void bfq_update_min(struct bfq_entity *entity,
++ struct rb_node *node)
++{
++ struct bfq_entity *child;
++
++ if (node != NULL) {
++ child = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entity->min_start, child->min_start))
++ entity->min_start = child->min_start;
++ }
++}
++
++/**
++ * bfq_update_active_node - recalculate min_start.
++ * @node: the node to update.
++ *
++ * @node may have changed position or one of its children may have moved,
++ * this function updates its min_start value. The left and right subtrees
++ * are assumed to hold a correct min_start value.
++ */
++static inline void bfq_update_active_node(struct rb_node *node)
++{
++ struct bfq_entity *entity = rb_entry(node, struct bfq_entity, rb_node);
++
++ entity->min_start = entity->start;
++ bfq_update_min(entity, node->rb_right);
++ bfq_update_min(entity, node->rb_left);
++}
++
++/**
++ * bfq_update_active_tree - update min_start for the whole active tree.
++ * @node: the starting node.
++ *
++ * @node must be the deepest modified node after an update. This function
++ * updates its min_start using the values held by its children, assuming
++ * that they did not change, and then updates all the nodes that may have
++ * changed in the path to the root. The only nodes that may have changed
++ * are the ones in the path or their siblings.
++ */
++static void bfq_update_active_tree(struct rb_node *node)
++{
++ struct rb_node *parent;
++
++up:
++ bfq_update_active_node(node);
++
++ parent = rb_parent(node);
++ if (parent == NULL)
++ return;
++
++ if (node == parent->rb_left && parent->rb_right != NULL)
++ bfq_update_active_node(parent->rb_right);
++ else if (parent->rb_left != NULL)
++ bfq_update_active_node(parent->rb_left);
++
++ node = parent;
++ goto up;
++}
++
++static void bfq_weights_tree_add(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++static void bfq_weights_tree_remove(struct bfq_data *bfqd,
++ struct bfq_entity *entity,
++ struct rb_root *root);
++
++
++/**
++ * bfq_active_insert - insert an entity in the active tree of its
++ * group/device.
++ * @st: the service tree of the entity.
++ * @entity: the entity being inserted.
++ *
++ * The active tree is ordered by finish time, but an extra key is kept
++ * for each node, containing the minimum value for the start times of
++ * its children (and the node itself), so it's possible to search for
++ * the eligible node with the lowest finish time in logarithmic time.
++ */
++static void bfq_active_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node = &entity->rb_node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ bfq_insert(&st->active, entity);
++
++ if (node->rb_left != NULL)
++ node = node->rb_left;
++ else if (node->rb_right != NULL)
++ node = node->rb_right;
++
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->active_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_add(bfqd, entity, &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ bfqg->active_entities++;
++ if (bfqg->active_entities == 2)
++ bfqd->active_numerous_groups++;
++ }
++#endif
++}
++
++/**
++ * bfq_ioprio_to_weight - calc a weight from an ioprio.
++ * @ioprio: the ioprio value to convert.
++ */
++static inline unsigned short bfq_ioprio_to_weight(int ioprio)
++{
++ BUG_ON(ioprio < 0 || ioprio >= IOPRIO_BE_NR);
++ return IOPRIO_BE_NR - ioprio;
++}
++
++/**
++ * bfq_weight_to_ioprio - calc an ioprio from a weight.
++ * @weight: the weight value to convert.
++ *
++ * To preserve as much as possible the old only-ioprio user interface,
++ * 0 is used as an escape ioprio value for weights (numerically) equal to
++ * or larger than IOPRIO_BE_NR.
++ */
++static inline unsigned short bfq_weight_to_ioprio(int weight)
++{
++ BUG_ON(weight < BFQ_MIN_WEIGHT || weight > BFQ_MAX_WEIGHT);
++ return IOPRIO_BE_NR - weight < 0 ? 0 : IOPRIO_BE_NR - weight;
++}
++
++static inline void bfq_get_entity(struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++
++ if (bfqq != NULL) {
++ atomic_inc(&bfqq->ref);
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ }
++}
++
++/**
++ * bfq_find_deepest - find the deepest node that an extraction can modify.
++ * @node: the node being removed.
++ *
++ * Do the first step of an extraction in an rb tree, looking for the
++ * node that will replace @node, and returning the deepest node that
++ * the following modifications to the tree can touch. If @node is the
++ * last node in the tree return %NULL.
++ */
++static struct rb_node *bfq_find_deepest(struct rb_node *node)
++{
++ struct rb_node *deepest;
++
++ if (node->rb_right == NULL && node->rb_left == NULL)
++ deepest = rb_parent(node);
++ else if (node->rb_right == NULL)
++ deepest = node->rb_left;
++ else if (node->rb_left == NULL)
++ deepest = node->rb_right;
++ else {
++ deepest = rb_next(node);
++ if (deepest->rb_right != NULL)
++ deepest = deepest->rb_right;
++ else if (rb_parent(deepest) != node)
++ deepest = rb_parent(deepest);
++ }
++
++ return deepest;
++}
++
++/**
++ * bfq_active_extract - remove an entity from the active tree.
++ * @st: the service_tree containing the tree.
++ * @entity: the entity being removed.
++ */
++static void bfq_active_extract(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct rb_node *node;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd = NULL;
++ struct bfq_group *bfqg = NULL;
++ struct bfq_data *bfqd = NULL;
++#endif
++
++ node = bfq_find_deepest(&entity->rb_node);
++ bfq_extract(&st->active, entity);
++
++ if (node != NULL)
++ bfq_update_active_tree(node);
++
++#ifdef CONFIG_CGROUP_BFQIO
++ sd = entity->sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++#endif
++ if (bfqq != NULL)
++ list_del(&bfqq->bfqq_list);
++#ifdef CONFIG_CGROUP_BFQIO
++ else { /* bfq_group */
++ BUG_ON(!bfqd);
++ bfq_weights_tree_remove(bfqd, entity,
++ &bfqd->group_weights_tree);
++ }
++ if (bfqg != bfqd->root_group) {
++ BUG_ON(!bfqg);
++ BUG_ON(!bfqd);
++ BUG_ON(!bfqg->active_entities);
++ bfqg->active_entities--;
++ if (bfqg->active_entities == 1) {
++ BUG_ON(!bfqd->active_numerous_groups);
++ bfqd->active_numerous_groups--;
++ }
++ }
++#endif
++}
++
++/**
++ * bfq_idle_insert - insert an entity into the idle tree.
++ * @st: the service tree containing the tree.
++ * @entity: the entity to insert.
++ */
++static void bfq_idle_insert(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (first_idle == NULL || bfq_gt(first_idle->finish, entity->finish))
++ st->first_idle = entity;
++ if (last_idle == NULL || bfq_gt(entity->finish, last_idle->finish))
++ st->last_idle = entity;
++
++ bfq_insert(&st->idle, entity);
++
++ if (bfqq != NULL)
++ list_add(&bfqq->bfqq_list, &bfqq->bfqd->idle_list);
++}
++
++/**
++ * bfq_forget_entity - remove an entity from the wfq trees.
++ * @st: the service tree.
++ * @entity: the entity being removed.
++ *
++ * Update the device status and forget everything about @entity, putting
++ * the device reference to it, if it is a queue. Entities belonging to
++ * groups are not refcounted.
++ */
++static void bfq_forget_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ struct bfq_sched_data *sd;
++
++ BUG_ON(!entity->on_st);
++
++ entity->on_st = 0;
++ st->wsum -= entity->weight;
++ if (bfqq != NULL) {
++ sd = entity->sched_data;
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "forget_entity: %p %d",
++ bfqq, atomic_read(&bfqq->ref));
++ bfq_put_queue(bfqq);
++ }
++}
++
++/**
++ * bfq_put_idle_entity - release the idle tree ref of an entity.
++ * @st: service tree for the entity.
++ * @entity: the entity being released.
++ */
++static void bfq_put_idle_entity(struct bfq_service_tree *st,
++ struct bfq_entity *entity)
++{
++ bfq_idle_extract(st, entity);
++ bfq_forget_entity(st, entity);
++}
++
++/**
++ * bfq_forget_idle - update the idle tree if necessary.
++ * @st: the service tree to act upon.
++ *
++ * To preserve the global O(log N) complexity we only remove one entry here;
++ * as the idle tree will not grow indefinitely this can be done safely.
++ */
++static void bfq_forget_idle(struct bfq_service_tree *st)
++{
++ struct bfq_entity *first_idle = st->first_idle;
++ struct bfq_entity *last_idle = st->last_idle;
++
++ if (RB_EMPTY_ROOT(&st->active) && last_idle != NULL &&
++ !bfq_gt(last_idle->finish, st->vtime)) {
++ /*
++ * Forget the whole idle tree, increasing the vtime past
++ * the last finish time of idle entities.
++ */
++ st->vtime = last_idle->finish;
++ }
++
++ if (first_idle != NULL && !bfq_gt(first_idle->finish, st->vtime))
++ bfq_put_idle_entity(st, first_idle);
++}
++
++static struct bfq_service_tree *
++__bfq_entity_update_weight_prio(struct bfq_service_tree *old_st,
++ struct bfq_entity *entity)
++{
++ struct bfq_service_tree *new_st = old_st;
++
++ if (entity->ioprio_changed) {
++ struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
++ unsigned short prev_weight, new_weight;
++ struct bfq_data *bfqd = NULL;
++ struct rb_root *root;
++#ifdef CONFIG_CGROUP_BFQIO
++ struct bfq_sched_data *sd;
++ struct bfq_group *bfqg;
++#endif
++
++ if (bfqq != NULL)
++ bfqd = bfqq->bfqd;
++#ifdef CONFIG_CGROUP_BFQIO
++ else {
++ sd = entity->my_sched_data;
++ bfqg = container_of(sd, struct bfq_group, sched_data);
++ BUG_ON(!bfqg);
++ bfqd = (struct bfq_data *)bfqg->bfqd;
++ BUG_ON(!bfqd);
++ }
++#endif
++
++ BUG_ON(old_st->wsum < entity->weight);
++ old_st->wsum -= entity->weight;
++
++ if (entity->new_weight != entity->orig_weight) {
++ entity->orig_weight = entity->new_weight;
++ entity->ioprio =
++ bfq_weight_to_ioprio(entity->orig_weight);
++ } else if (entity->new_ioprio != entity->ioprio) {
++ entity->ioprio = entity->new_ioprio;
++ entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++ } else
++ entity->new_weight = entity->orig_weight =
++ bfq_ioprio_to_weight(entity->ioprio);
++
++ entity->ioprio_class = entity->new_ioprio_class;
++ entity->ioprio_changed = 0;
++
++ /*
++ * NOTE: here we may be changing the weight too early,
++ * this will cause unfairness. The correct approach
++ * would have required additional complexity to defer
++ * weight changes to the proper time instants (i.e.,
++ * when entity->finish <= old_st->vtime).
++ */
++ new_st = bfq_entity_service_tree(entity);
++
++ prev_weight = entity->weight;
++ new_weight = entity->orig_weight *
++ (bfqq != NULL ? bfqq->wr_coeff : 1);
++ /*
++ * If the weight of the entity changes, remove the entity
++ * from its old weight counter (if there is a counter
++ * associated with the entity), and add it to the counter
++ * associated with its new weight.
++ */
++ if (prev_weight != new_weight) {
++ root = bfqq ? &bfqd->queue_weights_tree :
++ &bfqd->group_weights_tree;
++ bfq_weights_tree_remove(bfqd, entity, root);
++ }
++ entity->weight = new_weight;
++ /*
++ * Add the entity to its weights tree only if it is
++ * not associated with a weight-raised queue.
++ */
++ if (prev_weight != new_weight &&
++ (bfqq ? bfqq->wr_coeff == 1 : 1))
++ /* If we get here, root has been initialized. */
++ bfq_weights_tree_add(bfqd, entity, root);
++
++ new_st->wsum += entity->weight;
++
++ if (new_st != old_st)
++ entity->start = new_st->vtime;
++ }
++
++ return new_st;
++}
++
++/**
++ * bfq_bfqq_served - update the scheduler status after selection for
++ * service.
++ * @bfqq: the queue being served.
++ * @served: bytes to transfer.
++ *
++ * NOTE: this can be optimized, as the timestamps of upper level entities
++ * are synchronized every time a new bfqq is selected for service. For now,
++ * we keep it this way to better check consistency.
++ */
++static void bfq_bfqq_served(struct bfq_queue *bfqq, unsigned long served)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++ struct bfq_service_tree *st;
++
++ for_each_entity(entity) {
++ st = bfq_entity_service_tree(entity);
++
++ entity->service += served;
++ BUG_ON(entity->service > entity->budget);
++ BUG_ON(st->wsum == 0);
++
++ st->vtime += bfq_delta(served, st->wsum);
++ bfq_forget_idle(st);
++ }
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "bfqq_served %lu secs", served);
++}
++
++/**
++ * bfq_bfqq_charge_full_budget - set the service to the entity budget.
++ * @bfqq: the queue that needs a service update.
++ *
++ * When it's not possible to be fair in the service domain, because
++ * a queue is not consuming its budget fast enough (the meaning of
++ * fast depends on the timeout parameter), we charge it a full
++ * budget. In this way we should obtain a sort of time-domain
++ * fairness among all the seeky/slow queues.
++ */
++static inline void bfq_bfqq_charge_full_budget(struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "charge_full_budget");
++
++ bfq_bfqq_served(bfqq, entity->budget - entity->service);
++}
++
++/**
++ * __bfq_activate_entity - activate an entity.
++ * @entity: the entity being activated.
++ *
++ * Called whenever an entity is activated, i.e., it is not active and one
++ * of its children receives a new request, or has to be reactivated due to
++ * budget exhaustion. It uses the current budget of the entity (and the
++ * service received if @entity is active) of the queue to calculate its
++ * timestamps.
++ */
++static void __bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++
++ if (entity == sd->in_service_entity) {
++ BUG_ON(entity->tree != NULL);
++ /*
++ * If we are requeueing the current entity we have
++ * to take care of not charging to it service it has
++ * not received.
++ */
++ bfq_calc_finish(entity, entity->service);
++ entity->start = entity->finish;
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active) {
++ /*
++ * Requeueing an entity due to a change of some
++ * next_in_service entity below it. We reuse the
++ * old start time.
++ */
++ bfq_active_extract(st, entity);
++ } else if (entity->tree == &st->idle) {
++ /*
++ * Must be on the idle tree, bfq_idle_extract() will
++ * check for that.
++ */
++ bfq_idle_extract(st, entity);
++ entity->start = bfq_gt(st->vtime, entity->finish) ?
++ st->vtime : entity->finish;
++ } else {
++ /*
++ * The finish time of the entity may be invalid, and
++ * it is in the past for sure, otherwise the queue
++ * would have been on the idle tree.
++ */
++ entity->start = st->vtime;
++ st->wsum += entity->weight;
++ bfq_get_entity(entity);
++
++ BUG_ON(entity->on_st);
++ entity->on_st = 1;
++ }
++
++ st = __bfq_entity_update_weight_prio(st, entity);
++ bfq_calc_finish(entity, entity->budget);
++ bfq_active_insert(st, entity);
++}
++
++/**
++ * bfq_activate_entity - activate an entity and its ancestors if necessary.
++ * @entity: the entity to activate.
++ *
++ * Activate @entity and all the entities on the path from it to the root.
++ */
++static void bfq_activate_entity(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sd;
++
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ /*
++ * No need to propagate the activation to the
++ * upper entities, as they will be updated when
++ * the in-service entity is rescheduled.
++ */
++ break;
++ }
++}
++
++/**
++ * __bfq_deactivate_entity - deactivate an entity from its service tree.
++ * @entity: the entity to deactivate.
++ * @requeue: if false, the entity will not be put into the idle tree.
++ *
++ * Deactivate an entity, independently of its previous state. If the
++ * entity was not on a service tree just return, otherwise if it is on
++ * any scheduler tree, extract it from that tree, and if necessary
++ * and if the caller did not specify @requeue, put it on the idle tree.
++ *
++ * Return %1 if the caller should update the entity hierarchy, i.e.,
++ * if the entity was in service or if it was the next_in_service for
++ * its sched_data; return %0 otherwise.
++ */
++static int __bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd = entity->sched_data;
++ struct bfq_service_tree *st = bfq_entity_service_tree(entity);
++ int was_in_service = entity == sd->in_service_entity;
++ int ret = 0;
++
++ if (!entity->on_st)
++ return 0;
++
++ BUG_ON(was_in_service && entity->tree != NULL);
++
++ if (was_in_service) {
++ bfq_calc_finish(entity, entity->service);
++ sd->in_service_entity = NULL;
++ } else if (entity->tree == &st->active)
++ bfq_active_extract(st, entity);
++ else if (entity->tree == &st->idle)
++ bfq_idle_extract(st, entity);
++ else if (entity->tree != NULL)
++ BUG();
++
++ if (was_in_service || sd->next_in_service == entity)
++ ret = bfq_update_next_in_service(sd);
++
++ if (!requeue || !bfq_gt(entity->finish, st->vtime))
++ bfq_forget_entity(st, entity);
++ else
++ bfq_idle_insert(st, entity);
++
++ BUG_ON(sd->in_service_entity == entity);
++ BUG_ON(sd->next_in_service == entity);
++
++ return ret;
++}
++
++/**
++ * bfq_deactivate_entity - deactivate an entity.
++ * @entity: the entity to deactivate.
++ * @requeue: true if the entity can be put on the idle tree
++ */
++static void bfq_deactivate_entity(struct bfq_entity *entity, int requeue)
++{
++ struct bfq_sched_data *sd;
++ struct bfq_entity *parent;
++
++ for_each_entity_safe(entity, parent) {
++ sd = entity->sched_data;
++
++ if (!__bfq_deactivate_entity(entity, requeue))
++ /*
++ * The parent entity is still backlogged, and
++ * we don't need to update it as it is still
++ * in service.
++ */
++ break;
++
++ if (sd->next_in_service != NULL)
++ /*
++ * The parent entity is still backlogged and
++ * the budgets on the path towards the root
++ * need to be updated.
++ */
++ goto update;
++
++ /*
++ * If we reach this point, the parent is no longer backlogged
++ * and we want to propagate the dequeue upwards.
++ */
++ requeue = 1;
++ }
++
++ return;
++
++update:
++ entity = parent;
++ for_each_entity(entity) {
++ __bfq_activate_entity(entity);
++
++ sd = entity->sched_data;
++ if (!bfq_update_next_in_service(sd))
++ break;
++ }
++}
++
++/**
++ * bfq_update_vtime - update vtime if necessary.
++ * @st: the service tree to act upon.
++ *
++ * If necessary update the service tree vtime to have at least one
++ * eligible entity, skipping to its start time. Assumes that the
++ * active tree of the device is not empty.
++ *
++ * NOTE: this hierarchical implementation updates vtimes quite often;
++ * we may end up with reactivated processes getting timestamps after a
++ * vtime skip done because we needed a ->first_active entity on some
++ * intermediate node.
++ */
++static void bfq_update_vtime(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry;
++ struct rb_node *node = st->active.rb_node;
++
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++ if (bfq_gt(entry->min_start, st->vtime)) {
++ st->vtime = entry->min_start;
++ bfq_forget_idle(st);
++ }
++}
++
++/**
++ * bfq_first_active_entity - find the eligible entity with
++ * the smallest finish time
++ * @st: the service tree to select from.
++ *
++ * This function searches for the first schedulable entity, starting from
++ * the root of the tree and descending into the left subtree whenever it
++ * contains at least one eligible (start <= vtime) entity. The path on
++ * the right is followed only if a) the left subtree contains no eligible
++ * entities and b) no eligible entity has been found yet.
++ */
++static struct bfq_entity *bfq_first_active_entity(struct bfq_service_tree *st)
++{
++ struct bfq_entity *entry, *first = NULL;
++ struct rb_node *node = st->active.rb_node;
++
++ while (node != NULL) {
++ entry = rb_entry(node, struct bfq_entity, rb_node);
++left:
++ if (!bfq_gt(entry->start, st->vtime))
++ first = entry;
++
++ BUG_ON(bfq_gt(entry->min_start, st->vtime));
++
++ if (node->rb_left != NULL) {
++ entry = rb_entry(node->rb_left,
++ struct bfq_entity, rb_node);
++ if (!bfq_gt(entry->min_start, st->vtime)) {
++ node = node->rb_left;
++ goto left;
++ }
++ }
++ if (first != NULL)
++ break;
++ node = node->rb_right;
++ }
++
++ BUG_ON(first == NULL && !RB_EMPTY_ROOT(&st->active));
++ return first;
++}
++
++/**
++ * __bfq_lookup_next_entity - return the first eligible entity in @st.
++ * @st: the service tree.
++ *
++ * Update the virtual time in @st and return the first eligible entity
++ * it contains.
++ */
++static struct bfq_entity *__bfq_lookup_next_entity(struct bfq_service_tree *st,
++ bool force)
++{
++ struct bfq_entity *entity, *new_next_in_service = NULL;
++
++ if (RB_EMPTY_ROOT(&st->active))
++ return NULL;
++
++ bfq_update_vtime(st);
++ entity = bfq_first_active_entity(st);
++ BUG_ON(bfq_gt(entity->start, st->vtime));
++
++ /*
++ * If the chosen entity does not match with the sched_data's
++ * next_in_service and we are forcibly serving the IDLE priority
++ * class tree, bubble the budget update up.
++ */
++ if (unlikely(force && entity != entity->sched_data->next_in_service)) {
++ new_next_in_service = entity;
++ for_each_entity(new_next_in_service)
++ bfq_update_budget(new_next_in_service);
++ }
++
++ return entity;
++}
++
++/**
++ * bfq_lookup_next_entity - return the first eligible entity in @sd.
++ * @sd: the sched_data.
++ * @extract: if true the returned entity will be also extracted from @sd.
++ *
++ * NOTE: since we cache the next_in_service entity at each level of the
++ * hierarchy, the complexity of the lookup can be decreased with
++ * absolutely no effort just returning the cached next_in_service value;
++ * we prefer to do full lookups to test the consistency of the data
++ * structures.
++ */
++static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
++ int extract,
++ struct bfq_data *bfqd)
++{
++ struct bfq_service_tree *st = sd->service_tree;
++ struct bfq_entity *entity;
++ int i = 0;
++
++ BUG_ON(sd->in_service_entity != NULL);
++
++ if (bfqd != NULL &&
++ jiffies - bfqd->bfq_class_idle_last_service > BFQ_CL_IDLE_TIMEOUT) {
++ entity = __bfq_lookup_next_entity(st + BFQ_IOPRIO_CLASSES - 1,
++ true);
++ if (entity != NULL) {
++ i = BFQ_IOPRIO_CLASSES - 1;
++ bfqd->bfq_class_idle_last_service = jiffies;
++ sd->next_in_service = entity;
++ }
++ }
++ for (; i < BFQ_IOPRIO_CLASSES; i++) {
++ entity = __bfq_lookup_next_entity(st + i, false);
++ if (entity != NULL) {
++ if (extract) {
++ bfq_check_next_in_service(sd, entity);
++ bfq_active_extract(st + i, entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ }
++ break;
++ }
++ }
++
++ return entity;
++}
++
++/*
++ * Get next queue for service.
++ */
++static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
++{
++ struct bfq_entity *entity = NULL;
++ struct bfq_sched_data *sd;
++ struct bfq_queue *bfqq;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ if (bfqd->busy_queues == 0)
++ return NULL;
++
++ sd = &bfqd->root_group->sched_data;
++ for (; sd != NULL; sd = entity->my_sched_data) {
++ entity = bfq_lookup_next_entity(sd, 1, bfqd);
++ BUG_ON(entity == NULL);
++ entity->service = 0;
++ }
++
++ bfqq = bfq_entity_to_bfqq(entity);
++ BUG_ON(bfqq == NULL);
++
++ return bfqq;
++}
++
++/*
++ * Forced extraction of the given queue.
++ */
++static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity;
++ struct bfq_sched_data *sd;
++
++ BUG_ON(bfqd->in_service_queue != NULL);
++
++ entity = &bfqq->entity;
++ /*
++ * Bubble up extraction/update from the leaf to the root.
++ */
++ for_each_entity(entity) {
++ sd = entity->sched_data;
++ bfq_update_budget(entity);
++ bfq_update_vtime(bfq_entity_service_tree(entity));
++ bfq_active_extract(bfq_entity_service_tree(entity), entity);
++ sd->in_service_entity = entity;
++ sd->next_in_service = NULL;
++ entity->service = 0;
++ }
++
++ return;
++}
++
++static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
++{
++ if (bfqd->in_service_bic != NULL) {
++ put_io_context(bfqd->in_service_bic->icq.ioc);
++ bfqd->in_service_bic = NULL;
++ }
++
++ bfqd->in_service_queue = NULL;
++ del_timer(&bfqd->idle_slice_timer);
++}
++
++static void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ if (bfqq == bfqd->in_service_queue)
++ __bfq_bfqd_reset_in_service(bfqd);
++
++ bfq_deactivate_entity(entity, requeue);
++}
++
++static void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ struct bfq_entity *entity = &bfqq->entity;
++
++ bfq_activate_entity(entity);
++}
++
++/*
++ * Called when the bfqq no longer has requests pending, remove it from
++ * the service tree.
++ */
++static void bfq_del_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ int requeue)
++{
++ BUG_ON(!bfq_bfqq_busy(bfqq));
++ BUG_ON(!RB_EMPTY_ROOT(&bfqq->sort_list));
++
++ bfq_log_bfqq(bfqd, bfqq, "del from busy");
++
++ bfq_clear_bfqq_busy(bfqq);
++
++ BUG_ON(bfqd->busy_queues == 0);
++ bfqd->busy_queues--;
++
++ if (!bfqq->dispatched) {
++ bfq_weights_tree_remove(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ BUG_ON(!bfqd->busy_in_flight_queues);
++ bfqd->busy_in_flight_queues--;
++ if (bfq_bfqq_constantly_seeky(bfqq)) {
++ BUG_ON(!bfqd->
++ const_seeky_busy_in_flight_queues);
++ bfqd->const_seeky_busy_in_flight_queues--;
++ }
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues--;
++
++ bfq_deactivate_bfqq(bfqd, bfqq, requeue);
++}
++
++/*
++ * Called when an inactive queue receives a new request.
++ */
++static void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ BUG_ON(bfq_bfqq_busy(bfqq));
++ BUG_ON(bfqq == bfqd->in_service_queue);
++
++ bfq_log_bfqq(bfqd, bfqq, "add to busy");
++
++ bfq_activate_bfqq(bfqd, bfqq);
++
++ bfq_mark_bfqq_busy(bfqq);
++ bfqd->busy_queues++;
++
++ if (!bfqq->dispatched) {
++ if (bfqq->wr_coeff == 1)
++ bfq_weights_tree_add(bfqd, &bfqq->entity,
++ &bfqd->queue_weights_tree);
++ if (!blk_queue_nonrot(bfqd->queue)) {
++ bfqd->busy_in_flight_queues++;
++ if (bfq_bfqq_constantly_seeky(bfqq))
++ bfqd->const_seeky_busy_in_flight_queues++;
++ }
++ }
++ if (bfqq->wr_coeff > 1)
++ bfqd->wr_busy_queues++;
++}
+diff --git a/block/bfq.h b/block/bfq.h
+new file mode 100644
+index 0000000..a83e69d
+--- /dev/null
++++ b/block/bfq.h
+@@ -0,0 +1,742 @@
++/*
++ * BFQ-v7r5 for 3.16.0: data structures and common functions prototypes.
++ *
++ * Based on ideas and code from CFQ:
++ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
++ *
++ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
++ * Paolo Valente <paolo.valente@unimore.it>
++ *
++ * Copyright (C) 2010 Paolo Valente <paolo.valente@unimore.it>
++ */
++
++#ifndef _BFQ_H
++#define _BFQ_H
++
++#include <linux/blktrace_api.h>
++#include <linux/hrtimer.h>
++#include <linux/ioprio.h>
++#include <linux/rbtree.h>
++
++#define BFQ_IOPRIO_CLASSES 3
++#define BFQ_CL_IDLE_TIMEOUT (HZ/5)
++
++#define BFQ_MIN_WEIGHT 1
++#define BFQ_MAX_WEIGHT 1000
++
++#define BFQ_DEFAULT_GRP_WEIGHT 10
++#define BFQ_DEFAULT_GRP_IOPRIO 0
++#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
++
++struct bfq_entity;
++
++/**
++ * struct bfq_service_tree - per ioprio_class service tree.
++ * @active: tree for active entities (i.e., those backlogged).
++ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F_i).
++ * @first_idle: idle entity with minimum F_i.
++ * @last_idle: idle entity with maximum F_i.
++ * @vtime: scheduler virtual time.
++ * @wsum: scheduler weight sum; active and idle entities contribute to it.
++ *
++ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
++ * ioprio_class has its own independent scheduler, and so its own
++ * bfq_service_tree. All the fields are protected by the queue lock
++ * of the containing bfqd.
++ */
++struct bfq_service_tree {
++ struct rb_root active;
++ struct rb_root idle;
++
++ struct bfq_entity *first_idle;
++ struct bfq_entity *last_idle;
++
++ u64 vtime;
++ unsigned long wsum;
++};
++
++/**
++ * struct bfq_sched_data - multi-class scheduler.
++ * @in_service_entity: entity in service.
++ * @next_in_service: head-of-the-line entity in the scheduler.
++ * @service_tree: array of service trees, one per ioprio_class.
++ *
++ * bfq_sched_data is the basic scheduler queue. It supports three
++ * ioprio_classes, and can be used either as a toplevel queue or as
++ * an intermediate queue on a hierarchical setup.
++ * @next_in_service points to the active entity of the sched_data
++ * service trees that will be scheduled next.
++ *
++ * The supported ioprio_classes are the same as in CFQ, in descending
++ * priority order, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE.
++ * Requests from higher priority queues are served before all the
++ * requests from lower priority queues; among requests of the same
++ * queue requests are served according to B-WF2Q+.
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_sched_data {
++ struct bfq_entity *in_service_entity;
++ struct bfq_entity *next_in_service;
++ struct bfq_service_tree service_tree[BFQ_IOPRIO_CLASSES];
++};
++
++/**
++ * struct bfq_weight_counter - counter of the number of all active entities
++ * with a given weight.
++ * @weight: weight of the entities that this counter refers to.
++ * @num_active: number of active entities with this weight.
++ * @weights_node: weights tree member (see bfq_data's @queue_weights_tree
++ * and @group_weights_tree).
++ */
++struct bfq_weight_counter {
++ short int weight;
++ unsigned int num_active;
++ struct rb_node weights_node;
++};
++
++/**
++ * struct bfq_entity - schedulable entity.
++ * @rb_node: service_tree member.
++ * @weight_counter: pointer to the weight counter associated with this entity.
++ * @on_st: flag, true if the entity is on a tree (either the active or
++ * the idle one of its service_tree).
++ * @finish: B-WF2Q+ finish timestamp (aka F_i).
++ * @start: B-WF2Q+ start timestamp (aka S_i).
++ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
++ * @min_start: minimum start time of the (active) subtree rooted at
++ * this entity; used for O(log N) lookups into active trees.
++ * @service: service received during the last round of service.
++ * @budget: budget used to calculate F_i; F_i = S_i + @budget / @weight.
++ * @weight: weight of the queue
++ * @parent: parent entity, for hierarchical scheduling.
++ * @my_sched_data: for non-leaf nodes in the cgroup hierarchy, the
++ * associated scheduler queue, %NULL on leaf nodes.
++ * @sched_data: the scheduler queue this entity belongs to.
++ * @ioprio: the ioprio in use.
++ * @new_weight: when a weight change is requested, the new weight value.
++ * @orig_weight: original weight, used to implement weight boosting
++ * @new_ioprio: when an ioprio change is requested, the new ioprio value.
++ * @ioprio_class: the ioprio_class in use.
++ * @new_ioprio_class: when an ioprio_class change is requested, the new
++ * ioprio_class value.
++ * @ioprio_changed: flag, true when the user requested a weight, ioprio or
++ * ioprio_class change.
++ *
++ * A bfq_entity is used to represent either a bfq_queue (leaf node in the
++ * cgroup hierarchy) or a bfq_group into the upper level scheduler. Each
++ * entity belongs to the sched_data of the parent group in the cgroup
++ * hierarchy. Non-leaf entities have also their own sched_data, stored
++ * in @my_sched_data.
++ *
++ * Each entity independently stores its priority values; this would
++ * allow different weights on different devices, but this
++ * functionality is not yet exported to userspace. Priorities and
++ * weights are updated lazily, first storing the new values into the
++ * new_* fields, then setting the @ioprio_changed flag. As soon as
++ * there is a transition in the entity state that allows the priority
++ * update to take place the effective and the requested priority
++ * values are synchronized.
++ *
++ * Unless cgroups are used, the weight value is calculated from the
++ * ioprio to export the same interface as CFQ. When dealing with
++ * ``well-behaved'' queues (i.e., queues that do not spend too much
++ * time to consume their budget and have true sequential behavior, and
++ * when there are no external factors breaking anticipation) the
++ * relative weights at each level of the cgroups hierarchy should be
++ * guaranteed. All the fields are protected by the queue lock of the
++ * containing bfqd.
++ */
++struct bfq_entity {
++ struct rb_node rb_node;
++ struct bfq_weight_counter *weight_counter;
++
++ int on_st;
++
++ u64 finish;
++ u64 start;
++
++ struct rb_root *tree;
++
++ u64 min_start;
++
++ unsigned long service, budget;
++ unsigned short weight, new_weight;
++ unsigned short orig_weight;
++
++ struct bfq_entity *parent;
++
++ struct bfq_sched_data *my_sched_data;
++ struct bfq_sched_data *sched_data;
++
++ unsigned short ioprio, new_ioprio;
++ unsigned short ioprio_class, new_ioprio_class;
++
++ int ioprio_changed;
++};
++
++struct bfq_group;
++
++/**
++ * struct bfq_queue - leaf schedulable entity.
++ * @ref: reference counter.
++ * @bfqd: parent bfq_data.
++ * @new_bfqq: shared bfq_queue if queue is cooperating with
++ * one or more other queues.
++ * @pos_node: request-position tree member (see bfq_data's @rq_pos_tree).
++ * @pos_root: request-position tree root (see bfq_data's @rq_pos_tree).
++ * @sort_list: sorted list of pending requests.
++ * @next_rq: if fifo isn't expired, next request to serve.
++ * @queued: nr of requests queued in @sort_list.
++ * @allocated: currently allocated requests.
++ * @meta_pending: pending metadata requests.
++ * @fifo: fifo list of requests in sort_list.
++ * @entity: entity representing this queue in the scheduler.
++ * @max_budget: maximum budget allowed from the feedback mechanism.
++ * @budget_timeout: budget expiration (in jiffies).
++ * @dispatched: number of requests on the dispatch list or inside driver.
++ * @flags: status flags.
++ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
++ * @seek_samples: number of seeks sampled
++ * @seek_total: sum of the distances of the seeks sampled
++ * @seek_mean: mean seek distance
++ * @last_request_pos: position of the last request enqueued
++ * @requests_within_timer: number of consecutive pairs of request completion
++ * and arrival, such that the queue becomes idle
++ * after the completion, but the next request arrives
++ * within an idle time slice; used only if the queue's
++ * IO_bound has been cleared.
++ * @pid: pid of the process owning the queue, used for logging purposes.
++ * @last_wr_start_finish: start time of the current weight-raising period if
++ * the @bfq-queue is being weight-raised, otherwise
++ * finish time of the last weight-raising period
++ * @wr_cur_max_time: current max raising time for this queue
++ * @soft_rt_next_start: minimum time instant such that, only if a new
++ * request is enqueued after this time instant in an
++ * idle @bfq_queue with no outstanding requests, then
++ * the task associated with the queue is deemed
++ * soft real-time (see the comments to the function
++ * bfq_bfqq_softrt_next_start()).
++ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
++ * idle to backlogged
++ * @service_from_backlogged: cumulative service received from the @bfq_queue
++ * since the last transition from idle to
++ * backlogged
++ *
++ * A bfq_queue is a leaf request queue; it can be associated with one or more
++ * io_contexts, if it is async or shared between cooperating processes. @cgroup
++ * holds a reference to the cgroup, to be sure that it does not disappear while
++ * a bfqq still references it (mostly to avoid races between request issuing and
++ * task migration followed by cgroup destruction).
++ * All the fields are protected by the queue lock of the containing bfqd.
++ */
++struct bfq_queue {
++ atomic_t ref;
++ struct bfq_data *bfqd;
++
++ /* fields for cooperating queues handling */
++ struct bfq_queue *new_bfqq;
++ struct rb_node pos_node;
++ struct rb_root *pos_root;
++
++ struct rb_root sort_list;
++ struct request *next_rq;
++ int queued[2];
++ int allocated[2];
++ int meta_pending;
++ struct list_head fifo;
++
++ struct bfq_entity entity;
++
++ unsigned long max_budget;
++ unsigned long budget_timeout;
++
++ int dispatched;
++
++ unsigned int flags;
++
++ struct list_head bfqq_list;
++
++ unsigned int seek_samples;
++ u64 seek_total;
++ sector_t seek_mean;
++ sector_t last_request_pos;
++
++ unsigned int requests_within_timer;
++
++ pid_t pid;
++
++ /* weight-raising fields */
++ unsigned long wr_cur_max_time;
++ unsigned long soft_rt_next_start;
++ unsigned long last_wr_start_finish;
++ unsigned int wr_coeff;
++ unsigned long last_idle_bklogged;
++ unsigned long service_from_backlogged;
++};
++
++/**
++ * struct bfq_ttime - per process thinktime stats.
++ * @ttime_total: total process thinktime
++ * @ttime_samples: number of thinktime samples
++ * @ttime_mean: average process thinktime
++ */
++struct bfq_ttime {
++ unsigned long last_end_request;
++
++ unsigned long ttime_total;
++ unsigned long ttime_samples;
++ unsigned long ttime_mean;
++};
++
++/**
++ * struct bfq_io_cq - per (request_queue, io_context) structure.
++ * @icq: associated io_cq structure
++ * @bfqq: array of two process queues, the sync and the async
++ * @ttime: associated @bfq_ttime struct
++ */
++struct bfq_io_cq {
++ struct io_cq icq; /* must be the first member */
++ struct bfq_queue *bfqq[2];
++ struct bfq_ttime ttime;
++ int ioprio;
++};
++
++enum bfq_device_speed {
++ BFQ_BFQD_FAST,
++ BFQ_BFQD_SLOW,
++};
++
++/**
++ * struct bfq_data - per device data structure.
++ * @queue: request queue for the managed device.
++ * @root_group: root bfq_group for the device.
++ * @rq_pos_tree: rbtree sorted by next_request position, used when
++ * determining if two or more queues have interleaving
++ * requests (see bfq_close_cooperator()).
++ * @active_numerous_groups: number of bfq_groups containing more than one
++ * active @bfq_entity.
++ * @queue_weights_tree: rbtree of weight counters of @bfq_queues, sorted by
++ * weight. Used to keep track of whether all @bfq_queues
++ * have the same weight. The tree contains one counter
++ * for each distinct weight associated to some active
++ * and not weight-raised @bfq_queue (see the comments to
++ * the functions bfq_weights_tree_[add|remove] for
++ * further details).
++ * @group_weights_tree: rbtree of non-queue @bfq_entity weight counters, sorted
++ * by weight. Used to keep track of whether all
++ * @bfq_groups have the same weight. The tree contains
++ * one counter for each distinct weight associated to
++ * some active @bfq_group (see the comments to the
++ * functions bfq_weights_tree_[add|remove] for further
++ * details).
++ * @busy_queues: number of bfq_queues containing requests (including the
++ * queue in service, even if it is idling).
++ * @busy_in_flight_queues: number of @bfq_queues containing pending or
++ * in-flight requests, plus the @bfq_queue in
++ * service, even if idle but waiting for the
++ * possible arrival of its next sync request. This
++ * field is updated only if the device is rotational,
++ * but used only if the device is also NCQ-capable.
++ * The reason why the field is updated also for non-
++ * NCQ-capable rotational devices is related to the
++ * fact that the value of @hw_tag may be set also
++ * later than when busy_in_flight_queues may need to
++ * be incremented for the first time(s). Taking also
++ * this possibility into account, to avoid unbalanced
++ * increments/decrements, would imply more overhead
++ * than just updating busy_in_flight_queues
++ * regardless of the value of @hw_tag.
++ * @const_seeky_busy_in_flight_queues: number of constantly-seeky @bfq_queues
++ * (that is, seeky queues that expired
++ * for budget timeout at least once)
++ * containing pending or in-flight
++ * requests, including the in-service
++ * @bfq_queue if constantly seeky. This
++ * field is updated only if the device
++ * is rotational, but used only if the
++ * device is also NCQ-capable (see the
++ * comments to @busy_in_flight_queues).
++ * @wr_busy_queues: number of weight-raised busy @bfq_queues.
++ * @queued: number of queued requests.
++ * @rq_in_driver: number of requests dispatched and waiting for completion.
++ * @sync_flight: number of sync requests in the driver.
++ * @max_rq_in_driver: max number of reqs in driver in the last
++ * @hw_tag_samples completed requests.
++ * @hw_tag_samples: nr of samples used to calculate hw_tag.
++ * @hw_tag: flag set to one if the driver is showing a queueing behavior.
++ * @budgets_assigned: number of budgets assigned.
++ * @idle_slice_timer: timer set when idling for the next sequential request
++ * from the queue in service.
++ * @unplug_work: delayed work to restart dispatching on the request queue.
++ * @in_service_queue: bfq_queue in service.
++ * @in_service_bic: bfq_io_cq (bic) associated with the @in_service_queue.
++ * @last_position: on-disk position of the last served request.
++ * @last_budget_start: beginning of the last budget.
++ * @last_idling_start: beginning of the last idle slice.
++ * @peak_rate: peak transfer rate observed for a budget.
++ * @peak_rate_samples: number of samples used to calculate @peak_rate.
++ * @bfq_max_budget: maximum budget allotted to a bfq_queue before
++ * rescheduling.
++ * @group_list: list of all the bfq_groups active on the device.
++ * @active_list: list of all the bfq_queues active on the device.
++ * @idle_list: list of all the bfq_queues idle on the device.
++ * @bfq_quantum: max number of requests dispatched per dispatch round.
++ * @bfq_fifo_expire: timeout for async/sync requests; when it expires
++ * requests are served in fifo order.
++ * @bfq_back_penalty: weight of backward seeks wrt forward ones.
++ * @bfq_back_max: maximum allowed backward seek.
++ * @bfq_slice_idle: maximum idling time.
++ * @bfq_user_max_budget: user-configured max budget value
++ * (0 for auto-tuning).
++ * @bfq_max_budget_async_rq: maximum budget (in nr of requests) allotted to
++ * async queues.
++ * @bfq_timeout: timeout for bfq_queues to consume their budget; used
++ * to prevent seeky queues from imposing long latencies on
++ * well-behaved ones (this also implies that seeky queues cannot
++ * receive guarantees in the service domain; after a timeout
++ * they are charged for the whole allocated budget, to try
++ * to preserve a behavior reasonably fair among them, but
++ * without service-domain guarantees).
++ * @bfq_coop_thresh: number of queue merges after which a @bfq_queue is
++ * no more granted any weight-raising.
++ * @bfq_failed_cooperations: number of consecutive failed cooperation
++ * chances after which weight-raising is restored
++ * to a queue subject to more than bfq_coop_thresh
++ * queue merges.
++ * @bfq_requests_within_timer: number of consecutive requests that must be
++ * issued within the idle time slice to
++ * re-enable idling for a queue that was marked as
++ * non-I/O-bound (see the definition of the
++ * IO_bound flag for further details).
++ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
++ * queue is multiplied
++ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
++ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
++ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
++ * may be reactivated for a queue (in jiffies)
++ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
++ * after which weight-raising may be
++ * reactivated for an already busy queue
++ * (in jiffies)
++ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
++ * in sectors per second
++ * @RT_prod: cached value of the product R*T used for computing the maximum
++ * duration of the weight raising automatically
++ * @device_speed: device-speed class for the low-latency heuristic
++ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
++ *
++ * All the fields are protected by the @queue lock.
++ */
++struct bfq_data {
++ struct request_queue *queue;
++
++ struct bfq_group *root_group;
++ struct rb_root rq_pos_tree;
++
++#ifdef CONFIG_CGROUP_BFQIO
++ int active_numerous_groups;
++#endif
++
++ struct rb_root queue_weights_tree;
++ struct rb_root group_weights_tree;
++
++ int busy_queues;
++ int busy_in_flight_queues;
++ int const_seeky_busy_in_flight_queues;
++ int wr_busy_queues;
++ int queued;
++ int rq_in_driver;
++ int sync_flight;
++
++ int max_rq_in_driver;
++ int hw_tag_samples;
++ int hw_tag;
++
++ int budgets_assigned;
++
++ struct timer_list idle_slice_timer;
++ struct work_struct unplug_work;
++
++ struct bfq_queue *in_service_queue;
++ struct bfq_io_cq *in_service_bic;
++
++ sector_t last_position;
++
++ ktime_t last_budget_start;
++ ktime_t last_idling_start;
++ int peak_rate_samples;
++ u64 peak_rate;
++ unsigned long bfq_max_budget;
++
++ struct hlist_head group_list;
++ struct list_head active_list;
++ struct list_head idle_list;
++
++ unsigned int bfq_quantum;
++ unsigned int bfq_fifo_expire[2];
++ unsigned int bfq_back_penalty;
++ unsigned int bfq_back_max;
++ unsigned int bfq_slice_idle;
++ u64 bfq_class_idle_last_service;
++
++ unsigned int bfq_user_max_budget;
++ unsigned int bfq_max_budget_async_rq;
++ unsigned int bfq_timeout[2];
++
++ unsigned int bfq_coop_thresh;
++ unsigned int bfq_failed_cooperations;
++ unsigned int bfq_requests_within_timer;
++
++ bool low_latency;
++
++ /* parameters of the low_latency heuristics */
++ unsigned int bfq_wr_coeff;
++ unsigned int bfq_wr_max_time;
++ unsigned int bfq_wr_rt_max_time;
++ unsigned int bfq_wr_min_idle_time;
++ unsigned long bfq_wr_min_inter_arr_async;
++ unsigned int bfq_wr_max_softrt_rate;
++ u64 RT_prod;
++ enum bfq_device_speed device_speed;
++
++ struct bfq_queue oom_bfqq;
++};
++
++enum bfqq_state_flags {
++ BFQ_BFQQ_FLAG_busy = 0, /* has requests or is in service */
++ BFQ_BFQQ_FLAG_wait_request, /* waiting for a request */
++ BFQ_BFQQ_FLAG_must_alloc, /* must be allowed rq alloc */
++ BFQ_BFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */
++ BFQ_BFQQ_FLAG_idle_window, /* slice idling enabled */
++ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
++ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
++ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
++ BFQ_BFQQ_FLAG_IO_bound, /*
++ * bfqq has timed-out at least once
++ * having consumed at most 2/10 of
++ * its budget
++ */
++ BFQ_BFQQ_FLAG_constantly_seeky, /*
++ * bfqq has proved to be slow and
++ * seeky until budget timeout
++ */
++ BFQ_BFQQ_FLAG_softrt_update, /*
++ * may need softrt-next-start
++ * update
++ */
++ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++};
++
++#define BFQ_BFQQ_FNS(name) \
++static inline void bfq_mark_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline void bfq_clear_bfqq_##name(struct bfq_queue *bfqq) \
++{ \
++ (bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name); \
++} \
++static inline int bfq_bfqq_##name(const struct bfq_queue *bfqq) \
++{ \
++ return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0; \
++}
++
++BFQ_BFQQ_FNS(busy);
++BFQ_BFQQ_FNS(wait_request);
++BFQ_BFQQ_FNS(must_alloc);
++BFQ_BFQQ_FNS(fifo_expire);
++BFQ_BFQQ_FNS(idle_window);
++BFQ_BFQQ_FNS(prio_changed);
++BFQ_BFQQ_FNS(sync);
++BFQ_BFQQ_FNS(budget_new);
++BFQ_BFQQ_FNS(IO_bound);
++BFQ_BFQQ_FNS(constantly_seeky);
++BFQ_BFQQ_FNS(coop);
++BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(softrt_update);
++#undef BFQ_BFQQ_FNS
++
++/* Logging facilities. */
++#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq%d " fmt, (bfqq)->pid, ##args)
++
++#define bfq_log(bfqd, fmt, args...) \
++ blk_add_trace_msg((bfqd)->queue, "bfq " fmt, ##args)
++
++/* Expiration reasons. */
++enum bfqq_expiration {
++ BFQ_BFQQ_TOO_IDLE = 0, /*
++ * queue has been idling for
++ * too long
++ */
++ BFQ_BFQQ_BUDGET_TIMEOUT, /* budget took too long to be used */
++ BFQ_BFQQ_BUDGET_EXHAUSTED, /* budget consumed */
++ BFQ_BFQQ_NO_MORE_REQUESTS, /* the queue has no more requests */
++};
++
++#ifdef CONFIG_CGROUP_BFQIO
++/**
++ * struct bfq_group - per (device, cgroup) data structure.
++ * @entity: schedulable entity to insert into the parent group sched_data.
++ * @sched_data: own sched_data, to contain child entities (they may be
++ * both bfq_queues and bfq_groups).
++ * @group_node: node to be inserted into the bfqio_cgroup->group_data
++ * list of the containing cgroup's bfqio_cgroup.
++ * @bfqd_node: node to be inserted into the @bfqd->group_list list
++ * of the groups active on the same device; used for cleanup.
++ * @bfqd: the bfq_data for the device this group acts upon.
++ * @async_bfqq: array of async queues for all the tasks belonging to
++ * the group, one queue per ioprio value per ioprio_class,
++ * except for the idle class that has only one queue.
++ * @async_idle_bfqq: async queue for the idle class (ioprio is ignored).
++ * @my_entity: pointer to @entity, %NULL for the toplevel group; used
++ * to avoid too many special cases during group creation/
++ * migration.
++ * @active_entities: number of active entities belonging to the group;
++ * unused for the root group. Used to know whether there
++ * are groups with more than one active @bfq_entity
++ * (see the comments to the function
++ * bfq_bfqq_must_not_expire()).
++ *
++ * Each (device, cgroup) pair has its own bfq_group, i.e., for each cgroup
++ * there is a set of bfq_groups, each one collecting the lower-level
++ * entities belonging to the group that are acting on the same device.
++ *
++ * Locking works as follows:
++ * o @group_node is protected by the bfqio_cgroup lock, and is accessed
++ * via RCU from its readers.
++ * o @bfqd is protected by the queue lock, RCU is used to access it
++ * from the readers.
++ * o All the other fields are protected by the @bfqd queue lock.
++ */
++struct bfq_group {
++ struct bfq_entity entity;
++ struct bfq_sched_data sched_data;
++
++ struct hlist_node group_node;
++ struct hlist_node bfqd_node;
++
++ void *bfqd;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++
++ struct bfq_entity *my_entity;
++
++ int active_entities;
++};
++
++/**
++ * struct bfqio_cgroup - bfq cgroup data structure.
++ * @css: subsystem state for bfq in the containing cgroup.
++ * @online: flag marked when the subsystem is inserted.
++ * @weight: cgroup weight.
++ * @ioprio: cgroup ioprio.
++ * @ioprio_class: cgroup ioprio_class.
++ * @lock: spinlock that protects @ioprio, @ioprio_class and @group_data.
++ * @group_data: list containing the bfq_group belonging to this cgroup.
++ *
++ * @group_data is accessed using RCU, with @lock protecting the updates,
++ * @ioprio and @ioprio_class are protected by @lock.
++ */
++struct bfqio_cgroup {
++ struct cgroup_subsys_state css;
++ bool online;
++
++ unsigned short weight, ioprio, ioprio_class;
++
++ spinlock_t lock;
++ struct hlist_head group_data;
++};
++#else
++struct bfq_group {
++ struct bfq_sched_data sched_data;
++
++ struct bfq_queue *async_bfqq[2][IOPRIO_BE_NR];
++ struct bfq_queue *async_idle_bfqq;
++};
++#endif
++
++static inline struct bfq_service_tree *
++bfq_entity_service_tree(struct bfq_entity *entity)
++{
++ struct bfq_sched_data *sched_data = entity->sched_data;
++ unsigned int idx = entity->ioprio_class - 1;
++
++ BUG_ON(idx >= BFQ_IOPRIO_CLASSES);
++ BUG_ON(sched_data == NULL);
++
++ return sched_data->service_tree + idx;
++}
++
++static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
++ int is_sync)
++{
++ return bic->bfqq[!!is_sync];
++}
++
++static inline void bic_set_bfqq(struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, int is_sync)
++{
++ bic->bfqq[!!is_sync] = bfqq;
++}
++
++static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
++{
++ return bic->icq.q->elevator->elevator_data;
++}
++
++/**
++ * bfq_get_bfqd_locked - get a lock to a bfqd using an RCU-protected pointer.
++ * @ptr: a pointer to a bfqd.
++ * @flags: storage for the flags to be saved.
++ *
++ * This function allows bfqg->bfqd to be protected by the
++ * queue lock of the bfqd they reference; the pointer is dereferenced
++ * under RCU, so the storage for bfqd is assured to be safe as long
++ * as the RCU read side critical section does not end. After the
++ * bfqd->queue->queue_lock is taken the pointer is rechecked, to be
++ * sure that no other writer accessed it. If we raced with a writer,
++ * the function returns NULL, with the queue unlocked, otherwise it
++ * returns the dereferenced pointer, with the queue locked.
++ */
++static inline struct bfq_data *bfq_get_bfqd_locked(void **ptr,
++ unsigned long *flags)
++{
++ struct bfq_data *bfqd;
++
++ rcu_read_lock();
++ bfqd = rcu_dereference(*(struct bfq_data **)ptr);
++
++ if (bfqd != NULL) {
++ spin_lock_irqsave(bfqd->queue->queue_lock, *flags);
++ if (*ptr == bfqd)
++ goto out;
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++ }
++
++ bfqd = NULL;
++out:
++ rcu_read_unlock();
++ return bfqd;
++}
++
++static inline void bfq_put_bfqd_unlock(struct bfq_data *bfqd,
++ unsigned long *flags)
++{
++ spin_unlock_irqrestore(bfqd->queue->queue_lock, *flags);
++}
++
++static void bfq_changed_ioprio(struct bfq_io_cq *bic);
++static void bfq_put_queue(struct bfq_queue *bfqq);
++static void bfq_dispatch_insert(struct request_queue *q, struct request *rq);
++static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
++ struct bfq_group *bfqg, int is_sync,
++ struct bfq_io_cq *bic, gfp_t gfp_mask);
++static void bfq_end_wr_async_queues(struct bfq_data *bfqd,
++ struct bfq_group *bfqg);
++static void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg);
++static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
++
++#endif /* _BFQ_H */
+--
+2.0.3
+
diff --git a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
new file mode 100644
index 0000000..e606f5d
--- /dev/null
+++ b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
@@ -0,0 +1,1188 @@
+From 5b290be286aa74051b4b77a216032b771ceadd23 Mon Sep 17 00:00:00 2001
+From: Mauro Andreolini <mauro.andreolini@unimore.it>
+Date: Wed, 18 Jun 2014 17:38:07 +0200
+Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r5 for
+ 3.16.0
+
+A set of processes may happen to perform interleaved reads, i.e., requests
+whose union would give rise to a sequential read pattern. There are two
+typical cases: in the first case, processes read fixed-size chunks of
+data at a fixed distance from each other, while in the second case processes
+may read variable-size chunks at variable distances. The latter case occurs
+for example with QEMU, which splits the I/O generated by the guest into
+multiple chunks, and lets these chunks be served by a pool of cooperating
+processes, iteratively assigning the next chunk of I/O to the first
+available process. CFQ uses actual queue merging for the first type of
+processes, whereas it uses preemption to get a sequential read pattern out
+of the read requests performed by the second type of processes. In the end
+it uses two different mechanisms to achieve the same goal: boosting the
+throughput with interleaved I/O.
+
+This patch introduces Early Queue Merge (EQM), a unified mechanism to get a
+sequential read pattern with both types of processes. The main idea is
+checking newly arrived requests against the next request of the active queue
+both in case of actual request insert and in case of request merge. By doing
+so, both the types of processes can be handled by just merging their queues.
+EQM is then simpler and more compact than the pair of mechanisms used in
+CFQ.
+
+Finally, EQM also preserves the typical low-latency properties of BFQ, by
+properly restoring the weight-raising state of a queue when it gets back to
+a non-merged state.
+
+Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
+Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
+Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
+---
+ block/bfq-iosched.c | 736 ++++++++++++++++++++++++++++++++++++----------------
+ block/bfq-sched.c | 28 --
+ block/bfq.h | 46 +++-
+ 3 files changed, 556 insertions(+), 254 deletions(-)
+
+diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
+index 0a0891b..d1d8e67 100644
+--- a/block/bfq-iosched.c
++++ b/block/bfq-iosched.c
+@@ -571,6 +571,57 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
+ return dur;
+ }
+
++static inline unsigned
++bfq_bfqq_cooperations(struct bfq_queue *bfqq)
++{
++ return bfqq->bic ? bfqq->bic->cooperations : 0;
++}
++
++static inline void
++bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
++{
++ if (bic->saved_idle_window)
++ bfq_mark_bfqq_idle_window(bfqq);
++ else
++ bfq_clear_bfqq_idle_window(bfqq);
++ if (bic->saved_IO_bound)
++ bfq_mark_bfqq_IO_bound(bfqq);
++ else
++ bfq_clear_bfqq_IO_bound(bfqq);
++ if (bic->wr_time_left && bfqq->bfqd->low_latency &&
++ bic->cooperations < bfqq->bfqd->bfq_coop_thresh) {
++ /*
++ * Start a weight raising period with the duration given by
++ * the raising_time_left snapshot.
++ */
++ if (bfq_bfqq_busy(bfqq))
++ bfqq->bfqd->wr_busy_queues++;
++ bfqq->wr_coeff = bfqq->bfqd->bfq_wr_coeff;
++ bfqq->wr_cur_max_time = bic->wr_time_left;
++ bfqq->last_wr_start_finish = jiffies;
++ bfqq->entity.ioprio_changed = 1;
++ }
++ /*
++ * Clear wr_time_left to prevent bfq_bfqq_save_state() from
++ * getting confused about the queue's need of a weight-raising
++ * period.
++ */
++ bic->wr_time_left = 0;
++}
++
++/*
++ * Must be called with the queue_lock held.
++ */
++static int bfqq_process_refs(struct bfq_queue *bfqq)
++{
++ int process_refs, io_refs;
++
++ io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
++ process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
++ BUG_ON(process_refs < 0);
++ return process_refs;
++}
++
+ static void bfq_add_request(struct request *rq)
+ {
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
+@@ -602,8 +653,11 @@ static void bfq_add_request(struct request *rq)
+
+ if (!bfq_bfqq_busy(bfqq)) {
+ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ bfq_bfqq_cooperations(bfqq) < bfqd->bfq_coop_thresh &&
+ time_is_before_jiffies(bfqq->soft_rt_next_start);
+- idle_for_long_time = time_is_before_jiffies(
++ idle_for_long_time = bfq_bfqq_cooperations(bfqq) <
++ bfqd->bfq_coop_thresh &&
++ time_is_before_jiffies(
+ bfqq->budget_timeout +
+ bfqd->bfq_wr_min_idle_time);
+ entity->budget = max_t(unsigned long, bfqq->max_budget,
+@@ -624,11 +678,20 @@ static void bfq_add_request(struct request *rq)
+ if (!bfqd->low_latency)
+ goto add_bfqq_busy;
+
++ if (bfq_bfqq_just_split(bfqq))
++ goto set_ioprio_changed;
++
+ /*
+- * If the queue is not being boosted and has been idle
+- * for enough time, start a weight-raising period
++ * If the queue:
++ * - is not being boosted,
++ * - has been idle for enough time,
++ * - is not a sync queue or is linked to a bfq_io_cq (it is
++ * shared "for its nature" or it is not shared and its
++ * requests have not been redirected to a shared queue)
++ * start a weight-raising period.
+ */
+- if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
++ (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+@@ -642,9 +705,11 @@ static void bfq_add_request(struct request *rq)
+ } else if (old_wr_coeff > 1) {
+ if (idle_for_long_time)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+- else if (bfqq->wr_cur_max_time ==
+- bfqd->bfq_wr_rt_max_time &&
+- !soft_rt) {
++ else if (bfq_bfqq_cooperations(bfqq) >=
++ bfqd->bfq_coop_thresh ||
++ (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt)) {
+ bfqq->wr_coeff = 1;
+ bfq_log_bfqq(bfqd, bfqq,
+ "wrais ending at %lu, rais_max_time %u",
+@@ -660,18 +725,18 @@ static void bfq_add_request(struct request *rq)
+ /*
+ *
+ * The remaining weight-raising time is lower
+- * than bfqd->bfq_wr_rt_max_time, which
+- * means that the application is enjoying
+- * weight raising either because deemed soft-
+- * rt in the near past, or because deemed
+- * interactive a long ago. In both cases,
+- * resetting now the current remaining weight-
+- * raising time for the application to the
+- * weight-raising duration for soft rt
+- * applications would not cause any latency
+- * increase for the application (as the new
+- * duration would be higher than the remaining
+- * time).
++ * than bfqd->bfq_wr_rt_max_time, which means
++ * that the application is enjoying weight
++ * raising either because deemed soft-rt in
++ * the near past, or because deemed interactive
++ * long ago.
++ * In both cases, resetting now the current
++ * remaining weight-raising time for the
++ * application to the weight-raising duration
++ * for soft rt applications would not cause any
++ * latency increase for the application (as the
++ * new duration would be higher than the
++ * remaining time).
+ *
+ * In addition, the application is now meeting
+ * the requirements for being deemed soft rt.
+@@ -706,6 +771,7 @@ static void bfq_add_request(struct request *rq)
+ bfqd->bfq_wr_rt_max_time;
+ }
+ }
++set_ioprio_changed:
+ if (old_wr_coeff != bfqq->wr_coeff)
+ entity->ioprio_changed = 1;
+ add_bfqq_busy:
+@@ -918,90 +984,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
+ spin_unlock_irq(bfqd->queue->queue_lock);
+ }
+
+-static int bfq_allow_merge(struct request_queue *q, struct request *rq,
+- struct bio *bio)
++static inline sector_t bfq_io_struct_pos(void *io_struct, bool request)
+ {
+- struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_io_cq *bic;
+- struct bfq_queue *bfqq;
+-
+- /*
+- * Disallow merge of a sync bio into an async request.
+- */
+- if (bfq_bio_sync(bio) && !rq_is_sync(rq))
+- return 0;
+-
+- /*
+- * Lookup the bfqq that this bio will be queued with. Allow
+- * merge only if rq is queued there.
+- * Queue lock is held here.
+- */
+- bic = bfq_bic_lookup(bfqd, current->io_context);
+- if (bic == NULL)
+- return 0;
+-
+- bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
+- return bfqq == RQ_BFQQ(rq);
+-}
+-
+-static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (bfqq != NULL) {
+- bfq_mark_bfqq_must_alloc(bfqq);
+- bfq_mark_bfqq_budget_new(bfqq);
+- bfq_clear_bfqq_fifo_expire(bfqq);
+-
+- bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
+-
+- bfq_log_bfqq(bfqd, bfqq,
+- "set_in_service_queue, cur-budget = %lu",
+- bfqq->entity.budget);
+- }
+-
+- bfqd->in_service_queue = bfqq;
+-}
+-
+-/*
+- * Get and set a new queue for service.
+- */
+-static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- if (!bfqq)
+- bfqq = bfq_get_next_queue(bfqd);
++ if (request)
++ return blk_rq_pos(io_struct);
+ else
+- bfq_get_next_queue_forced(bfqd, bfqq);
+-
+- __bfq_set_in_service_queue(bfqd, bfqq);
+- return bfqq;
++ return ((struct bio *)io_struct)->bi_iter.bi_sector;
+ }
+
+-static inline sector_t bfq_dist_from_last(struct bfq_data *bfqd,
+- struct request *rq)
++static inline sector_t bfq_dist_from(sector_t pos1,
++ sector_t pos2)
+ {
+- if (blk_rq_pos(rq) >= bfqd->last_position)
+- return blk_rq_pos(rq) - bfqd->last_position;
++ if (pos1 >= pos2)
++ return pos1 - pos2;
+ else
+- return bfqd->last_position - blk_rq_pos(rq);
++ return pos2 - pos1;
+ }
+
+-/*
+- * Return true if bfqq has no request pending and rq is close enough to
+- * bfqd->last_position, or if rq is closer to bfqd->last_position than
+- * bfqq->next_rq
+- */
+-static inline int bfq_rq_close(struct bfq_data *bfqd, struct request *rq)
++static inline int bfq_rq_close_to_sector(void *io_struct, bool request,
++ sector_t sector)
+ {
+- return bfq_dist_from_last(bfqd, rq) <= BFQQ_SEEK_THR;
++ return bfq_dist_from(bfq_io_struct_pos(io_struct, request), sector) <=
++ BFQQ_SEEK_THR;
+ }
+
+-static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
++static struct bfq_queue *bfqq_close(struct bfq_data *bfqd, sector_t sector)
+ {
+ struct rb_root *root = &bfqd->rq_pos_tree;
+ struct rb_node *parent, *node;
+ struct bfq_queue *__bfqq;
+- sector_t sector = bfqd->last_position;
+
+ if (RB_EMPTY_ROOT(root))
+ return NULL;
+@@ -1020,7 +1031,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ * next_request position).
+ */
+ __bfqq = rb_entry(parent, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ if (blk_rq_pos(__bfqq->next_rq) < sector)
+@@ -1031,7 +1042,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ return NULL;
+
+ __bfqq = rb_entry(node, struct bfq_queue, pos_node);
+- if (bfq_rq_close(bfqd, __bfqq->next_rq))
++ if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))
+ return __bfqq;
+
+ return NULL;
+@@ -1040,14 +1051,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+ /*
+ * bfqd - obvious
+ * cur_bfqq - passed in so that we don't decide that the current queue
+- * is closely cooperating with itself.
+- *
+- * We are assuming that cur_bfqq has dispatched at least one request,
+- * and that bfqd->last_position reflects a position on the disk associated
+- * with the I/O issued by cur_bfqq.
++ * is closely cooperating with itself
++ * sector - used as a reference point to search for a close queue
+ */
+ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+- struct bfq_queue *cur_bfqq)
++ struct bfq_queue *cur_bfqq,
++ sector_t sector)
+ {
+ struct bfq_queue *bfqq;
+
+@@ -1067,7 +1076,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ * working closely on the same area of the disk. In that case,
+ * we can group them together and don't waste time idling.
+ */
+- bfqq = bfqq_close(bfqd);
++ bfqq = bfqq_close(bfqd, sector);
+ if (bfqq == NULL || bfqq == cur_bfqq)
+ return NULL;
+
+@@ -1094,6 +1103,305 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+ return bfqq;
+ }
+
++static struct bfq_queue *
++bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ int process_refs, new_process_refs;
++ struct bfq_queue *__bfqq;
++
++ /*
++ * If there are no process references on the new_bfqq, then it is
++ * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
++ * may have dropped their last reference (not just their last process
++ * reference).
++ */
++ if (!bfqq_process_refs(new_bfqq))
++ return NULL;
++
++ /* Avoid a circular list and skip interim queue merges. */
++ while ((__bfqq = new_bfqq->new_bfqq)) {
++ if (__bfqq == bfqq)
++ return NULL;
++ new_bfqq = __bfqq;
++ }
++
++ process_refs = bfqq_process_refs(bfqq);
++ new_process_refs = bfqq_process_refs(new_bfqq);
++ /*
++ * If the process for the bfqq has gone away, there is no
++ * sense in merging the queues.
++ */
++ if (process_refs == 0 || new_process_refs == 0)
++ return NULL;
++
++ bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
++ new_bfqq->pid);
++
++ /*
++ * Merging is just a redirection: the requests of the process
++ * owning one of the two queues are redirected to the other queue.
++ * The latter queue, in its turn, is set as shared if this is the
++ * first time that the requests of some process are redirected to
++ * it.
++ *
++ * We redirect bfqq to new_bfqq and not the opposite, because we
++ * are in the context of the process owning bfqq, hence we have
++ * the io_cq of this process. So we can immediately configure this
++ * io_cq to redirect the requests of the process to new_bfqq.
++ *
++ * NOTE, even if new_bfqq coincides with the in-service queue, the
++ * io_cq of new_bfqq is not available, because, if the in-service
++ * queue is shared, bfqd->in_service_bic may not point to the
++ * io_cq of the in-service queue.
++ * Redirecting the requests of the process owning bfqq to the
++ * currently in-service queue is in any case the best option, as
++ * we feed the in-service queue with new requests close to the
++ * last request served and, by doing so, hopefully increase the
++ * throughput.
++ */
++ bfqq->new_bfqq = new_bfqq;
++ atomic_add(process_refs, &new_bfqq->ref);
++ return new_bfqq;
++}
++
++/*
++ * Attempt to schedule a merge of bfqq with the currently in-service queue
++ * or with a close queue among the scheduled queues.
++ * Return NULL if no merge was scheduled, a pointer to the shared bfq_queue
++ * structure otherwise.
++ */
++static struct bfq_queue *
++bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ void *io_struct, bool request)
++{
++ struct bfq_queue *in_service_bfqq, *new_bfqq;
++
++ if (bfqq->new_bfqq)
++ return bfqq->new_bfqq;
++
++ if (!io_struct)
++ return NULL;
++
++ in_service_bfqq = bfqd->in_service_queue;
++
++ if (in_service_bfqq == NULL || in_service_bfqq == bfqq ||
++ !bfqd->in_service_bic)
++ goto check_scheduled;
++
++ if (bfq_class_idle(in_service_bfqq) || bfq_class_idle(bfqq))
++ goto check_scheduled;
++
++ if (bfq_class_rt(in_service_bfqq) != bfq_class_rt(bfqq))
++ goto check_scheduled;
++
++ if (in_service_bfqq->entity.parent != bfqq->entity.parent)
++ goto check_scheduled;
++
++ if (bfq_rq_close_to_sector(io_struct, request, bfqd->last_position) &&
++ bfq_bfqq_sync(in_service_bfqq) && bfq_bfqq_sync(bfqq)) {
++ new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);
++ if (new_bfqq != NULL)
++ return new_bfqq; /* Merge with in-service queue */
++ }
++
++ /*
++ * Check whether there is a cooperator among currently scheduled
++ * queues. The only thing we need is that the bio/request is not
++ * NULL, as we need it to establish whether a cooperator exists.
++ */
++check_scheduled:
++ new_bfqq = bfq_close_cooperator(bfqd, bfqq,
++ bfq_io_struct_pos(io_struct, request));
++ if (new_bfqq)
++ return bfq_setup_merge(bfqq, new_bfqq);
++
++ return NULL;
++}
++
++static inline void
++bfq_bfqq_save_state(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic == NULL, the queue is already shared or its requests
++ * have already been redirected to a shared queue; both idle window
++ * and weight raising state have already been saved. Do nothing.
++ */
++ if (bfqq->bic == NULL)
++ return;
++ if (bfqq->bic->wr_time_left)
++ /*
++ * This is the queue of a just-started process, and would
++ * deserve weight raising: we set wr_time_left to the full
++ * weight-raising duration to trigger weight-raising when
++ * and if the queue is split and the first request of the
++ * queue is enqueued.
++ */
++ bfqq->bic->wr_time_left = bfq_wr_duration(bfqq->bfqd);
++ else if (bfqq->wr_coeff > 1) {
++ unsigned long wr_duration =
++ jiffies - bfqq->last_wr_start_finish;
++ /*
++ * It may happen that a queue's weight raising period lasts
++ * longer than its wr_cur_max_time, as weight raising is
++ * handled only when a request is enqueued or dispatched (it
++ * does not use any timer). If the weight raising period is
++ * about to end, don't save it.
++ */
++ if (bfqq->wr_cur_max_time <= wr_duration)
++ bfqq->bic->wr_time_left = 0;
++ else
++ bfqq->bic->wr_time_left =
++ bfqq->wr_cur_max_time - wr_duration;
++ /*
++ * The bfq_queue is becoming shared or the requests of the
++ * process owning the queue are being redirected to a shared
++ * queue. Stop the weight raising period of the queue, as in
++ * both cases it should not be owned by an interactive or
++ * soft real-time application.
++ */
++ bfq_bfqq_end_wr(bfqq);
++ } else
++ bfqq->bic->wr_time_left = 0;
++ bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
++ bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
++ bfqq->bic->cooperations++;
++ bfqq->bic->failed_cooperations = 0;
++}
++
++static inline void
++bfq_get_bic_reference(struct bfq_queue *bfqq)
++{
++ /*
++ * If bfqq->bic has a non-NULL value, the bic to which it belongs
++ * is about to begin using a shared bfq_queue.
++ */
++ if (bfqq->bic)
++ atomic_long_inc(&bfqq->bic->icq.ioc->refcount);
++}
++
++static void
++bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
++ struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
++{
++ bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
++ (long unsigned)new_bfqq->pid);
++ /* Save weight raising and idle window of the merged queues */
++ bfq_bfqq_save_state(bfqq);
++ bfq_bfqq_save_state(new_bfqq);
++ if (bfq_bfqq_IO_bound(bfqq))
++ bfq_mark_bfqq_IO_bound(new_bfqq);
++ bfq_clear_bfqq_IO_bound(bfqq);
++ /*
++ * Grab a reference to the bic, to prevent it from being destroyed
++ * before being possibly touched by a bfq_split_bfqq().
++ */
++ bfq_get_bic_reference(bfqq);
++ bfq_get_bic_reference(new_bfqq);
++ /*
++ * Merge queues (that is, let bic redirect its requests to new_bfqq)
++ */
++ bic_set_bfqq(bic, new_bfqq, 1);
++ bfq_mark_bfqq_coop(new_bfqq);
++ /*
++ * new_bfqq now belongs to at least two bics (it is a shared queue):
++ * set new_bfqq->bic to NULL. bfqq either:
++ * - does not belong to any bic any more, and hence bfqq->bic must
++ * be set to NULL, or
++ * - is a queue whose owning bics have already been redirected to a
++ * different queue, hence the queue is destined to not belong to
++ * any bic soon and bfqq->bic is already NULL (therefore the next
++ * assignment causes no harm).
++ */
++ new_bfqq->bic = NULL;
++ bfqq->bic = NULL;
++ bfq_put_queue(bfqq);
++}
++
++static inline void bfq_bfqq_increase_failed_cooperations(struct bfq_queue *bfqq)
++{
++ struct bfq_io_cq *bic = bfqq->bic;
++ struct bfq_data *bfqd = bfqq->bfqd;
++
++ if (bic && bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh) {
++ bic->failed_cooperations++;
++ if (bic->failed_cooperations >= bfqd->bfq_failed_cooperations)
++ bic->cooperations = 0;
++ }
++}
++
++static int bfq_allow_merge(struct request_queue *q, struct request *rq,
++ struct bio *bio)
++{
++ struct bfq_data *bfqd = q->elevator->elevator_data;
++ struct bfq_io_cq *bic;
++ struct bfq_queue *bfqq, *new_bfqq;
++
++ /*
++ * Disallow merge of a sync bio into an async request.
++ */
++ if (bfq_bio_sync(bio) && !rq_is_sync(rq))
++ return 0;
++
++ /*
++ * Lookup the bfqq that this bio will be queued with. Allow
++ * merge only if rq is queued there.
++ * Queue lock is held here.
++ */
++ bic = bfq_bic_lookup(bfqd, current->io_context);
++ if (bic == NULL)
++ return 0;
++
++ bfqq = bic_to_bfqq(bic, bfq_bio_sync(bio));
++ /*
++ * We take advantage of this function to perform an early merge
++ * of the queues of possible cooperating processes.
++ */
++ if (bfqq != NULL) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false);
++ if (new_bfqq != NULL) {
++ bfq_merge_bfqqs(bfqd, bic, bfqq, new_bfqq);
++ /*
++ * If we get here, the bio will be queued in the
++ * shared queue, i.e., new_bfqq, so use new_bfqq
++ * to decide whether bio and rq can be merged.
++ */
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
++ return bfqq == RQ_BFQQ(rq);
++}
++
++static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ if (bfqq != NULL) {
++ bfq_mark_bfqq_must_alloc(bfqq);
++ bfq_mark_bfqq_budget_new(bfqq);
++ bfq_clear_bfqq_fifo_expire(bfqq);
++
++ bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;
++
++ bfq_log_bfqq(bfqd, bfqq,
++ "set_in_service_queue, cur-budget = %lu",
++ bfqq->entity.budget);
++ }
++
++ bfqd->in_service_queue = bfqq;
++}
++
++/*
++ * Get and set a new queue for service.
++ */
++static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)
++{
++ struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);
++
++ __bfq_set_in_service_queue(bfqd, bfqq);
++ return bfqq;
++}
++
+ /*
+ * If enough samples have been computed, return the current max budget
+ * stored in bfqd, which is dynamically updated according to the
+@@ -1237,63 +1545,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+ return rq;
+ }
+
+-/*
+- * Must be called with the queue_lock held.
+- */
+-static int bfqq_process_refs(struct bfq_queue *bfqq)
+-{
+- int process_refs, io_refs;
+-
+- io_refs = bfqq->allocated[READ] + bfqq->allocated[WRITE];
+- process_refs = atomic_read(&bfqq->ref) - io_refs - bfqq->entity.on_st;
+- BUG_ON(process_refs < 0);
+- return process_refs;
+-}
+-
+-static void bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)
+-{
+- int process_refs, new_process_refs;
+- struct bfq_queue *__bfqq;
+-
+- /*
+- * If there are no process references on the new_bfqq, then it is
+- * unsafe to follow the ->new_bfqq chain as other bfqq's in the chain
+- * may have dropped their last reference (not just their last process
+- * reference).
+- */
+- if (!bfqq_process_refs(new_bfqq))
+- return;
+-
+- /* Avoid a circular list and skip interim queue merges. */
+- while ((__bfqq = new_bfqq->new_bfqq)) {
+- if (__bfqq == bfqq)
+- return;
+- new_bfqq = __bfqq;
+- }
+-
+- process_refs = bfqq_process_refs(bfqq);
+- new_process_refs = bfqq_process_refs(new_bfqq);
+- /*
+- * If the process for the bfqq has gone away, there is no
+- * sense in merging the queues.
+- */
+- if (process_refs == 0 || new_process_refs == 0)
+- return;
+-
+- /*
+- * Merge in the direction of the lesser amount of work.
+- */
+- if (new_process_refs >= process_refs) {
+- bfqq->new_bfqq = new_bfqq;
+- atomic_add(process_refs, &new_bfqq->ref);
+- } else {
+- new_bfqq->new_bfqq = bfqq;
+- atomic_add(new_process_refs, &bfqq->ref);
+- }
+- bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",
+- new_bfqq->pid);
+-}
+-
+ static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
+ {
+ struct bfq_entity *entity = &bfqq->entity;
+@@ -2011,7 +2262,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+ */
+ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ {
+- struct bfq_queue *bfqq, *new_bfqq = NULL;
++ struct bfq_queue *bfqq;
+ struct request *next_rq;
+ enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
+
+@@ -2021,17 +2272,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+
+ bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
+
+- /*
+- * If another queue has a request waiting within our mean seek
+- * distance, let it run. The expire code will check for close
+- * cooperators and put the close queue at the front of the
+- * service tree. If possible, merge the expiring queue with the
+- * new bfqq.
+- */
+- new_bfqq = bfq_close_cooperator(bfqd, bfqq);
+- if (new_bfqq != NULL && bfqq->new_bfqq == NULL)
+- bfq_setup_merge(bfqq, new_bfqq);
+-
+ if (bfq_may_expire_for_budg_timeout(bfqq) &&
+ !timer_pending(&bfqd->idle_slice_timer) &&
+ !bfq_bfqq_must_idle(bfqq))
+@@ -2070,10 +2310,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ bfq_clear_bfqq_wait_request(bfqq);
+ del_timer(&bfqd->idle_slice_timer);
+ }
+- if (new_bfqq == NULL)
+- goto keep_queue;
+- else
+- goto expire;
++ goto keep_queue;
+ }
+ }
+
+@@ -2082,40 +2319,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+ * in flight (possibly waiting for a completion) or is idling for a
+ * new request, then keep it.
+ */
+- if (new_bfqq == NULL && (timer_pending(&bfqd->idle_slice_timer) ||
+- (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq)))) {
++ if (timer_pending(&bfqd->idle_slice_timer) ||
++ (bfqq->dispatched != 0 && bfq_bfqq_must_not_expire(bfqq))) {
+ bfqq = NULL;
+ goto keep_queue;
+- } else if (new_bfqq != NULL && timer_pending(&bfqd->idle_slice_timer)) {
+- /*
+- * Expiring the queue because there is a close cooperator,
+- * cancel timer.
+- */
+- bfq_clear_bfqq_wait_request(bfqq);
+- del_timer(&bfqd->idle_slice_timer);
+ }
+
+ reason = BFQ_BFQQ_NO_MORE_REQUESTS;
+ expire:
+ bfq_bfqq_expire(bfqd, bfqq, 0, reason);
+ new_queue:
+- bfqq = bfq_set_in_service_queue(bfqd, new_bfqq);
++ bfqq = bfq_set_in_service_queue(bfqd);
+ bfq_log(bfqd, "select_queue: new queue %d returned",
+ bfqq != NULL ? bfqq->pid : 0);
+ keep_queue:
+ return bfqq;
+ }
+
+-static void bfq_update_wr_data(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
++static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+ {
+- if (bfqq->wr_coeff > 1) { /* queue is being boosted */
+- struct bfq_entity *entity = &bfqq->entity;
+-
++ struct bfq_entity *entity = &bfqq->entity;
++ if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */
+ bfq_log_bfqq(bfqd, bfqq,
+ "raising period dur %u/%u msec, old coeff %u, w %d(%d)",
+- jiffies_to_msecs(jiffies -
+- bfqq->last_wr_start_finish),
++ jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),
+ jiffies_to_msecs(bfqq->wr_cur_max_time),
+ bfqq->wr_coeff,
+ bfqq->entity.weight, bfqq->entity.orig_weight);
+@@ -2124,11 +2351,15 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ entity->orig_weight * bfqq->wr_coeff);
+ if (entity->ioprio_changed)
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
++
+ /*
+ * If too much time has elapsed from the beginning
+- * of this weight-raising, stop it.
++ * of this weight-raising period, or the queue has
++ * exceeded the acceptable number of cooperations,
++ * stop it.
+ */
+- if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ if (bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
++ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfqq->last_wr_start_finish = jiffies;
+ bfq_log_bfqq(bfqd, bfqq,
+@@ -2136,11 +2367,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+ bfqq->last_wr_start_finish,
+ jiffies_to_msecs(bfqq->wr_cur_max_time));
+ bfq_bfqq_end_wr(bfqq);
+- __bfq_entity_update_weight_prio(
+- bfq_entity_service_tree(entity),
+- entity);
+ }
+ }
++ /* Update weight both if it must be raised and if it must be lowered */
++ if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))
++ __bfq_entity_update_weight_prio(
++ bfq_entity_service_tree(entity),
++ entity);
+ }
+
+ /*
+@@ -2377,6 +2610,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
+ struct bfq_io_cq *bic = icq_to_bic(icq);
+
+ bic->ttime.last_end_request = jiffies;
++ /*
++ * A newly created bic indicates that the process has just
++ * started doing I/O, and is probably mapping into memory its
++ * executable and libraries: it definitely needs weight raising.
++ * There is however the possibility that the process performs,
++ * for a while, I/O close to some other process. EQM intercepts
++ * this behavior and may merge the queue corresponding to the
++ * process with some other queue, BEFORE the weight of the queue
++ * is raised. Merged queues are not weight-raised (they are assumed
++ * to belong to processes that benefit only from high throughput).
++ * If the merge is basically the consequence of an accident, then
++ * the queue will be split soon and will get back its old weight.
++ * It is then important to write down somewhere that this queue
++ * does need weight raising, even if it did not make it to get its
++ * weight raised before being merged. To this purpose, we overload
++ * the field raising_time_left and assign 1 to it, to mark the queue
++ * as needing weight raising.
++ */
++ bic->wr_time_left = 1;
+ }
+
+ static void bfq_exit_icq(struct io_cq *icq)
+@@ -2390,6 +2642,13 @@ static void bfq_exit_icq(struct io_cq *icq)
+ }
+
+ if (bic->bfqq[BLK_RW_SYNC]) {
++ /*
++ * If the bic is using a shared queue, put the reference
++ * taken on the io_context when the bic started using a
++ * shared bfq_queue.
++ */
++ if (bfq_bfqq_coop(bic->bfqq[BLK_RW_SYNC]))
++ put_io_context(icq->ioc);
+ bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
+ bic->bfqq[BLK_RW_SYNC] = NULL;
+ }
+@@ -2678,6 +2937,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
+ if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
+ return;
+
++ /* Idle window just restored, statistics are meaningless. */
++ if (bfq_bfqq_just_split(bfqq))
++ return;
++
+ enable_idle = bfq_bfqq_idle_window(bfqq);
+
+ if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
+@@ -2725,6 +2988,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
+ !BFQQ_SEEKY(bfqq))
+ bfq_update_idle_window(bfqd, bfqq, bic);
++ bfq_clear_bfqq_just_split(bfqq);
+
+ bfq_log_bfqq(bfqd, bfqq,
+ "rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
+@@ -2785,13 +3049,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ static void bfq_insert_request(struct request_queue *q, struct request *rq)
+ {
+ struct bfq_data *bfqd = q->elevator->elevator_data;
+- struct bfq_queue *bfqq = RQ_BFQQ(rq);
++ struct bfq_queue *bfqq = RQ_BFQQ(rq), *new_bfqq;
+
+ assert_spin_locked(bfqd->queue->queue_lock);
++
++ /*
++ * An unplug may trigger a requeue of a request from the device
++ * driver: make sure we are in process context while trying to
++ * merge two bfq_queues.
++ */
++ if (!in_interrupt()) {
++ new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true);
++ if (new_bfqq != NULL) {
++ if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
++ new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
++ /*
++ * Release the request's reference to the old bfqq
++ * and make sure one is taken to the shared queue.
++ */
++ new_bfqq->allocated[rq_data_dir(rq)]++;
++ bfqq->allocated[rq_data_dir(rq)]--;
++ atomic_inc(&new_bfqq->ref);
++ bfq_put_queue(bfqq);
++ if (bic_to_bfqq(RQ_BIC(rq), 1) == bfqq)
++ bfq_merge_bfqqs(bfqd, RQ_BIC(rq),
++ bfqq, new_bfqq);
++ rq->elv.priv[1] = new_bfqq;
++ bfqq = new_bfqq;
++ } else
++ bfq_bfqq_increase_failed_cooperations(bfqq);
++ }
++
+ bfq_init_prio_data(bfqq, RQ_BIC(rq));
+
+ bfq_add_request(rq);
+
++ /*
++ * Here a newly-created bfq_queue has already started a weight-raising
++ * period: clear raising_time_left to prevent bfq_bfqq_save_state()
++ * from assigning it a full weight-raising period. See the detailed
++ * comments about this field in bfq_init_icq().
++ */
++ if (bfqq->bic != NULL)
++ bfqq->bic->wr_time_left = 0;
+ rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
+ list_add_tail(&rq->queuelist, &bfqq->fifo);
+
+@@ -2956,18 +3256,6 @@ static void bfq_put_request(struct request *rq)
+ }
+ }
+
+-static struct bfq_queue *
+-bfq_merge_bfqqs(struct bfq_data *bfqd, struct bfq_io_cq *bic,
+- struct bfq_queue *bfqq)
+-{
+- bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",
+- (long unsigned)bfqq->new_bfqq->pid);
+- bic_set_bfqq(bic, bfqq->new_bfqq, 1);
+- bfq_mark_bfqq_coop(bfqq->new_bfqq);
+- bfq_put_queue(bfqq);
+- return bic_to_bfqq(bic, 1);
+-}
+-
+ /*
+ * Returns NULL if a new bfqq should be allocated, or the old bfqq if this
+ * was the last process referring to said bfqq.
+@@ -2976,6 +3264,9 @@ static struct bfq_queue *
+ bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
+ {
+ bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
++
++ put_io_context(bic->icq.ioc);
++
+ if (bfqq_process_refs(bfqq) == 1) {
+ bfqq->pid = current->pid;
+ bfq_clear_bfqq_coop(bfqq);
+@@ -3004,6 +3295,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
+ struct bfq_queue *bfqq;
+ struct bfq_group *bfqg;
+ unsigned long flags;
++ bool split = false;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
+@@ -3022,24 +3314,14 @@ new_queue:
+ bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
+ bic_set_bfqq(bic, bfqq, is_sync);
+ } else {
+- /*
+- * If the queue was seeky for too long, break it apart.
+- */
++ /* If the queue was seeky for too long, break it apart. */
+ if (bfq_bfqq_coop(bfqq) && bfq_bfqq_split_coop(bfqq)) {
+ bfq_log_bfqq(bfqd, bfqq, "breaking apart bfqq");
+ bfqq = bfq_split_bfqq(bic, bfqq);
++ split = true;
+ if (!bfqq)
+ goto new_queue;
+ }
+-
+- /*
+- * Check to see if this queue is scheduled to merge with
+- * another closely cooperating queue. The merging of queues
+- * happens here as it must be done in process context.
+- * The reference on new_bfqq was taken in merge_bfqqs.
+- */
+- if (bfqq->new_bfqq != NULL)
+- bfqq = bfq_merge_bfqqs(bfqd, bic, bfqq);
+ }
+
+ bfqq->allocated[rw]++;
+@@ -3050,6 +3332,26 @@ new_queue:
+ rq->elv.priv[0] = bic;
+ rq->elv.priv[1] = bfqq;
+
++ /*
++ * If a bfq_queue has only one process reference, it is owned
++ * by only one bfq_io_cq: we can set the bic field of the
++ * bfq_queue to the address of that structure. Also, if the
++ * queue has just been split, mark a flag so that the
++ * information is available to the other scheduler hooks.
++ */
++ if (bfqq_process_refs(bfqq) == 1) {
++ bfqq->bic = bic;
++ if (split) {
++ bfq_mark_bfqq_just_split(bfqq);
++ /*
++ * If the queue has just been split from a shared
++ * queue, restore the idle window and the possible
++ * weight raising period.
++ */
++ bfq_bfqq_resume_state(bfqq, bic);
++ }
++ }
++
+ spin_unlock_irqrestore(q->queue_lock, flags);
+
+ return 0;
+diff --git a/block/bfq-sched.c b/block/bfq-sched.c
+index c4831b7..546a254 100644
+--- a/block/bfq-sched.c
++++ b/block/bfq-sched.c
+@@ -1084,34 +1084,6 @@ static struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd)
+ return bfqq;
+ }
+
+-/*
+- * Forced extraction of the given queue.
+- */
+-static void bfq_get_next_queue_forced(struct bfq_data *bfqd,
+- struct bfq_queue *bfqq)
+-{
+- struct bfq_entity *entity;
+- struct bfq_sched_data *sd;
+-
+- BUG_ON(bfqd->in_service_queue != NULL);
+-
+- entity = &bfqq->entity;
+- /*
+- * Bubble up extraction/update from the leaf to the root.
+- */
+- for_each_entity(entity) {
+- sd = entity->sched_data;
+- bfq_update_budget(entity);
+- bfq_update_vtime(bfq_entity_service_tree(entity));
+- bfq_active_extract(bfq_entity_service_tree(entity), entity);
+- sd->in_service_entity = entity;
+- sd->next_in_service = NULL;
+- entity->service = 0;
+- }
+-
+- return;
+-}
+-
+ static void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+ {
+ if (bfqd->in_service_bic != NULL) {
+diff --git a/block/bfq.h b/block/bfq.h
+index a83e69d..ebbd040 100644
+--- a/block/bfq.h
++++ b/block/bfq.h
+@@ -215,18 +215,21 @@ struct bfq_group;
+ * idle @bfq_queue with no outstanding requests, then
+ * the task associated with the queue it is deemed as
+ * soft real-time (see the comments to the function
+- * bfq_bfqq_softrt_next_start()).
++ * bfq_bfqq_softrt_next_start())
+ * @last_idle_bklogged: time of the last transition of the @bfq_queue from
+ * idle to backlogged
+ * @service_from_backlogged: cumulative service received from the @bfq_queue
+ * since the last transition from idle to
+ * backlogged
++ * @bic: pointer to the bfq_io_cq owning the bfq_queue, set to %NULL if the
++ * queue is shared
+ *
+- * A bfq_queue is a leaf request queue; it can be associated with an io_context
+- * or more, if it is async or shared between cooperating processes. @cgroup
+- * holds a reference to the cgroup, to be sure that it does not disappear while
+- * a bfqq still references it (mostly to avoid races between request issuing and
+- * task migration followed by cgroup destruction).
++ * A bfq_queue is a leaf request queue; it can be associated with an
++ * io_context or more, if it is async or shared between cooperating
++ * processes. @cgroup holds a reference to the cgroup, to be sure that it
++ * does not disappear while a bfqq still references it (mostly to avoid
++ * races between request issuing and task migration followed by cgroup
++ * destruction).
+ * All the fields are protected by the queue lock of the containing bfqd.
+ */
+ struct bfq_queue {
+@@ -264,6 +267,7 @@ struct bfq_queue {
+ unsigned int requests_within_timer;
+
+ pid_t pid;
++ struct bfq_io_cq *bic;
+
+ /* weight-raising fields */
+ unsigned long wr_cur_max_time;
+@@ -293,12 +297,34 @@ struct bfq_ttime {
+ * @icq: associated io_cq structure
+ * @bfqq: array of two process queues, the sync and the async
+ * @ttime: associated @bfq_ttime struct
++ * @wr_time_left: snapshot of the time left before weight raising ends
++ * for the sync queue associated to this process; this
++ * snapshot is taken to remember this value while the weight
++ * raising is suspended because the queue is merged with a
++ * shared queue, and is used to set @raising_cur_max_time
++ * when the queue is split from the shared queue and its
++ * weight is raised again
++ * @saved_idle_window: same purpose as the previous field for the idle
++ * window
++ * @saved_IO_bound: same purpose as the previous two fields for the I/O
++ * bound classification of a queue
++ * @cooperations: counter of consecutive successful queue merges underwent
++ * by any of the process' @bfq_queues
++ * @failed_cooperations: counter of consecutive failed queue merges of any
++ * of the process' @bfq_queues
+ */
+ struct bfq_io_cq {
+ struct io_cq icq; /* must be the first member */
+ struct bfq_queue *bfqq[2];
+ struct bfq_ttime ttime;
+ int ioprio;
++
++ unsigned int wr_time_left;
++ unsigned int saved_idle_window;
++ unsigned int saved_IO_bound;
++
++ unsigned int cooperations;
++ unsigned int failed_cooperations;
+ };
+
+ enum bfq_device_speed {
+@@ -511,7 +537,7 @@ enum bfqq_state_flags {
+ BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
+ BFQ_BFQQ_FLAG_sync, /* synchronous queue */
+ BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
+- BFQ_BFQQ_FLAG_IO_bound, /*
++ BFQ_BFQQ_FLAG_IO_bound, /*
+ * bfqq has timed-out at least once
+ * having consumed at most 2/10 of
+ * its budget
+@@ -520,12 +546,13 @@ enum bfqq_state_flags {
+ * bfqq has proved to be slow and
+ * seeky until budget timeout
+ */
+- BFQ_BFQQ_FLAG_softrt_update, /*
++ BFQ_BFQQ_FLAG_softrt_update, /*
+ * may need softrt-next-start
+ * update
+ */
+ BFQ_BFQQ_FLAG_coop, /* bfqq is shared */
+- BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be splitted */
++ BFQ_BFQQ_FLAG_split_coop, /* shared bfqq will be split */
++ BFQ_BFQQ_FLAG_just_split, /* queue has just been split */
+ };
+
+ #define BFQ_BFQQ_FNS(name) \
+@@ -554,6 +581,7 @@ BFQ_BFQQ_FNS(IO_bound);
+ BFQ_BFQQ_FNS(constantly_seeky);
+ BFQ_BFQQ_FNS(coop);
+ BFQ_BFQQ_FNS(split_coop);
++BFQ_BFQQ_FNS(just_split);
+ BFQ_BFQQ_FNS(softrt_update);
+ #undef BFQ_BFQQ_FNS
+
+--
+2.0.3
+
^ permalink raw reply related [flat|nested] 26+ messages in thread
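The long comment added in bfq_init_icq() above describes overloading the new wr_time_left field: a freshly created bic sets it to 1 as a marker that the queue deserves weight raising even if it gets merged before the raising starts, and bfq_insert_request() clears it to 0 once a real weight-raising period has begun. A minimal standalone sketch of that marker protocol, with all of the surrounding scheduler machinery stubbed out (the struct and function names here are illustrative, not the real kernel symbols):

```c
#include <assert.h>

/* Stand-in for the wr_time_left field added to struct bfq_io_cq. */
struct bic_sketch {
    unsigned int wr_time_left; /* 0: nothing saved; >0: weight raising owed */
};

/* bfq_init_icq(): mark a brand-new bic as needing weight raising, so the
 * information survives an early EQM merge that would otherwise lose it. */
static void sketch_init_icq(struct bic_sketch *bic)
{
    bic->wr_time_left = 1;
}

/* bfq_insert_request(): once a weight-raising period has actually started,
 * clear the marker so a later save/restore cycle does not assign a second
 * full weight-raising period. */
static void sketch_request_inserted(struct bic_sketch *bic)
{
    bic->wr_time_left = 0;
}

/* On a queue split, this is what bfq_bfqq_resume_state() would consult. */
static int sketch_needs_wr(const struct bic_sketch *bic)
{
    return bic->wr_time_left > 0;
}
```

This only models the flag's lifecycle, not the actual weight computation; in the patch the same field also stores a real jiffies-based remainder when a raising period is suspended by a merge.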
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-08-26 12:16 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-08-26 12:16 UTC (permalink / raw
To: gentoo-commits
commit: eb0a44de4e660928fbf347dae020a3b6cde29d7b
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Aug 26 12:16:43 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Aug 26 12:16:43 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=eb0a44de
Update to correct double mount thanks to mgorny
---
2900_dev-root-proc-mount-fix.patch | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/2900_dev-root-proc-mount-fix.patch b/2900_dev-root-proc-mount-fix.patch
index 4c89adf..6ea86e2 100644
--- a/2900_dev-root-proc-mount-fix.patch
+++ b/2900_dev-root-proc-mount-fix.patch
@@ -1,6 +1,6 @@
---- a/init/do_mounts.c 2013-01-25 19:11:11.609802424 -0500
-+++ b/init/do_mounts.c 2013-01-25 19:14:20.606053568 -0500
-@@ -461,7 +461,10 @@ void __init change_floppy(char *fmt, ...
+--- a/init/do_mounts.c 2014-08-26 08:03:30.000013100 -0400
++++ b/init/do_mounts.c 2014-08-26 08:11:19.720014712 -0400
+@@ -484,7 +484,10 @@ void __init change_floppy(char *fmt, ...
va_start(args, fmt);
vsprintf(buf, fmt, args);
va_end(args);
@@ -12,10 +12,11 @@
if (fd >= 0) {
sys_ioctl(fd, FDEJECT, 0);
sys_close(fd);
-@@ -505,7 +508,13 @@ void __init mount_root(void)
+@@ -527,8 +530,13 @@ void __init mount_root(void)
+ }
#endif
#ifdef CONFIG_BLOCK
- create_dev("/dev/root", ROOT_DEV);
+- create_dev("/dev/root", ROOT_DEV);
- mount_block_root("/dev/root", root_mountflags);
+ if (saved_root_name[0]) {
+ create_dev(saved_root_name, ROOT_DEV);
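The visible hunk of the updated 2900_dev-root-proc-mount-fix.patch replaces the unconditional create_dev("/dev/root", ROOT_DEV) with a branch on saved_root_name[0], so the device is created and mounted under the name the user passed via root= rather than the /dev/root alias. A hypothetical userspace sketch of just that name-selection decision (the helper name and the fallback behavior for an empty saved name are assumptions; the tail of the hunk is truncated in this message):

```c
#include <string.h>

/* Return the node name a mount_root()-like function would create and mount:
 * prefer the saved root= name from the kernel command line, fall back to
 * the generic /dev/root alias when none was recorded. */
static const char *pick_root_node(const char *saved_root_name)
{
    if (saved_root_name != NULL && saved_root_name[0] != '\0')
        return saved_root_name; /* e.g. "/dev/sda2" from root=/dev/sda2 */
    return "/dev/root";         /* legacy alias */
}
```

Mounting under the real name is what keeps the opaque /dev/root entry out of /proc/mounts when booting without an initramfs, which is the bug this patch series addresses.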
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-09 21:38 Vlastimil Babka
0 siblings, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2014-09-09 21:38 UTC (permalink / raw
To: gentoo-commits
commit: 3cbefb09946b411dbf2d5efb82db9628598dd2bb
Author: Caster <caster <AT> gentoo <DOT> org>
AuthorDate: Tue Sep 9 21:35:39 2014 +0000
Commit: Vlastimil Babka <caster <AT> gentoo <DOT> org>
CommitDate: Tue Sep 9 21:35:39 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=3cbefb09
Linux patch 3.16.2
---
0000_README | 4 +
1001_linux-3.16.2.patch | 5945 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 5949 insertions(+)
diff --git a/0000_README b/0000_README
index f57085e..1ecfc95 100644
--- a/0000_README
+++ b/0000_README
@@ -46,6 +46,10 @@ Patch: 1000_linux-3.16.1.patch
From: http://www.kernel.org
Desc: Linux 3.16.1
+Patch: 1001_linux-3.16.2.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.2
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1001_linux-3.16.2.patch b/1001_linux-3.16.2.patch
new file mode 100644
index 0000000..b0b883d
--- /dev/null
+++ b/1001_linux-3.16.2.patch
@@ -0,0 +1,5945 @@
+diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt
+index 7ccf933bfbe0..48148d6d9307 100644
+--- a/Documentation/sound/alsa/ALSA-Configuration.txt
++++ b/Documentation/sound/alsa/ALSA-Configuration.txt
+@@ -2026,8 +2026,8 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed.
+ -------------------
+
+ Module for sound cards based on the Asus AV66/AV100/AV200 chips,
+- i.e., Xonar D1, DX, D2, D2X, DS, Essence ST (Deluxe), Essence STX,
+- HDAV1.3 (Deluxe), and HDAV1.3 Slim.
++ i.e., Xonar D1, DX, D2, D2X, DS, DSX, Essence ST (Deluxe),
++ Essence STX (II), HDAV1.3 (Deluxe), and HDAV1.3 Slim.
+
+ This module supports autoprobe and multiple cards.
+
+diff --git a/Documentation/stable_kernel_rules.txt b/Documentation/stable_kernel_rules.txt
+index cbc2f03056bd..aee73e78c7d4 100644
+--- a/Documentation/stable_kernel_rules.txt
++++ b/Documentation/stable_kernel_rules.txt
+@@ -29,6 +29,9 @@ Rules on what kind of patches are accepted, and which ones are not, into the
+
+ Procedure for submitting patches to the -stable tree:
+
++ - If the patch covers files in net/ or drivers/net please follow netdev stable
++ submission guidelines as described in
++ Documentation/networking/netdev-FAQ.txt
+ - Send the patch, after verifying that it follows the above rules, to
+ stable@vger.kernel.org. You must note the upstream commit ID in the
+ changelog of your submission, as well as the kernel version you wish
+diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
+index 0fe36497642c..612e6e99d1e5 100644
+--- a/Documentation/virtual/kvm/api.txt
++++ b/Documentation/virtual/kvm/api.txt
+@@ -1869,7 +1869,8 @@ registers, find a list below:
+ PPC | KVM_REG_PPC_PID | 64
+ PPC | KVM_REG_PPC_ACOP | 64
+ PPC | KVM_REG_PPC_VRSAVE | 32
+- PPC | KVM_REG_PPC_LPCR | 64
++ PPC | KVM_REG_PPC_LPCR | 32
++ PPC | KVM_REG_PPC_LPCR_64 | 64
+ PPC | KVM_REG_PPC_PPR | 64
+ PPC | KVM_REG_PPC_ARCH_COMPAT 32
+ PPC | KVM_REG_PPC_DABRX | 32
+diff --git a/Makefile b/Makefile
+index 87663a2d1d10..c2617526e605 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 1
++SUBLEVEL = 2
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/am4372.dtsi b/arch/arm/boot/dts/am4372.dtsi
+index 49fa59622254..c9aee0e799bb 100644
+--- a/arch/arm/boot/dts/am4372.dtsi
++++ b/arch/arm/boot/dts/am4372.dtsi
+@@ -168,9 +168,6 @@
+ ti,hwmods = "mailbox";
+ ti,mbox-num-users = <4>;
+ ti,mbox-num-fifos = <8>;
+- ti,mbox-names = "wkup_m3";
+- ti,mbox-data = <0 0 0 0>;
+- status = "disabled";
+ };
+
+ timer1: timer@44e31000 {
+diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
+index 43876245fc57..21ca0cebcab0 100644
+--- a/arch/arm/include/asm/unistd.h
++++ b/arch/arm/include/asm/unistd.h
+@@ -15,7 +15,17 @@
+
+ #include <uapi/asm/unistd.h>
+
++/*
++ * This may need to be greater than __NR_last_syscall+1 in order to
++ * account for the padding in the syscall table
++ */
+ #define __NR_syscalls (384)
++
++/*
++ * *NOTE*: This is a ghost syscall private to the kernel. Only the
++ * __kuser_cmpxchg code in entry-armv.S should be aware of its
++ * existence. Don't ever use this from user code.
++ */
+ #define __ARM_NR_cmpxchg (__ARM_NR_BASE+0x00fff0)
+
+ #define __ARCH_WANT_STAT64
+diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
+index ba94446c72d9..acd5b66ea3aa 100644
+--- a/arch/arm/include/uapi/asm/unistd.h
++++ b/arch/arm/include/uapi/asm/unistd.h
+@@ -411,11 +411,6 @@
+ #define __NR_renameat2 (__NR_SYSCALL_BASE+382)
+
+ /*
+- * This may need to be greater than __NR_last_syscall+1 in order to
+- * account for the padding in the syscall table
+- */
+-
+-/*
+ * The following SWIs are ARM private.
+ */
+ #define __ARM_NR_BASE (__NR_SYSCALL_BASE+0x0f0000)
+@@ -426,12 +421,6 @@
+ #define __ARM_NR_set_tls (__ARM_NR_BASE+5)
+
+ /*
+- * *NOTE*: This is a ghost syscall private to the kernel. Only the
+- * __kuser_cmpxchg code in entry-armv.S should be aware of its
+- * existence. Don't ever use this from user code.
+- */
+-
+-/*
+ * The following syscalls are obsolete and no longer available for EABI.
+ */
+ #if !defined(__KERNEL__)
+diff --git a/arch/arm/mach-omap2/control.c b/arch/arm/mach-omap2/control.c
+index 751f3549bf6f..acadac0992b6 100644
+--- a/arch/arm/mach-omap2/control.c
++++ b/arch/arm/mach-omap2/control.c
+@@ -314,7 +314,8 @@ void omap3_save_scratchpad_contents(void)
+ scratchpad_contents.public_restore_ptr =
+ virt_to_phys(omap3_restore_3630);
+ else if (omap_rev() != OMAP3430_REV_ES3_0 &&
+- omap_rev() != OMAP3430_REV_ES3_1)
++ omap_rev() != OMAP3430_REV_ES3_1 &&
++ omap_rev() != OMAP3430_REV_ES3_1_2)
+ scratchpad_contents.public_restore_ptr =
+ virt_to_phys(omap3_restore);
+ else
+diff --git a/arch/arm/mach-omap2/omap_hwmod.c b/arch/arm/mach-omap2/omap_hwmod.c
+index 6c074f37cdd2..da1b256caccc 100644
+--- a/arch/arm/mach-omap2/omap_hwmod.c
++++ b/arch/arm/mach-omap2/omap_hwmod.c
+@@ -2185,6 +2185,8 @@ static int _enable(struct omap_hwmod *oh)
+ oh->mux->pads_dynamic))) {
+ omap_hwmod_mux(oh->mux, _HWMOD_STATE_ENABLED);
+ _reconfigure_io_chain();
++ } else if (oh->flags & HWMOD_FORCE_MSTANDBY) {
++ _reconfigure_io_chain();
+ }
+
+ _add_initiator_dep(oh, mpu_oh);
+@@ -2291,6 +2293,8 @@ static int _idle(struct omap_hwmod *oh)
+ if (oh->mux && oh->mux->pads_dynamic) {
+ omap_hwmod_mux(oh->mux, _HWMOD_STATE_IDLE);
+ _reconfigure_io_chain();
++ } else if (oh->flags & HWMOD_FORCE_MSTANDBY) {
++ _reconfigure_io_chain();
+ }
+
+ oh->_state = _HWMOD_STATE_IDLE;
+diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
+index a5176cf32dad..f2defe1c380c 100644
+--- a/arch/arm64/include/asm/cacheflush.h
++++ b/arch/arm64/include/asm/cacheflush.h
+@@ -138,19 +138,10 @@ static inline void __flush_icache_all(void)
+ #define flush_icache_page(vma,page) do { } while (0)
+
+ /*
+- * flush_cache_vmap() is used when creating mappings (eg, via vmap,
+- * vmalloc, ioremap etc) in kernel space for pages. On non-VIPT
+- * caches, since the direct-mappings of these pages may contain cached
+- * data, we need to do a full cache flush to ensure that writebacks
+- * don't corrupt data placed into these pages via the new mappings.
++ * Not required on AArch64 (PIPT or VIPT non-aliasing D-cache).
+ */
+ static inline void flush_cache_vmap(unsigned long start, unsigned long end)
+ {
+- /*
+- * set_pte_at() called from vmap_pte_range() does not
+- * have a DSB after cleaning the cache line.
+- */
+- dsb(ish);
+ }
+
+ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
+diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
+index e0ccceb317d9..2a1508cdead0 100644
+--- a/arch/arm64/include/asm/pgtable.h
++++ b/arch/arm64/include/asm/pgtable.h
+@@ -138,6 +138,8 @@ extern struct page *empty_zero_page;
+
+ #define pte_valid_user(pte) \
+ ((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER))
++#define pte_valid_not_user(pte) \
++ ((pte_val(pte) & (PTE_VALID | PTE_USER)) == PTE_VALID)
+
+ static inline pte_t pte_wrprotect(pte_t pte)
+ {
+@@ -184,6 +186,15 @@ static inline pte_t pte_mkspecial(pte_t pte)
+ static inline void set_pte(pte_t *ptep, pte_t pte)
+ {
+ *ptep = pte;
++
++ /*
++ * Only if the new pte is valid and kernel, otherwise TLB maintenance
++ * or update_mmu_cache() have the necessary barriers.
++ */
++ if (pte_valid_not_user(pte)) {
++ dsb(ishst);
++ isb();
++ }
+ }
+
+ extern void __sync_icache_dcache(pte_t pteval, unsigned long addr);
+@@ -303,6 +314,7 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
+ {
+ *pmdp = pmd;
+ dsb(ishst);
++ isb();
+ }
+
+ static inline void pmd_clear(pmd_t *pmdp)
+@@ -333,6 +345,7 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
+ {
+ *pudp = pud;
+ dsb(ishst);
++ isb();
+ }
+
+ static inline void pud_clear(pud_t *pudp)
+diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
+index b9349c4513ea..3796ea6bb734 100644
+--- a/arch/arm64/include/asm/tlbflush.h
++++ b/arch/arm64/include/asm/tlbflush.h
+@@ -122,6 +122,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
+ for (addr = start; addr < end; addr += 1 << (PAGE_SHIFT - 12))
+ asm("tlbi vaae1is, %0" : : "r"(addr));
+ dsb(ish);
++ isb();
+ }
+
+ /*
+@@ -131,8 +132,8 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+ {
+ /*
+- * set_pte() does not have a DSB, so make sure that the page table
+- * write is visible.
++ * set_pte() does not have a DSB for user mappings, so make sure that
++ * the page table write is visible.
+ */
+ dsb(ishst);
+ }
+diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
+index a7fb874b595e..fe5b94078d82 100644
+--- a/arch/arm64/kernel/debug-monitors.c
++++ b/arch/arm64/kernel/debug-monitors.c
+@@ -315,20 +315,20 @@ static int brk_handler(unsigned long addr, unsigned int esr,
+ {
+ siginfo_t info;
+
+- if (call_break_hook(regs, esr) == DBG_HOOK_HANDLED)
+- return 0;
++ if (user_mode(regs)) {
++ info = (siginfo_t) {
++ .si_signo = SIGTRAP,
++ .si_errno = 0,
++ .si_code = TRAP_BRKPT,
++ .si_addr = (void __user *)instruction_pointer(regs),
++ };
+
+- if (!user_mode(regs))
++ force_sig_info(SIGTRAP, &info, current);
++ } else if (call_break_hook(regs, esr) != DBG_HOOK_HANDLED) {
++ pr_warning("Unexpected kernel BRK exception at EL1\n");
+ return -EFAULT;
++ }
+
+- info = (siginfo_t) {
+- .si_signo = SIGTRAP,
+- .si_errno = 0,
+- .si_code = TRAP_BRKPT,
+- .si_addr = (void __user *)instruction_pointer(regs),
+- };
+-
+- force_sig_info(SIGTRAP, &info, current);
+ return 0;
+ }
+
+diff --git a/arch/arm64/kernel/efi.c b/arch/arm64/kernel/efi.c
+index 14db1f6e8d7f..c0aead7d1a72 100644
+--- a/arch/arm64/kernel/efi.c
++++ b/arch/arm64/kernel/efi.c
+@@ -464,6 +464,8 @@ static int __init arm64_enter_virtual_mode(void)
+
+ set_bit(EFI_RUNTIME_SERVICES, &efi.flags);
+
++ efi.runtime_version = efi.systab->hdr.revision;
++
+ return 0;
+ }
+ early_initcall(arm64_enter_virtual_mode);
+diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
+index 736c17a226e9..bf0fc6b16ad9 100644
+--- a/arch/mips/math-emu/cp1emu.c
++++ b/arch/mips/math-emu/cp1emu.c
+@@ -1827,7 +1827,7 @@ dcopuop:
+ case -1:
+
+ if (cpu_has_mips_4_5_r)
+- cbit = fpucondbit[MIPSInst_RT(ir) >> 2];
++ cbit = fpucondbit[MIPSInst_FD(ir) >> 2];
+ else
+ cbit = FPU_CSR_COND;
+ if (rv.w)
+diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
+index 2bc4a9409a93..de7d426a9b0c 100644
+--- a/arch/powerpc/include/uapi/asm/kvm.h
++++ b/arch/powerpc/include/uapi/asm/kvm.h
+@@ -548,6 +548,7 @@ struct kvm_get_htab_header {
+
+ #define KVM_REG_PPC_VRSAVE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb4)
+ #define KVM_REG_PPC_LPCR (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xb5)
++#define KVM_REG_PPC_LPCR_64 (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb5)
+ #define KVM_REG_PPC_PPR (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xb6)
+
+ /* Architecture compatibility level */
+diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
+index fbd01eba4473..94802d267022 100644
+--- a/arch/powerpc/kernel/eeh_pe.c
++++ b/arch/powerpc/kernel/eeh_pe.c
+@@ -802,53 +802,33 @@ void eeh_pe_restore_bars(struct eeh_pe *pe)
+ */
+ const char *eeh_pe_loc_get(struct eeh_pe *pe)
+ {
+- struct pci_controller *hose;
+ struct pci_bus *bus = eeh_pe_bus_get(pe);
+- struct pci_dev *pdev;
+- struct device_node *dn;
+- const char *loc;
++ struct device_node *dn = pci_bus_to_OF_node(bus);
++ const char *loc = NULL;
+
+- if (!bus)
+- return "N/A";
++ if (!dn)
++ goto out;
+
+ /* PHB PE or root PE ? */
+ if (pci_is_root_bus(bus)) {
+- hose = pci_bus_to_host(bus);
+- loc = of_get_property(hose->dn,
+- "ibm,loc-code", NULL);
+- if (loc)
+- return loc;
+- loc = of_get_property(hose->dn,
+- "ibm,io-base-loc-code", NULL);
++ loc = of_get_property(dn, "ibm,loc-code", NULL);
++ if (!loc)
++ loc = of_get_property(dn, "ibm,io-base-loc-code", NULL);
+ if (loc)
+- return loc;
+-
+- pdev = pci_get_slot(bus, 0x0);
+- } else {
+- pdev = bus->self;
+- }
+-
+- if (!pdev) {
+- loc = "N/A";
+- goto out;
+- }
++ goto out;
+
+- dn = pci_device_to_OF_node(pdev);
+- if (!dn) {
+- loc = "N/A";
+- goto out;
++ /* Check the root port */
++ dn = dn->child;
++ if (!dn)
++ goto out;
+ }
+
+ loc = of_get_property(dn, "ibm,loc-code", NULL);
+ if (!loc)
+ loc = of_get_property(dn, "ibm,slot-location-code", NULL);
+- if (!loc)
+- loc = "N/A";
+
+ out:
+- if (pci_is_root_bus(bus) && pdev)
+- pci_dev_put(pdev);
+- return loc;
++ return loc ? loc : "N/A";
+ }
+
+ /**
+diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
+index 7a12edbb61e7..0f3a19237444 100644
+--- a/arch/powerpc/kvm/book3s_hv.c
++++ b/arch/powerpc/kvm/book3s_hv.c
+@@ -785,7 +785,8 @@ static int kvm_arch_vcpu_ioctl_set_sregs_hv(struct kvm_vcpu *vcpu,
+ return 0;
+ }
+
+-static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr)
++static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr,
++ bool preserve_top32)
+ {
+ struct kvmppc_vcore *vc = vcpu->arch.vcore;
+ u64 mask;
+@@ -820,6 +821,10 @@ static void kvmppc_set_lpcr(struct kvm_vcpu *vcpu, u64 new_lpcr)
+ mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
+ if (cpu_has_feature(CPU_FTR_ARCH_207S))
+ mask |= LPCR_AIL;
++
++ /* Broken 32-bit version of LPCR must not clear top bits */
++ if (preserve_top32)
++ mask &= 0xFFFFFFFF;
+ vc->lpcr = (vc->lpcr & ~mask) | (new_lpcr & mask);
+ spin_unlock(&vc->lock);
+ }
+@@ -939,6 +944,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
+ *val = get_reg_val(id, vcpu->arch.vcore->tb_offset);
+ break;
+ case KVM_REG_PPC_LPCR:
++ case KVM_REG_PPC_LPCR_64:
+ *val = get_reg_val(id, vcpu->arch.vcore->lpcr);
+ break;
+ case KVM_REG_PPC_PPR:
+@@ -1150,7 +1156,10 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
+ ALIGN(set_reg_val(id, *val), 1UL << 24);
+ break;
+ case KVM_REG_PPC_LPCR:
+- kvmppc_set_lpcr(vcpu, set_reg_val(id, *val));
++ kvmppc_set_lpcr(vcpu, set_reg_val(id, *val), true);
++ break;
++ case KVM_REG_PPC_LPCR_64:
++ kvmppc_set_lpcr(vcpu, set_reg_val(id, *val), false);
+ break;
+ case KVM_REG_PPC_PPR:
+ vcpu->arch.ppr = set_reg_val(id, *val);
+diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
+index 8eef1e519077..66b7afec250f 100644
+--- a/arch/powerpc/kvm/book3s_pr.c
++++ b/arch/powerpc/kvm/book3s_pr.c
+@@ -1233,6 +1233,7 @@ static int kvmppc_get_one_reg_pr(struct kvm_vcpu *vcpu, u64 id,
+ *val = get_reg_val(id, to_book3s(vcpu)->hior);
+ break;
+ case KVM_REG_PPC_LPCR:
++ case KVM_REG_PPC_LPCR_64:
+ /*
+ * We are only interested in the LPCR_ILE bit
+ */
+@@ -1268,6 +1269,7 @@ static int kvmppc_set_one_reg_pr(struct kvm_vcpu *vcpu, u64 id,
+ to_book3s(vcpu)->hior_explicit = true;
+ break;
+ case KVM_REG_PPC_LPCR:
++ case KVM_REG_PPC_LPCR_64:
+ kvmppc_set_lpcr_pr(vcpu, set_reg_val(id, *val));
+ break;
+ default:
+diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
+index de19edeaa7a7..3136ae2f75af 100644
+--- a/arch/powerpc/platforms/powernv/pci-ioda.c
++++ b/arch/powerpc/platforms/powernv/pci-ioda.c
+@@ -491,6 +491,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
+ set_dma_ops(&pdev->dev, &dma_iommu_ops);
+ set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+ }
++ *pdev->dev.dma_mask = dma_mask;
+ return 0;
+ }
+
+diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
+index 203cbf0dc101..89e23811199c 100644
+--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
++++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
+@@ -118,10 +118,10 @@ int remove_phb_dynamic(struct pci_controller *phb)
+ }
+ }
+
+- /* Unregister the bridge device from sysfs and remove the PCI bus */
+- device_unregister(b->bridge);
++ /* Remove the PCI bus and unregister the bridge device from sysfs */
+ phb->bus = NULL;
+ pci_remove_bus(b);
++ device_unregister(b->bridge);
+
+ /* Now release the IO resource */
+ if (res->flags & IORESOURCE_IO)
+diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
+index 37b8241ec784..f90ad8592b36 100644
+--- a/arch/s390/mm/pgtable.c
++++ b/arch/s390/mm/pgtable.c
+@@ -1279,6 +1279,7 @@ static unsigned long page_table_realloc_pmd(struct mmu_gather *tlb,
+ {
+ unsigned long next, *table, *new;
+ struct page *page;
++ spinlock_t *ptl;
+ pmd_t *pmd;
+
+ pmd = pmd_offset(pud, addr);
+@@ -1296,7 +1297,7 @@ again:
+ if (!new)
+ return -ENOMEM;
+
+- spin_lock(&mm->page_table_lock);
++ ptl = pmd_lock(mm, pmd);
+ if (likely((unsigned long *) pmd_deref(*pmd) == table)) {
+ /* Nuke pmd entry pointing to the "short" page table */
+ pmdp_flush_lazy(mm, addr, pmd);
+@@ -1310,7 +1311,7 @@ again:
+ page_table_free_rcu(tlb, table);
+ new = NULL;
+ }
+- spin_unlock(&mm->page_table_lock);
++ spin_unlock(ptl);
+ if (new) {
+ page_table_free_pgste(new);
+ goto again;
+diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
+index d24887b645dc..27adfd902c6f 100644
+--- a/arch/x86/Kconfig
++++ b/arch/x86/Kconfig
+@@ -1537,6 +1537,7 @@ config EFI
+ config EFI_STUB
+ bool "EFI stub support"
+ depends on EFI
++ select RELOCATABLE
+ ---help---
+ This kernel feature allows a bzImage to be loaded directly
+ by EFI firmware without the use of a bootloader.
+diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
+index 49205d01b9ad..9f83c171ac18 100644
+--- a/arch/x86/include/asm/kvm_host.h
++++ b/arch/x86/include/asm/kvm_host.h
+@@ -95,7 +95,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
+ #define KVM_REFILL_PAGES 25
+ #define KVM_MAX_CPUID_ENTRIES 80
+ #define KVM_NR_FIXED_MTRR_REGION 88
+-#define KVM_NR_VAR_MTRR 10
++#define KVM_NR_VAR_MTRR 8
+
+ #define ASYNC_PF_PER_VCPU 64
+
+diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
+index 0ec056012618..aa97a070f09f 100644
+--- a/arch/x86/include/asm/pgtable.h
++++ b/arch/x86/include/asm/pgtable.h
+@@ -131,8 +131,13 @@ static inline int pte_exec(pte_t pte)
+
+ static inline int pte_special(pte_t pte)
+ {
+- return (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_SPECIAL)) ==
+- (_PAGE_PRESENT|_PAGE_SPECIAL);
++ /*
++ * See CONFIG_NUMA_BALANCING pte_numa in include/asm-generic/pgtable.h.
++ * On x86 we have _PAGE_BIT_NUMA == _PAGE_BIT_GLOBAL+1 ==
++ * __PAGE_BIT_SOFTW1 == _PAGE_BIT_SPECIAL.
++ */
++ return (pte_flags(pte) & _PAGE_SPECIAL) &&
++ (pte_flags(pte) & (_PAGE_PRESENT|_PAGE_PROTNONE));
+ }
+
+ static inline unsigned long pte_pfn(pte_t pte)
+diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
+index 9a316b21df8b..3bdb95ae8c43 100644
+--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
++++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
+@@ -42,7 +42,7 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
+ * cmci_discover_lock protects against parallel discovery attempts
+ * which could race against each other.
+ */
+-static DEFINE_SPINLOCK(cmci_discover_lock);
++static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
+
+ #define CMCI_THRESHOLD 1
+ #define CMCI_POLL_INTERVAL (30 * HZ)
+@@ -144,14 +144,14 @@ static void cmci_storm_disable_banks(void)
+ int bank;
+ u64 val;
+
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ owned = __get_cpu_var(mce_banks_owned);
+ for_each_set_bit(bank, owned, MAX_NR_BANKS) {
+ rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_EN;
+ wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+ }
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ }
+
+ static bool cmci_storm_detect(void)
+@@ -211,7 +211,7 @@ static void cmci_discover(int banks)
+ int i;
+ int bios_wrong_thresh = 0;
+
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ for (i = 0; i < banks; i++) {
+ u64 val;
+ int bios_zero_thresh = 0;
+@@ -266,7 +266,7 @@ static void cmci_discover(int banks)
+ WARN_ON(!test_bit(i, __get_cpu_var(mce_poll_banks)));
+ }
+ }
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ if (mca_cfg.bios_cmci_threshold && bios_wrong_thresh) {
+ pr_info_once(
+ "bios_cmci_threshold: Some banks do not have valid thresholds set\n");
+@@ -316,10 +316,10 @@ void cmci_clear(void)
+
+ if (!cmci_supported(&banks))
+ return;
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ for (i = 0; i < banks; i++)
+ __cmci_disable_bank(i);
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ }
+
+ static void cmci_rediscover_work_func(void *arg)
+@@ -360,9 +360,9 @@ void cmci_disable_bank(int bank)
+ if (!cmci_supported(&banks))
+ return;
+
+- spin_lock_irqsave(&cmci_discover_lock, flags);
++ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ __cmci_disable_bank(bank);
+- spin_unlock_irqrestore(&cmci_discover_lock, flags);
++ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ }
+
+ static void intel_init_cmci(void)
+diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
+index 2a26819bb6a8..80eab01c1a68 100644
+--- a/arch/x86/kernel/resource.c
++++ b/arch/x86/kernel/resource.c
+@@ -37,10 +37,12 @@ static void remove_e820_regions(struct resource *avail)
+
+ void arch_remove_reservations(struct resource *avail)
+ {
+- /* Trim out BIOS areas (low 1MB and high 2MB) and E820 regions */
++ /*
++ * Trim out BIOS area (high 2MB) and E820 regions. We do not remove
++ * the low 1MB unconditionally, as this area is needed for some ISA
++ * cards requiring a memory range, e.g. the i82365 PCMCIA controller.
++ */
+ if (avail->flags & IORESOURCE_MEM) {
+- if (avail->start < BIOS_END)
+- avail->start = BIOS_END;
+ resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);
+
+ remove_e820_regions(avail);
+diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
+index ea5b5709aa76..e1e1e80fc6a6 100644
+--- a/arch/x86/kernel/vsyscall_64.c
++++ b/arch/x86/kernel/vsyscall_64.c
+@@ -81,10 +81,10 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
+ if (!show_unhandled_signals)
+ return;
+
+- pr_notice_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+- level, current->comm, task_pid_nr(current),
+- message, regs->ip, regs->cs,
+- regs->sp, regs->ax, regs->si, regs->di);
++ printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
++ level, current->comm, task_pid_nr(current),
++ message, regs->ip, regs->cs,
++ regs->sp, regs->ax, regs->si, regs->di);
+ }
+
+ static int addr_to_vsyscall_nr(unsigned long addr)
+diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
+index e4e833d3d7d7..2d3b8d0efa0f 100644
+--- a/arch/x86/kvm/emulate.c
++++ b/arch/x86/kvm/emulate.c
+@@ -2017,6 +2017,7 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
+ {
+ int rc;
+ unsigned long cs;
++ int cpl = ctxt->ops->cpl(ctxt);
+
+ rc = emulate_pop(ctxt, &ctxt->_eip, ctxt->op_bytes);
+ if (rc != X86EMUL_CONTINUE)
+@@ -2026,6 +2027,9 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
+ rc = emulate_pop(ctxt, &cs, ctxt->op_bytes);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
++ /* Outer-privilege level return is not implemented */
++ if (ctxt->mode >= X86EMUL_MODE_PROT16 && (cs & 3) > cpl)
++ return X86EMUL_UNHANDLEABLE;
+ rc = load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS);
+ return rc;
+ }
+diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
+index bd0da433e6d7..a1ec6a50a05a 100644
+--- a/arch/x86/kvm/irq.c
++++ b/arch/x86/kvm/irq.c
+@@ -108,7 +108,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
+
+ vector = kvm_cpu_get_extint(v);
+
+- if (kvm_apic_vid_enabled(v->kvm) || vector != -1)
++ if (vector != -1)
+ return vector; /* PIC */
+
+ return kvm_get_apic_interrupt(v); /* APIC */
+diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
+index 006911858174..453e5fbbb7ae 100644
+--- a/arch/x86/kvm/lapic.c
++++ b/arch/x86/kvm/lapic.c
+@@ -352,25 +352,46 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
+
+ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
+ {
+- apic->irr_pending = false;
++ struct kvm_vcpu *vcpu;
++
++ vcpu = apic->vcpu;
++
+ apic_clear_vector(vec, apic->regs + APIC_IRR);
+- if (apic_search_irr(apic) != -1)
+- apic->irr_pending = true;
++ if (unlikely(kvm_apic_vid_enabled(vcpu->kvm)))
++ /* try to update RVI */
++ kvm_make_request(KVM_REQ_EVENT, vcpu);
++ else {
++ vec = apic_search_irr(apic);
++ apic->irr_pending = (vec != -1);
++ }
+ }
+
+ static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+ {
+- /* Note that we never get here with APIC virtualization enabled. */
++ struct kvm_vcpu *vcpu;
++
++ if (__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
++ return;
++
++ vcpu = apic->vcpu;
+
+- if (!__apic_test_and_set_vector(vec, apic->regs + APIC_ISR))
+- ++apic->isr_count;
+- BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
+ /*
+- * ISR (in service register) bit is set when injecting an interrupt.
+- * The highest vector is injected. Thus the latest bit set matches
+- * the highest bit in ISR.
++ * With APIC virtualization enabled, all caching is disabled
++ * because the processor can modify ISR under the hood. Instead
++ * just set SVI.
+ */
+- apic->highest_isr_cache = vec;
++ if (unlikely(kvm_apic_vid_enabled(vcpu->kvm)))
++ kvm_x86_ops->hwapic_isr_update(vcpu->kvm, vec);
++ else {
++ ++apic->isr_count;
++ BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
++ /*
++ * ISR (in service register) bit is set when injecting an interrupt.
++ * The highest vector is injected. Thus the latest bit set matches
++ * the highest bit in ISR.
++ */
++ apic->highest_isr_cache = vec;
++ }
+ }
+
+ static inline int apic_find_highest_isr(struct kvm_lapic *apic)
+@@ -1627,11 +1648,16 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
+ int vector = kvm_apic_has_interrupt(vcpu);
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+- /* Note that we never get here with APIC virtualization enabled. */
+-
+ if (vector == -1)
+ return -1;
+
++ /*
++ * We get here even with APIC virtualization enabled, if doing
++ * nested virtualization and L1 runs with the "acknowledge interrupt
++ * on exit" mode. Then we cannot inject the interrupt via RVI,
++ * because the process would deliver it through the IDT.
++ */
++
+ apic_set_isr(vector, apic);
+ apic_update_ppr(apic);
+ apic_clear_irr(vector, apic);
+diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
+index a19ed92e74e4..2ae525e0d8ba 100644
+--- a/arch/x86/pci/i386.c
++++ b/arch/x86/pci/i386.c
+@@ -162,6 +162,10 @@ pcibios_align_resource(void *data, const struct resource *res,
+ return start;
+ if (start & 0x300)
+ start = (start + 0x3ff) & ~0x3ff;
++ } else if (res->flags & IORESOURCE_MEM) {
++ /* The low 1MB range is reserved for ISA cards */
++ if (start < BIOS_END)
++ start = BIOS_END;
+ }
+ return start;
+ }
+diff --git a/arch/x86/xen/grant-table.c b/arch/x86/xen/grant-table.c
+index ebfa9b2c871d..767c9cbb869f 100644
+--- a/arch/x86/xen/grant-table.c
++++ b/arch/x86/xen/grant-table.c
+@@ -168,6 +168,7 @@ static int __init xlated_setup_gnttab_pages(void)
+ {
+ struct page **pages;
+ xen_pfn_t *pfns;
++ void *vaddr;
+ int rc;
+ unsigned int i;
+ unsigned long nr_grant_frames = gnttab_max_grant_frames();
+@@ -193,21 +194,20 @@ static int __init xlated_setup_gnttab_pages(void)
+ for (i = 0; i < nr_grant_frames; i++)
+ pfns[i] = page_to_pfn(pages[i]);
+
+- rc = arch_gnttab_map_shared(pfns, nr_grant_frames, nr_grant_frames,
+- &xen_auto_xlat_grant_frames.vaddr);
+-
+- if (rc) {
++ vaddr = vmap(pages, nr_grant_frames, 0, PAGE_KERNEL);
++ if (!vaddr) {
+ pr_warn("%s Couldn't map %ld pfns rc:%d\n", __func__,
+ nr_grant_frames, rc);
+ free_xenballooned_pages(nr_grant_frames, pages);
+ kfree(pages);
+ kfree(pfns);
+- return rc;
++ return -ENOMEM;
+ }
+ kfree(pages);
+
+ xen_auto_xlat_grant_frames.pfn = pfns;
+ xen_auto_xlat_grant_frames.count = nr_grant_frames;
++ xen_auto_xlat_grant_frames.vaddr = vaddr;
+
+ return 0;
+ }
+diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
+index 7b78f88c1707..5718b0b58b60 100644
+--- a/arch/x86/xen/time.c
++++ b/arch/x86/xen/time.c
+@@ -444,7 +444,7 @@ void xen_setup_timer(int cpu)
+
+ irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu, xen_timer_interrupt,
+ IRQF_PERCPU|IRQF_NOBALANCING|IRQF_TIMER|
+- IRQF_FORCE_RESUME,
++ IRQF_FORCE_RESUME|IRQF_EARLY_RESUME,
+ name, NULL);
+ (void)xen_set_irq_priority(irq, XEN_IRQ_PRIORITY_MAX);
+
+diff --git a/drivers/char/tpm/tpm_i2c_stm_st33.c b/drivers/char/tpm/tpm_i2c_stm_st33.c
+index 3b7bf2162898..4669e3713428 100644
+--- a/drivers/char/tpm/tpm_i2c_stm_st33.c
++++ b/drivers/char/tpm/tpm_i2c_stm_st33.c
+@@ -714,6 +714,7 @@ tpm_st33_i2c_probe(struct i2c_client *client, const struct i2c_device_id *id)
+ }
+
+ tpm_get_timeouts(chip);
++ tpm_do_selftest(chip);
+
+ dev_info(chip->dev, "TPM I2C Initialized\n");
+ return 0;
+diff --git a/drivers/crypto/ux500/cryp/cryp_core.c b/drivers/crypto/ux500/cryp/cryp_core.c
+index a999f537228f..92105f3dc8e0 100644
+--- a/drivers/crypto/ux500/cryp/cryp_core.c
++++ b/drivers/crypto/ux500/cryp/cryp_core.c
+@@ -190,7 +190,7 @@ static void add_session_id(struct cryp_ctx *ctx)
+ static irqreturn_t cryp_interrupt_handler(int irq, void *param)
+ {
+ struct cryp_ctx *ctx;
+- int i;
++ int count;
+ struct cryp_device_data *device_data;
+
+ if (param == NULL) {
+@@ -215,12 +215,11 @@ static irqreturn_t cryp_interrupt_handler(int irq, void *param)
+ if (cryp_pending_irq_src(device_data,
+ CRYP_IRQ_SRC_OUTPUT_FIFO)) {
+ if (ctx->outlen / ctx->blocksize > 0) {
+- for (i = 0; i < ctx->blocksize / 4; i++) {
+- *(ctx->outdata) = readl_relaxed(
+- &device_data->base->dout);
+- ctx->outdata += 4;
+- ctx->outlen -= 4;
+- }
++ count = ctx->blocksize / 4;
++
++ readsl(&device_data->base->dout, ctx->outdata, count);
++ ctx->outdata += count;
++ ctx->outlen -= count;
+
+ if (ctx->outlen == 0) {
+ cryp_disable_irq_src(device_data,
+@@ -230,12 +229,12 @@ static irqreturn_t cryp_interrupt_handler(int irq, void *param)
+ } else if (cryp_pending_irq_src(device_data,
+ CRYP_IRQ_SRC_INPUT_FIFO)) {
+ if (ctx->datalen / ctx->blocksize > 0) {
+- for (i = 0 ; i < ctx->blocksize / 4; i++) {
+- writel_relaxed(ctx->indata,
+- &device_data->base->din);
+- ctx->indata += 4;
+- ctx->datalen -= 4;
+- }
++ count = ctx->blocksize / 4;
++
++ writesl(&device_data->base->din, ctx->indata, count);
++
++ ctx->indata += count;
++ ctx->datalen -= count;
+
+ if (ctx->datalen == 0)
+ cryp_disable_irq_src(device_data,
+diff --git a/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c b/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c
+index f926b4caf449..56c60552abba 100644
+--- a/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c
++++ b/drivers/gpu/drm/omapdrm/omap_dmm_tiler.c
+@@ -199,7 +199,7 @@ static struct dmm_txn *dmm_txn_init(struct dmm *dmm, struct tcm *tcm)
+ static void dmm_txn_append(struct dmm_txn *txn, struct pat_area *area,
+ struct page **pages, uint32_t npages, uint32_t roll)
+ {
+- dma_addr_t pat_pa = 0;
++ dma_addr_t pat_pa = 0, data_pa = 0;
+ uint32_t *data;
+ struct pat *pat;
+ struct refill_engine *engine = txn->engine_handle;
+@@ -223,7 +223,9 @@ static void dmm_txn_append(struct dmm_txn *txn, struct pat_area *area,
+ .lut_id = engine->tcm->lut_id,
+ };
+
+- data = alloc_dma(txn, 4*i, &pat->data_pa);
++ data = alloc_dma(txn, 4*i, &data_pa);
++ /* FIXME: what if data_pa is more than 32-bit ? */
++ pat->data_pa = data_pa;
+
+ while (i--) {
+ int n = i + roll;
+diff --git a/drivers/gpu/drm/omapdrm/omap_gem.c b/drivers/gpu/drm/omapdrm/omap_gem.c
+index 95dbce286a41..d9f5e5241af4 100644
+--- a/drivers/gpu/drm/omapdrm/omap_gem.c
++++ b/drivers/gpu/drm/omapdrm/omap_gem.c
+@@ -791,7 +791,7 @@ int omap_gem_get_paddr(struct drm_gem_object *obj,
+ omap_obj->paddr = tiler_ssptr(block);
+ omap_obj->block = block;
+
+- DBG("got paddr: %08x", omap_obj->paddr);
++ DBG("got paddr: %pad", &omap_obj->paddr);
+ }
+
+ omap_obj->paddr_cnt++;
+@@ -985,9 +985,9 @@ void omap_gem_describe(struct drm_gem_object *obj, struct seq_file *m)
+
+ off = drm_vma_node_start(&obj->vma_node);
+
+- seq_printf(m, "%08x: %2d (%2d) %08llx %08Zx (%2d) %p %4d",
++ seq_printf(m, "%08x: %2d (%2d) %08llx %pad (%2d) %p %4d",
+ omap_obj->flags, obj->name, obj->refcount.refcount.counter,
+- off, omap_obj->paddr, omap_obj->paddr_cnt,
++ off, &omap_obj->paddr, omap_obj->paddr_cnt,
+ omap_obj->vaddr, omap_obj->roll);
+
+ if (omap_obj->flags & OMAP_BO_TILED) {
+@@ -1467,8 +1467,8 @@ void omap_gem_init(struct drm_device *dev)
+ entry->paddr = tiler_ssptr(block);
+ entry->block = block;
+
+- DBG("%d:%d: %dx%d: paddr=%08x stride=%d", i, j, w, h,
+- entry->paddr,
++ DBG("%d:%d: %dx%d: paddr=%pad stride=%d", i, j, w, h,
++ &entry->paddr,
+ usergart[i].stride_pfn << PAGE_SHIFT);
+ }
+ }
+diff --git a/drivers/gpu/drm/omapdrm/omap_plane.c b/drivers/gpu/drm/omapdrm/omap_plane.c
+index 3cf31ee59aac..6af3398b5278 100644
+--- a/drivers/gpu/drm/omapdrm/omap_plane.c
++++ b/drivers/gpu/drm/omapdrm/omap_plane.c
+@@ -142,8 +142,8 @@ static void omap_plane_pre_apply(struct omap_drm_apply *apply)
+ DBG("%dx%d -> %dx%d (%d)", info->width, info->height,
+ info->out_width, info->out_height,
+ info->screen_width);
+- DBG("%d,%d %08x %08x", info->pos_x, info->pos_y,
+- info->paddr, info->p_uv_addr);
++ DBG("%d,%d %pad %pad", info->pos_x, info->pos_y,
++ &info->paddr, &info->p_uv_addr);
+
+ /* TODO: */
+ ilace = false;
+diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
+index c0ea66192fe0..767f2cc44bd8 100644
+--- a/drivers/gpu/drm/radeon/cik.c
++++ b/drivers/gpu/drm/radeon/cik.c
+@@ -3320,6 +3320,7 @@ static void cik_gpu_init(struct radeon_device *rdev)
+ (rdev->pdev->device == 0x130B) ||
+ (rdev->pdev->device == 0x130E) ||
+ (rdev->pdev->device == 0x1315) ||
++ (rdev->pdev->device == 0x1318) ||
+ (rdev->pdev->device == 0x131B)) {
+ rdev->config.cik.max_cu_per_sh = 4;
+ rdev->config.cik.max_backends_per_se = 1;
+diff --git a/drivers/hid/hid-cherry.c b/drivers/hid/hid-cherry.c
+index 1bdcccc54a1d..f745d2c1325e 100644
+--- a/drivers/hid/hid-cherry.c
++++ b/drivers/hid/hid-cherry.c
+@@ -28,7 +28,7 @@
+ static __u8 *ch_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 17 && rdesc[11] == 0x3c && rdesc[12] == 0x02) {
++ if (*rsize >= 18 && rdesc[11] == 0x3c && rdesc[12] == 0x02) {
+ hid_info(hdev, "fixing up Cherry Cymotion report descriptor\n");
+ rdesc[11] = rdesc[16] = 0xff;
+ rdesc[12] = rdesc[17] = 0x03;
+diff --git a/drivers/hid/hid-kye.c b/drivers/hid/hid-kye.c
+index e77696367591..b92bf01a1ae8 100644
+--- a/drivers/hid/hid-kye.c
++++ b/drivers/hid/hid-kye.c
+@@ -300,7 +300,7 @@ static __u8 *kye_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ * - change the button usage range to 4-7 for the extra
+ * buttons
+ */
+- if (*rsize >= 74 &&
++ if (*rsize >= 75 &&
+ rdesc[61] == 0x05 && rdesc[62] == 0x08 &&
+ rdesc[63] == 0x19 && rdesc[64] == 0x08 &&
+ rdesc[65] == 0x29 && rdesc[66] == 0x0f &&
+diff --git a/drivers/hid/hid-lg.c b/drivers/hid/hid-lg.c
+index a976f48263f6..f91ff145db9a 100644
+--- a/drivers/hid/hid-lg.c
++++ b/drivers/hid/hid-lg.c
+@@ -345,14 +345,14 @@ static __u8 *lg_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ struct usb_device_descriptor *udesc;
+ __u16 bcdDevice, rev_maj, rev_min;
+
+- if ((drv_data->quirks & LG_RDESC) && *rsize >= 90 && rdesc[83] == 0x26 &&
++ if ((drv_data->quirks & LG_RDESC) && *rsize >= 91 && rdesc[83] == 0x26 &&
+ rdesc[84] == 0x8c && rdesc[85] == 0x02) {
+ hid_info(hdev,
+ "fixing up Logitech keyboard report descriptor\n");
+ rdesc[84] = rdesc[89] = 0x4d;
+ rdesc[85] = rdesc[90] = 0x10;
+ }
+- if ((drv_data->quirks & LG_RDESC_REL_ABS) && *rsize >= 50 &&
++ if ((drv_data->quirks & LG_RDESC_REL_ABS) && *rsize >= 51 &&
+ rdesc[32] == 0x81 && rdesc[33] == 0x06 &&
+ rdesc[49] == 0x81 && rdesc[50] == 0x06) {
+ hid_info(hdev,
+diff --git a/drivers/hid/hid-logitech-dj.c b/drivers/hid/hid-logitech-dj.c
+index 486dbde2ba2d..b7ba82960c79 100644
+--- a/drivers/hid/hid-logitech-dj.c
++++ b/drivers/hid/hid-logitech-dj.c
+@@ -238,13 +238,6 @@ static void logi_dj_recv_add_djhid_device(struct dj_receiver_dev *djrcv_dev,
+ return;
+ }
+
+- if ((dj_report->device_index < DJ_DEVICE_INDEX_MIN) ||
+- (dj_report->device_index > DJ_DEVICE_INDEX_MAX)) {
+- dev_err(&djrcv_hdev->dev, "%s: invalid device index:%d\n",
+- __func__, dj_report->device_index);
+- return;
+- }
+-
+ if (djrcv_dev->paired_dj_devices[dj_report->device_index]) {
+ /* The device is already known. No need to reallocate it. */
+ dbg_hid("%s: device is already known\n", __func__);
+@@ -557,7 +550,7 @@ static int logi_dj_ll_raw_request(struct hid_device *hid,
+ if (!out_buf)
+ return -ENOMEM;
+
+- if (count < DJREPORT_SHORT_LENGTH - 2)
++ if (count > DJREPORT_SHORT_LENGTH - 2)
+ count = DJREPORT_SHORT_LENGTH - 2;
+
+ out_buf[0] = REPORT_ID_DJ_SHORT;
+@@ -690,6 +683,12 @@ static int logi_dj_raw_event(struct hid_device *hdev,
+ * device (via hid_input_report() ) and return 1 so hid-core does not do
+ * anything else with it.
+ */
++ if ((dj_report->device_index < DJ_DEVICE_INDEX_MIN) ||
++ (dj_report->device_index > DJ_DEVICE_INDEX_MAX)) {
++ dev_err(&hdev->dev, "%s: invalid device index:%d\n",
++ __func__, dj_report->device_index);
++ return false;
++ }
+
+ spin_lock_irqsave(&djrcv_dev->lock, flags);
+ if (dj_report->report_id == REPORT_ID_DJ_SHORT) {
+diff --git a/drivers/hid/hid-monterey.c b/drivers/hid/hid-monterey.c
+index 9e14c00eb1b6..25daf28b26bd 100644
+--- a/drivers/hid/hid-monterey.c
++++ b/drivers/hid/hid-monterey.c
+@@ -24,7 +24,7 @@
+ static __u8 *mr_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 30 && rdesc[29] == 0x05 && rdesc[30] == 0x09) {
++ if (*rsize >= 31 && rdesc[29] == 0x05 && rdesc[30] == 0x09) {
+ hid_info(hdev, "fixing up button/consumer in HID report descriptor\n");
+ rdesc[30] = 0x0c;
+ }
+diff --git a/drivers/hid/hid-petalynx.c b/drivers/hid/hid-petalynx.c
+index 736b2502df4f..6aca4f2554bf 100644
+--- a/drivers/hid/hid-petalynx.c
++++ b/drivers/hid/hid-petalynx.c
+@@ -25,7 +25,7 @@
+ static __u8 *pl_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 60 && rdesc[39] == 0x2a && rdesc[40] == 0xf5 &&
++ if (*rsize >= 62 && rdesc[39] == 0x2a && rdesc[40] == 0xf5 &&
+ rdesc[41] == 0x00 && rdesc[59] == 0x26 &&
+ rdesc[60] == 0xf9 && rdesc[61] == 0x00) {
+ hid_info(hdev, "fixing up Petalynx Maxter Remote report descriptor\n");
+diff --git a/drivers/hid/hid-sunplus.c b/drivers/hid/hid-sunplus.c
+index 87fc91e1c8de..91072fa54663 100644
+--- a/drivers/hid/hid-sunplus.c
++++ b/drivers/hid/hid-sunplus.c
+@@ -24,7 +24,7 @@
+ static __u8 *sp_report_fixup(struct hid_device *hdev, __u8 *rdesc,
+ unsigned int *rsize)
+ {
+- if (*rsize >= 107 && rdesc[104] == 0x26 && rdesc[105] == 0x80 &&
++ if (*rsize >= 112 && rdesc[104] == 0x26 && rdesc[105] == 0x80 &&
+ rdesc[106] == 0x03) {
+ hid_info(hdev, "fixing up Sunplus Wireless Desktop report descriptor\n");
+ rdesc[105] = rdesc[110] = 0x03;
+diff --git a/drivers/hwmon/ads1015.c b/drivers/hwmon/ads1015.c
+index 7f9dc2f86b63..126516414c11 100644
+--- a/drivers/hwmon/ads1015.c
++++ b/drivers/hwmon/ads1015.c
+@@ -198,7 +198,7 @@ static int ads1015_get_channels_config_of(struct i2c_client *client)
+ }
+
+ channel = be32_to_cpup(property);
+- if (channel > ADS1015_CHANNELS) {
++ if (channel >= ADS1015_CHANNELS) {
+ dev_err(&client->dev,
+ "invalid channel index %d on %s\n",
+ channel, node->full_name);
+@@ -212,6 +212,7 @@ static int ads1015_get_channels_config_of(struct i2c_client *client)
+ dev_err(&client->dev,
+ "invalid gain on %s\n",
+ node->full_name);
++ return -EINVAL;
+ }
+ }
+
+@@ -222,6 +223,7 @@ static int ads1015_get_channels_config_of(struct i2c_client *client)
+ dev_err(&client->dev,
+ "invalid data_rate on %s\n",
+ node->full_name);
++ return -EINVAL;
+ }
+ }
+
+diff --git a/drivers/hwmon/amc6821.c b/drivers/hwmon/amc6821.c
+index 9f2be3dd28f3..8a67ec6279a4 100644
+--- a/drivers/hwmon/amc6821.c
++++ b/drivers/hwmon/amc6821.c
+@@ -360,11 +360,13 @@ static ssize_t set_pwm1_enable(
+ if (config)
+ return config;
+
++ mutex_lock(&data->update_lock);
+ config = i2c_smbus_read_byte_data(client, AMC6821_REG_CONF1);
+ if (config < 0) {
+ dev_err(&client->dev,
+ "Error reading configuration register, aborting.\n");
+- return config;
++ count = config;
++ goto unlock;
+ }
+
+ switch (val) {
+@@ -381,14 +383,15 @@ static ssize_t set_pwm1_enable(
+ config |= AMC6821_CONF1_FDRC1;
+ break;
+ default:
+- return -EINVAL;
++ count = -EINVAL;
++ goto unlock;
+ }
+- mutex_lock(&data->update_lock);
+ if (i2c_smbus_write_byte_data(client, AMC6821_REG_CONF1, config)) {
+ dev_err(&client->dev,
+ "Configuration register write error, aborting.\n");
+ count = -EIO;
+ }
++unlock:
+ mutex_unlock(&data->update_lock);
+ return count;
+ }
+@@ -493,8 +496,9 @@ static ssize_t set_temp_auto_point_temp(
+ return -EINVAL;
+ }
+
+- data->valid = 0;
+ mutex_lock(&data->update_lock);
++ data->valid = 0;
++
+ switch (ix) {
+ case 0:
+ ptemp[0] = clamp_val(val / 1000, 0,
+@@ -658,13 +662,14 @@ static ssize_t set_fan1_div(
+ if (config)
+ return config;
+
++ mutex_lock(&data->update_lock);
+ config = i2c_smbus_read_byte_data(client, AMC6821_REG_CONF4);
+ if (config < 0) {
+ dev_err(&client->dev,
+ "Error reading configuration register, aborting.\n");
+- return config;
++ count = config;
++ goto EXIT;
+ }
+- mutex_lock(&data->update_lock);
+ switch (val) {
+ case 2:
+ config &= ~AMC6821_CONF4_PSPR;
+diff --git a/drivers/hwmon/dme1737.c b/drivers/hwmon/dme1737.c
+index 4ae3fff13f44..bea0a344fab5 100644
+--- a/drivers/hwmon/dme1737.c
++++ b/drivers/hwmon/dme1737.c
+@@ -247,8 +247,8 @@ struct dme1737_data {
+ u8 pwm_acz[3];
+ u8 pwm_freq[6];
+ u8 pwm_rr[2];
+- u8 zone_low[3];
+- u8 zone_abs[3];
++ s8 zone_low[3];
++ s8 zone_abs[3];
+ u8 zone_hyst[2];
+ u32 alarms;
+ };
+@@ -277,7 +277,7 @@ static inline int IN_FROM_REG(int reg, int nominal, int res)
+ return (reg * nominal + (3 << (res - 3))) / (3 << (res - 2));
+ }
+
+-static inline int IN_TO_REG(int val, int nominal)
++static inline int IN_TO_REG(long val, int nominal)
+ {
+ return clamp_val((val * 192 + nominal / 2) / nominal, 0, 255);
+ }
+@@ -293,7 +293,7 @@ static inline int TEMP_FROM_REG(int reg, int res)
+ return (reg * 1000) >> (res - 8);
+ }
+
+-static inline int TEMP_TO_REG(int val)
++static inline int TEMP_TO_REG(long val)
+ {
+ return clamp_val((val < 0 ? val - 500 : val + 500) / 1000, -128, 127);
+ }
+@@ -308,7 +308,7 @@ static inline int TEMP_RANGE_FROM_REG(int reg)
+ return TEMP_RANGE[(reg >> 4) & 0x0f];
+ }
+
+-static int TEMP_RANGE_TO_REG(int val, int reg)
++static int TEMP_RANGE_TO_REG(long val, int reg)
+ {
+ int i;
+
+@@ -331,7 +331,7 @@ static inline int TEMP_HYST_FROM_REG(int reg, int ix)
+ return (((ix == 1) ? reg : reg >> 4) & 0x0f) * 1000;
+ }
+
+-static inline int TEMP_HYST_TO_REG(int val, int ix, int reg)
++static inline int TEMP_HYST_TO_REG(long val, int ix, int reg)
+ {
+ int hyst = clamp_val((val + 500) / 1000, 0, 15);
+
+@@ -347,7 +347,7 @@ static inline int FAN_FROM_REG(int reg, int tpc)
+ return (reg == 0 || reg == 0xffff) ? 0 : 90000 * 60 / reg;
+ }
+
+-static inline int FAN_TO_REG(int val, int tpc)
++static inline int FAN_TO_REG(long val, int tpc)
+ {
+ if (tpc) {
+ return clamp_val(val / tpc, 0, 0xffff);
+@@ -379,7 +379,7 @@ static inline int FAN_TYPE_FROM_REG(int reg)
+ return (edge > 0) ? 1 << (edge - 1) : 0;
+ }
+
+-static inline int FAN_TYPE_TO_REG(int val, int reg)
++static inline int FAN_TYPE_TO_REG(long val, int reg)
+ {
+ int edge = (val == 4) ? 3 : val;
+
+@@ -402,7 +402,7 @@ static int FAN_MAX_FROM_REG(int reg)
+ return 1000 + i * 500;
+ }
+
+-static int FAN_MAX_TO_REG(int val)
++static int FAN_MAX_TO_REG(long val)
+ {
+ int i;
+
+@@ -460,7 +460,7 @@ static inline int PWM_ACZ_FROM_REG(int reg)
+ return acz[(reg >> 5) & 0x07];
+ }
+
+-static inline int PWM_ACZ_TO_REG(int val, int reg)
++static inline int PWM_ACZ_TO_REG(long val, int reg)
+ {
+ int acz = (val == 4) ? 2 : val - 1;
+
+@@ -476,7 +476,7 @@ static inline int PWM_FREQ_FROM_REG(int reg)
+ return PWM_FREQ[reg & 0x0f];
+ }
+
+-static int PWM_FREQ_TO_REG(int val, int reg)
++static int PWM_FREQ_TO_REG(long val, int reg)
+ {
+ int i;
+
+@@ -510,7 +510,7 @@ static inline int PWM_RR_FROM_REG(int reg, int ix)
+ return (rr & 0x08) ? PWM_RR[rr & 0x07] : 0;
+ }
+
+-static int PWM_RR_TO_REG(int val, int ix, int reg)
++static int PWM_RR_TO_REG(long val, int ix, int reg)
+ {
+ int i;
+
+@@ -528,7 +528,7 @@ static inline int PWM_RR_EN_FROM_REG(int reg, int ix)
+ return PWM_RR_FROM_REG(reg, ix) ? 1 : 0;
+ }
+
+-static inline int PWM_RR_EN_TO_REG(int val, int ix, int reg)
++static inline int PWM_RR_EN_TO_REG(long val, int ix, int reg)
+ {
+ int en = (ix == 1) ? 0x80 : 0x08;
+
+@@ -1481,13 +1481,16 @@ static ssize_t set_vrm(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+ {
+ struct dme1737_data *data = dev_get_drvdata(dev);
+- long val;
++ unsigned long val;
+ int err;
+
+- err = kstrtol(buf, 10, &val);
++ err = kstrtoul(buf, 10, &val);
+ if (err)
+ return err;
+
++ if (val > 255)
++ return -EINVAL;
++
+ data->vrm = val;
+ return count;
+ }
+diff --git a/drivers/hwmon/gpio-fan.c b/drivers/hwmon/gpio-fan.c
+index 2566c43dd1e9..d10aa7b46cca 100644
+--- a/drivers/hwmon/gpio-fan.c
++++ b/drivers/hwmon/gpio-fan.c
+@@ -173,7 +173,7 @@ static int get_fan_speed_index(struct gpio_fan_data *fan_data)
+ return -ENODEV;
+ }
+
+-static int rpm_to_speed_index(struct gpio_fan_data *fan_data, int rpm)
++static int rpm_to_speed_index(struct gpio_fan_data *fan_data, unsigned long rpm)
+ {
+ struct gpio_fan_speed *speed = fan_data->speed;
+ int i;
+diff --git a/drivers/hwmon/lm78.c b/drivers/hwmon/lm78.c
+index 9efadfc851bc..c1eb464f0fd0 100644
+--- a/drivers/hwmon/lm78.c
++++ b/drivers/hwmon/lm78.c
+@@ -108,7 +108,7 @@ static inline int FAN_FROM_REG(u8 val, int div)
+ * TEMP: mC (-128C to +127C)
+ * REG: 1C/bit, two's complement
+ */
+-static inline s8 TEMP_TO_REG(int val)
++static inline s8 TEMP_TO_REG(long val)
+ {
+ int nval = clamp_val(val, -128000, 127000) ;
+ return nval < 0 ? (nval - 500) / 1000 : (nval + 500) / 1000;
+diff --git a/drivers/hwmon/lm85.c b/drivers/hwmon/lm85.c
+index b0129a54e1a6..ef627ea71cc8 100644
+--- a/drivers/hwmon/lm85.c
++++ b/drivers/hwmon/lm85.c
+@@ -155,7 +155,7 @@ static inline u16 FAN_TO_REG(unsigned long val)
+
+ /* Temperature is reported in .001 degC increments */
+ #define TEMP_TO_REG(val) \
+- clamp_val(SCALE(val, 1000, 1), -127, 127)
++ DIV_ROUND_CLOSEST(clamp_val((val), -127000, 127000), 1000)
+ #define TEMPEXT_FROM_REG(val, ext) \
+ SCALE(((val) << 4) + (ext), 16, 1000)
+ #define TEMP_FROM_REG(val) ((val) * 1000)
+@@ -189,7 +189,7 @@ static const int lm85_range_map[] = {
+ 13300, 16000, 20000, 26600, 32000, 40000, 53300, 80000
+ };
+
+-static int RANGE_TO_REG(int range)
++static int RANGE_TO_REG(long range)
+ {
+ int i;
+
+@@ -211,7 +211,7 @@ static const int adm1027_freq_map[8] = { /* 1 Hz */
+ 11, 15, 22, 29, 35, 44, 59, 88
+ };
+
+-static int FREQ_TO_REG(const int *map, int freq)
++static int FREQ_TO_REG(const int *map, unsigned long freq)
+ {
+ int i;
+
+@@ -460,6 +460,9 @@ static ssize_t store_vrm_reg(struct device *dev, struct device_attribute *attr,
+ if (err)
+ return err;
+
++ if (val > 255)
++ return -EINVAL;
++
+ data->vrm = val;
+ return count;
+ }
+diff --git a/drivers/hwmon/lm92.c b/drivers/hwmon/lm92.c
+index d2060e245ff5..cfaf70b9cba7 100644
+--- a/drivers/hwmon/lm92.c
++++ b/drivers/hwmon/lm92.c
+@@ -74,12 +74,9 @@ static inline int TEMP_FROM_REG(s16 reg)
+ return reg / 8 * 625 / 10;
+ }
+
+-static inline s16 TEMP_TO_REG(int val)
++static inline s16 TEMP_TO_REG(long val)
+ {
+- if (val <= -60000)
+- return -60000 * 10 / 625 * 8;
+- if (val >= 160000)
+- return 160000 * 10 / 625 * 8;
++ val = clamp_val(val, -60000, 160000);
+ return val * 10 / 625 * 8;
+ }
+
+@@ -206,10 +203,12 @@ static ssize_t set_temp_hyst(struct device *dev,
+ if (err)
+ return err;
+
++ val = clamp_val(val, -120000, 220000);
+ mutex_lock(&data->update_lock);
+- data->temp[t_hyst] = TEMP_FROM_REG(data->temp[attr->index]) - val;
++ data->temp[t_hyst] =
++ TEMP_TO_REG(TEMP_FROM_REG(data->temp[attr->index]) - val);
+ i2c_smbus_write_word_swapped(client, LM92_REG_TEMP_HYST,
+- TEMP_TO_REG(data->temp[t_hyst]));
++ data->temp[t_hyst]);
+ mutex_unlock(&data->update_lock);
+ return count;
+ }
+diff --git a/drivers/hwmon/sis5595.c b/drivers/hwmon/sis5595.c
+index 3532026e25da..bf1d7893d51c 100644
+--- a/drivers/hwmon/sis5595.c
++++ b/drivers/hwmon/sis5595.c
+@@ -159,7 +159,7 @@ static inline int TEMP_FROM_REG(s8 val)
+ {
+ return val * 830 + 52120;
+ }
+-static inline s8 TEMP_TO_REG(int val)
++static inline s8 TEMP_TO_REG(long val)
+ {
+ int nval = clamp_val(val, -54120, 157530) ;
+ return nval < 0 ? (nval - 5212 - 415) / 830 : (nval - 5212 + 415) / 830;
+diff --git a/drivers/i2c/busses/i2c-at91.c b/drivers/i2c/busses/i2c-at91.c
+index e95f9ba96790..83c989382be9 100644
+--- a/drivers/i2c/busses/i2c-at91.c
++++ b/drivers/i2c/busses/i2c-at91.c
+@@ -210,7 +210,7 @@ static void at91_twi_write_data_dma_callback(void *data)
+ struct at91_twi_dev *dev = (struct at91_twi_dev *)data;
+
+ dma_unmap_single(dev->dev, sg_dma_address(&dev->dma.sg),
+- dev->buf_len, DMA_MEM_TO_DEV);
++ dev->buf_len, DMA_TO_DEVICE);
+
+ at91_twi_write(dev, AT91_TWI_CR, AT91_TWI_STOP);
+ }
+@@ -289,7 +289,7 @@ static void at91_twi_read_data_dma_callback(void *data)
+ struct at91_twi_dev *dev = (struct at91_twi_dev *)data;
+
+ dma_unmap_single(dev->dev, sg_dma_address(&dev->dma.sg),
+- dev->buf_len, DMA_DEV_TO_MEM);
++ dev->buf_len, DMA_FROM_DEVICE);
+
+ /* The last two bytes have to be read without using dma */
+ dev->buf += dev->buf_len - 2;
+diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
+index a9791509966a..69e11853e8bf 100644
+--- a/drivers/i2c/busses/i2c-rk3x.c
++++ b/drivers/i2c/busses/i2c-rk3x.c
+@@ -399,7 +399,7 @@ static irqreturn_t rk3x_i2c_irq(int irqno, void *dev_id)
+ }
+
+ /* is there anything left to handle? */
+- if (unlikely(ipd == 0))
++ if (unlikely((ipd & REG_INT_ALL) == 0))
+ goto out;
+
+ switch (i2c->state) {
+diff --git a/drivers/misc/mei/client.c b/drivers/misc/mei/client.c
+index 59d20c599b16..2da05c0e113d 100644
+--- a/drivers/misc/mei/client.c
++++ b/drivers/misc/mei/client.c
+@@ -459,7 +459,7 @@ int mei_cl_disconnect(struct mei_cl *cl)
+ {
+ struct mei_device *dev;
+ struct mei_cl_cb *cb;
+- int rets, err;
++ int rets;
+
+ if (WARN_ON(!cl || !cl->dev))
+ return -ENODEV;
+@@ -491,6 +491,7 @@ int mei_cl_disconnect(struct mei_cl *cl)
+ cl_err(dev, cl, "failed to disconnect.\n");
+ goto free;
+ }
++ cl->timer_count = MEI_CONNECT_TIMEOUT;
+ mdelay(10); /* Wait for hardware disconnection ready */
+ list_add_tail(&cb->list, &dev->ctrl_rd_list.list);
+ } else {
+@@ -500,23 +501,18 @@ int mei_cl_disconnect(struct mei_cl *cl)
+ }
+ mutex_unlock(&dev->device_lock);
+
+- err = wait_event_timeout(dev->wait_recvd_msg,
++ wait_event_timeout(dev->wait_recvd_msg,
+ MEI_FILE_DISCONNECTED == cl->state,
+ mei_secs_to_jiffies(MEI_CL_CONNECT_TIMEOUT));
+
+ mutex_lock(&dev->device_lock);
++
+ if (MEI_FILE_DISCONNECTED == cl->state) {
+ rets = 0;
+ cl_dbg(dev, cl, "successfully disconnected from FW client.\n");
+ } else {
+- rets = -ENODEV;
+- if (MEI_FILE_DISCONNECTED != cl->state)
+- cl_err(dev, cl, "wrong status client disconnect.\n");
+-
+- if (err)
+- cl_dbg(dev, cl, "wait failed disconnect err=%d\n", err);
+-
+- cl_err(dev, cl, "failed to disconnect from FW client.\n");
++ cl_dbg(dev, cl, "timeout on disconnect from FW client.\n");
++ rets = -ETIME;
+ }
+
+ mei_io_list_flush(&dev->ctrl_rd_list, cl);
+@@ -605,6 +601,7 @@ int mei_cl_connect(struct mei_cl *cl, struct file *file)
+ cl->timer_count = MEI_CONNECT_TIMEOUT;
+ list_add_tail(&cb->list, &dev->ctrl_rd_list.list);
+ } else {
++ cl->state = MEI_FILE_INITIALIZING;
+ list_add_tail(&cb->list, &dev->ctrl_wr_list.list);
+ }
+
+@@ -616,6 +613,7 @@ int mei_cl_connect(struct mei_cl *cl, struct file *file)
+ mutex_lock(&dev->device_lock);
+
+ if (cl->state != MEI_FILE_CONNECTED) {
++ cl->state = MEI_FILE_DISCONNECTED;
+ /* something went really wrong */
+ if (!cl->status)
+ cl->status = -EFAULT;
+diff --git a/drivers/misc/mei/nfc.c b/drivers/misc/mei/nfc.c
+index 3095fc514a65..5ccc23bc7690 100644
+--- a/drivers/misc/mei/nfc.c
++++ b/drivers/misc/mei/nfc.c
+@@ -342,9 +342,10 @@ static int mei_nfc_send(struct mei_cl_device *cldev, u8 *buf, size_t length)
+ ndev = (struct mei_nfc_dev *) cldev->priv_data;
+ dev = ndev->cl->dev;
+
++ err = -ENOMEM;
+ mei_buf = kzalloc(length + MEI_NFC_HEADER_SIZE, GFP_KERNEL);
+ if (!mei_buf)
+- return -ENOMEM;
++ goto out;
+
+ hdr = (struct mei_nfc_hci_hdr *) mei_buf;
+ hdr->cmd = MEI_NFC_CMD_HCI_SEND;
+@@ -354,12 +355,9 @@ static int mei_nfc_send(struct mei_cl_device *cldev, u8 *buf, size_t length)
+ hdr->data_size = length;
+
+ memcpy(mei_buf + MEI_NFC_HEADER_SIZE, buf, length);
+-
+ err = __mei_cl_send(ndev->cl, mei_buf, length + MEI_NFC_HEADER_SIZE);
+ if (err < 0)
+- return err;
+-
+- kfree(mei_buf);
++ goto out;
+
+ if (!wait_event_interruptible_timeout(ndev->send_wq,
+ ndev->recv_req_id == ndev->req_id, HZ)) {
+@@ -368,7 +366,8 @@ static int mei_nfc_send(struct mei_cl_device *cldev, u8 *buf, size_t length)
+ } else {
+ ndev->req_id++;
+ }
+-
++out:
++ kfree(mei_buf);
+ return err;
+ }
+
+diff --git a/drivers/misc/mei/pci-me.c b/drivers/misc/mei/pci-me.c
+index 1b46c64a649f..4b821b4360e1 100644
+--- a/drivers/misc/mei/pci-me.c
++++ b/drivers/misc/mei/pci-me.c
+@@ -369,7 +369,7 @@ static int mei_me_pm_runtime_idle(struct device *device)
+ if (!dev)
+ return -ENODEV;
+ if (mei_write_is_idle(dev))
+- pm_schedule_suspend(device, MEI_ME_RPM_TIMEOUT * 2);
++ pm_runtime_autosuspend(device);
+
+ return -EBUSY;
+ }
+diff --git a/drivers/misc/mei/pci-txe.c b/drivers/misc/mei/pci-txe.c
+index 2343c6236df9..32fef4d5b0b6 100644
+--- a/drivers/misc/mei/pci-txe.c
++++ b/drivers/misc/mei/pci-txe.c
+@@ -306,7 +306,7 @@ static int mei_txe_pm_runtime_idle(struct device *device)
+ if (!dev)
+ return -ENODEV;
+ if (mei_write_is_idle(dev))
+- pm_schedule_suspend(device, MEI_TXI_RPM_TIMEOUT * 2);
++ pm_runtime_autosuspend(device);
+
+ return -EBUSY;
+ }
+diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
+index 7ad463e9741c..249ab80cbb45 100644
+--- a/drivers/mmc/host/mmci.c
++++ b/drivers/mmc/host/mmci.c
+@@ -834,6 +834,10 @@ static void
+ mmci_data_irq(struct mmci_host *host, struct mmc_data *data,
+ unsigned int status)
+ {
++ /* Make sure we have data to handle */
++ if (!data)
++ return;
++
+ /* First check for errors */
+ if (status & (MCI_DATACRCFAIL|MCI_DATATIMEOUT|MCI_STARTBITERR|
+ MCI_TXUNDERRUN|MCI_RXOVERRUN)) {
+@@ -902,9 +906,17 @@ mmci_cmd_irq(struct mmci_host *host, struct mmc_command *cmd,
+ unsigned int status)
+ {
+ void __iomem *base = host->base;
+- bool sbc = (cmd == host->mrq->sbc);
+- bool busy_resp = host->variant->busy_detect &&
+- (cmd->flags & MMC_RSP_BUSY);
++ bool sbc, busy_resp;
++
++ if (!cmd)
++ return;
++
++ sbc = (cmd == host->mrq->sbc);
++ busy_resp = host->variant->busy_detect && (cmd->flags & MMC_RSP_BUSY);
++
++ if (!((status|host->busy_status) & (MCI_CMDCRCFAIL|MCI_CMDTIMEOUT|
++ MCI_CMDSENT|MCI_CMDRESPEND)))
++ return;
+
+ /* Check if we need to wait for busy completion. */
+ if (host->busy_status && (status & MCI_ST_CARDBUSY))
+@@ -1132,9 +1144,6 @@ static irqreturn_t mmci_irq(int irq, void *dev_id)
+ spin_lock(&host->lock);
+
+ do {
+- struct mmc_command *cmd;
+- struct mmc_data *data;
+-
+ status = readl(host->base + MMCISTATUS);
+
+ if (host->singleirq) {
+@@ -1154,16 +1163,8 @@ static irqreturn_t mmci_irq(int irq, void *dev_id)
+
+ dev_dbg(mmc_dev(host->mmc), "irq0 (data+cmd) %08x\n", status);
+
+- cmd = host->cmd;
+- if ((status|host->busy_status) & (MCI_CMDCRCFAIL|MCI_CMDTIMEOUT|
+- MCI_CMDSENT|MCI_CMDRESPEND) && cmd)
+- mmci_cmd_irq(host, cmd, status);
+-
+- data = host->data;
+- if (status & (MCI_DATACRCFAIL|MCI_DATATIMEOUT|MCI_STARTBITERR|
+- MCI_TXUNDERRUN|MCI_RXOVERRUN|MCI_DATAEND|
+- MCI_DATABLOCKEND) && data)
+- mmci_data_irq(host, data, status);
++ mmci_cmd_irq(host, host->cmd, status);
++ mmci_data_irq(host, host->data, status);
+
+ /* Don't poll for busy completion in irq context. */
+ if (host->busy_status)
+diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
+index 42914e04d110..056841651a80 100644
+--- a/drivers/pci/hotplug/pciehp_hpc.c
++++ b/drivers/pci/hotplug/pciehp_hpc.c
+@@ -794,7 +794,7 @@ struct controller *pcie_init(struct pcie_device *dev)
+ pcie_capability_write_word(pdev, PCI_EXP_SLTSTA,
+ PCI_EXP_SLTSTA_ABP | PCI_EXP_SLTSTA_PFD |
+ PCI_EXP_SLTSTA_MRLSC | PCI_EXP_SLTSTA_PDC |
+- PCI_EXP_SLTSTA_CC);
++ PCI_EXP_SLTSTA_CC | PCI_EXP_SLTSTA_DLLSC);
+
+ /* Disable software notification */
+ pcie_disable_notification(ctrl);
+diff --git a/drivers/pci/pci-label.c b/drivers/pci/pci-label.c
+index a3fbe2012ea3..2ab1b47c7651 100644
+--- a/drivers/pci/pci-label.c
++++ b/drivers/pci/pci-label.c
+@@ -161,8 +161,8 @@ enum acpi_attr_enum {
+ static void dsm_label_utf16s_to_utf8s(union acpi_object *obj, char *buf)
+ {
+ int len;
+- len = utf16s_to_utf8s((const wchar_t *)obj->string.pointer,
+- obj->string.length,
++ len = utf16s_to_utf8s((const wchar_t *)obj->buffer.pointer,
++ obj->buffer.length,
+ UTF16_LITTLE_ENDIAN,
+ buf, PAGE_SIZE);
+ buf[len] = '\n';
+@@ -187,16 +187,22 @@ static int dsm_get_label(struct device *dev, char *buf,
+ tmp = obj->package.elements;
+ if (obj->type == ACPI_TYPE_PACKAGE && obj->package.count == 2 &&
+ tmp[0].type == ACPI_TYPE_INTEGER &&
+- tmp[1].type == ACPI_TYPE_STRING) {
++ (tmp[1].type == ACPI_TYPE_STRING ||
++ tmp[1].type == ACPI_TYPE_BUFFER)) {
+ /*
+ * The second string element is optional even when
+ * this _DSM is implemented; when not implemented,
+ * this entry must return a null string.
+ */
+- if (attr == ACPI_ATTR_INDEX_SHOW)
++ if (attr == ACPI_ATTR_INDEX_SHOW) {
+ scnprintf(buf, PAGE_SIZE, "%llu\n", tmp->integer.value);
+- else if (attr == ACPI_ATTR_LABEL_SHOW)
+- dsm_label_utf16s_to_utf8s(tmp + 1, buf);
++ } else if (attr == ACPI_ATTR_LABEL_SHOW) {
++ if (tmp[1].type == ACPI_TYPE_STRING)
++ scnprintf(buf, PAGE_SIZE, "%s\n",
++ tmp[1].string.pointer);
++ else if (tmp[1].type == ACPI_TYPE_BUFFER)
++ dsm_label_utf16s_to_utf8s(tmp + 1, buf);
++ }
+ len = strlen(buf) > 0 ? strlen(buf) : -1;
+ }
+
+diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
+index 1c8592b0e146..81d49d3ab221 100644
+--- a/drivers/pci/pci.c
++++ b/drivers/pci/pci.c
+@@ -839,12 +839,6 @@ int pci_set_power_state(struct pci_dev *dev, pci_power_t state)
+
+ if (!__pci_complete_power_transition(dev, state))
+ error = 0;
+- /*
+- * When aspm_policy is "powersave" this call ensures
+- * that ASPM is configured.
+- */
+- if (!error && dev->bus->self)
+- pcie_aspm_powersave_config_link(dev->bus->self);
+
+ return error;
+ }
+@@ -1195,12 +1189,18 @@ int __weak pcibios_enable_device(struct pci_dev *dev, int bars)
+ static int do_pci_enable_device(struct pci_dev *dev, int bars)
+ {
+ int err;
++ struct pci_dev *bridge;
+ u16 cmd;
+ u8 pin;
+
+ err = pci_set_power_state(dev, PCI_D0);
+ if (err < 0 && err != -EIO)
+ return err;
++
++ bridge = pci_upstream_bridge(dev);
++ if (bridge)
++ pcie_aspm_powersave_config_link(bridge);
++
+ err = pcibios_enable_device(dev, bars);
+ if (err < 0)
+ return err;
+diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
+index caed1ce6facd..481c4e18693a 100644
+--- a/drivers/pci/setup-res.c
++++ b/drivers/pci/setup-res.c
+@@ -320,9 +320,11 @@ int pci_reassign_resource(struct pci_dev *dev, int resno, resource_size_t addsiz
+ resource_size_t min_align)
+ {
+ struct resource *res = dev->resource + resno;
++ unsigned long flags;
+ resource_size_t new_size;
+ int ret;
+
++ flags = res->flags;
+ res->flags |= IORESOURCE_UNSET;
+ if (!res->parent) {
+ dev_info(&dev->dev, "BAR %d: can't reassign an unassigned resource %pR\n",
+@@ -339,7 +341,12 @@ int pci_reassign_resource(struct pci_dev *dev, int resno, resource_size_t addsiz
+ dev_info(&dev->dev, "BAR %d: reassigned %pR\n", resno, res);
+ if (resno < PCI_BRIDGE_RESOURCES)
+ pci_update_resource(dev, resno);
++ } else {
++ res->flags = flags;
++ dev_info(&dev->dev, "BAR %d: %pR (failed to expand by %#llx)\n",
++ resno, res, (unsigned long long) addsize);
+ }
++
+ return ret;
+ }
+
+diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
+index 31184b35370f..489e83b6b5e1 100644
+--- a/drivers/scsi/hpsa.c
++++ b/drivers/scsi/hpsa.c
+@@ -5092,7 +5092,7 @@ static int hpsa_big_passthru_ioctl(struct ctlr_info *h, void __user *argp)
+ }
+ if (ioc->Request.Type.Direction & XFER_WRITE) {
+ if (copy_from_user(buff[sg_used], data_ptr, sz)) {
+- status = -ENOMEM;
++ status = -EFAULT;
+ goto cleanup1;
+ }
+ } else
+@@ -6365,9 +6365,9 @@ static inline void hpsa_set_driver_support_bits(struct ctlr_info *h)
+ {
+ u32 driver_support;
+
+-#ifdef CONFIG_X86
+- /* Need to enable prefetch in the SCSI core for 6400 in x86 */
+ driver_support = readl(&(h->cfgtable->driver_support));
++ /* Need to enable prefetch in the SCSI core for 6400 in x86 */
++#ifdef CONFIG_X86
+ driver_support |= ENABLE_SCSI_PREFETCH;
+ #endif
+ driver_support |= ENABLE_UNIT_ATTN;
+diff --git a/drivers/staging/et131x/et131x.c b/drivers/staging/et131x/et131x.c
+index 08356b6955a4..2d36eac6889c 100644
+--- a/drivers/staging/et131x/et131x.c
++++ b/drivers/staging/et131x/et131x.c
+@@ -1423,22 +1423,16 @@ static int et131x_mii_read(struct et131x_adapter *adapter, u8 reg, u16 *value)
+ * @reg: the register to read
+ * @value: 16-bit value to write
+ */
+-static int et131x_mii_write(struct et131x_adapter *adapter, u8 reg, u16 value)
++static int et131x_mii_write(struct et131x_adapter *adapter, u8 addr, u8 reg,
++ u16 value)
+ {
+ struct mac_regs __iomem *mac = &adapter->regs->mac;
+- struct phy_device *phydev = adapter->phydev;
+ int status = 0;
+- u8 addr;
+ u32 delay = 0;
+ u32 mii_addr;
+ u32 mii_cmd;
+ u32 mii_indicator;
+
+- if (!phydev)
+- return -EIO;
+-
+- addr = phydev->addr;
+-
+ /* Save a local copy of the registers we are dealing with so we can
+ * set them back
+ */
+@@ -1633,17 +1627,7 @@ static int et131x_mdio_write(struct mii_bus *bus, int phy_addr,
+ struct net_device *netdev = bus->priv;
+ struct et131x_adapter *adapter = netdev_priv(netdev);
+
+- return et131x_mii_write(adapter, reg, value);
+-}
+-
+-static int et131x_mdio_reset(struct mii_bus *bus)
+-{
+- struct net_device *netdev = bus->priv;
+- struct et131x_adapter *adapter = netdev_priv(netdev);
+-
+- et131x_mii_write(adapter, MII_BMCR, BMCR_RESET);
+-
+- return 0;
++ return et131x_mii_write(adapter, phy_addr, reg, value);
+ }
+
+ /* et1310_phy_power_switch - PHY power control
+@@ -1658,18 +1642,20 @@ static int et131x_mdio_reset(struct mii_bus *bus)
+ static void et1310_phy_power_switch(struct et131x_adapter *adapter, bool down)
+ {
+ u16 data;
++ struct phy_device *phydev = adapter->phydev;
+
+ et131x_mii_read(adapter, MII_BMCR, &data);
+ data &= ~BMCR_PDOWN;
+ if (down)
+ data |= BMCR_PDOWN;
+- et131x_mii_write(adapter, MII_BMCR, data);
++ et131x_mii_write(adapter, phydev->addr, MII_BMCR, data);
+ }
+
+ /* et131x_xcvr_init - Init the phy if we are setting it into force mode */
+ static void et131x_xcvr_init(struct et131x_adapter *adapter)
+ {
+ u16 lcr2;
++ struct phy_device *phydev = adapter->phydev;
+
+ /* Set the LED behavior such that LED 1 indicates speed (off =
+ * 10Mbits, blink = 100Mbits, on = 1000Mbits) and LED 2 indicates
+@@ -1690,7 +1676,7 @@ static void et131x_xcvr_init(struct et131x_adapter *adapter)
+ else
+ lcr2 |= (LED_VAL_LINKON << LED_TXRX_SHIFT);
+
+- et131x_mii_write(adapter, PHY_LED_2, lcr2);
++ et131x_mii_write(adapter, phydev->addr, PHY_LED_2, lcr2);
+ }
+ }
+
+@@ -3645,14 +3631,14 @@ static void et131x_adjust_link(struct net_device *netdev)
+
+ et131x_mii_read(adapter, PHY_MPHY_CONTROL_REG,
+ &register18);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18 | 0x4);
+- et131x_mii_write(adapter, PHY_INDEX_REG,
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18 | 0x4);
++ et131x_mii_write(adapter, phydev->addr, PHY_INDEX_REG,
+ register18 | 0x8402);
+- et131x_mii_write(adapter, PHY_DATA_REG,
++ et131x_mii_write(adapter, phydev->addr, PHY_DATA_REG,
+ register18 | 511);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18);
+ }
+
+ et1310_config_flow_control(adapter);
+@@ -3664,7 +3650,8 @@ static void et131x_adjust_link(struct net_device *netdev)
+ et131x_mii_read(adapter, PHY_CONFIG, &reg);
+ reg &= ~ET_PHY_CONFIG_TX_FIFO_DEPTH;
+ reg |= ET_PHY_CONFIG_FIFO_DEPTH_32;
+- et131x_mii_write(adapter, PHY_CONFIG, reg);
++ et131x_mii_write(adapter, phydev->addr, PHY_CONFIG,
++ reg);
+ }
+
+ et131x_set_rx_dma_timer(adapter);
+@@ -3677,14 +3664,14 @@ static void et131x_adjust_link(struct net_device *netdev)
+
+ et131x_mii_read(adapter, PHY_MPHY_CONTROL_REG,
+ &register18);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18 | 0x4);
+- et131x_mii_write(adapter, PHY_INDEX_REG,
+- register18 | 0x8402);
+- et131x_mii_write(adapter, PHY_DATA_REG,
+- register18 | 511);
+- et131x_mii_write(adapter, PHY_MPHY_CONTROL_REG,
+- register18);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18 | 0x4);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_INDEX_REG, register18 | 0x8402);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_DATA_REG, register18 | 511);
++ et131x_mii_write(adapter, phydev->addr,
++ PHY_MPHY_CONTROL_REG, register18);
+ }
+
+ /* Free the packets being actively sent & stopped */
+@@ -4646,10 +4633,6 @@ static int et131x_pci_setup(struct pci_dev *pdev,
+ /* Copy address into the net_device struct */
+ memcpy(netdev->dev_addr, adapter->addr, ETH_ALEN);
+
+- /* Init variable for counting how long we do not have link status */
+- adapter->boot_coma = 0;
+- et1310_disable_phy_coma(adapter);
+-
+ rc = -ENOMEM;
+
+ /* Setup the mii_bus struct */
+@@ -4665,7 +4648,6 @@ static int et131x_pci_setup(struct pci_dev *pdev,
+ adapter->mii_bus->priv = netdev;
+ adapter->mii_bus->read = et131x_mdio_read;
+ adapter->mii_bus->write = et131x_mdio_write;
+- adapter->mii_bus->reset = et131x_mdio_reset;
+ adapter->mii_bus->irq = kmalloc_array(PHY_MAX_ADDR, sizeof(int),
+ GFP_KERNEL);
+ if (!adapter->mii_bus->irq)
+@@ -4689,6 +4671,10 @@ static int et131x_pci_setup(struct pci_dev *pdev,
+ /* Setup et1310 as per the documentation */
+ et131x_adapter_setup(adapter);
+
++ /* Init variable for counting how long we do not have link status */
++ adapter->boot_coma = 0;
++ et1310_disable_phy_coma(adapter);
++
+ /* We can enable interrupts now
+ *
+ * NOTE - Because registration of interrupt handler is done in the
+diff --git a/drivers/staging/lustre/lustre/obdclass/class_obd.c b/drivers/staging/lustre/lustre/obdclass/class_obd.c
+index dde04b767a6d..b16687625c44 100644
+--- a/drivers/staging/lustre/lustre/obdclass/class_obd.c
++++ b/drivers/staging/lustre/lustre/obdclass/class_obd.c
+@@ -35,7 +35,7 @@
+ */
+
+ #define DEBUG_SUBSYSTEM S_CLASS
+-# include <asm/atomic.h>
++# include <linux/atomic.h>
+
+ #include <obd_support.h>
+ #include <obd_class.h>
+diff --git a/drivers/staging/rtl8188eu/os_dep/usb_intf.c b/drivers/staging/rtl8188eu/os_dep/usb_intf.c
+index 7526b989dcbf..c4273cd5f7ed 100644
+--- a/drivers/staging/rtl8188eu/os_dep/usb_intf.c
++++ b/drivers/staging/rtl8188eu/os_dep/usb_intf.c
+@@ -54,9 +54,11 @@ static struct usb_device_id rtw_usb_id_tbl[] = {
+ {USB_DEVICE(USB_VENDER_ID_REALTEK, 0x0179)}, /* 8188ETV */
+ /*=== Customer ID ===*/
+ /****** 8188EUS ********/
++ {USB_DEVICE(0x056e, 0x4008)}, /* Elecom WDC-150SU2M */
+ {USB_DEVICE(0x07b8, 0x8179)}, /* Abocom - Abocom */
+ {USB_DEVICE(0x2001, 0x330F)}, /* DLink DWA-125 REV D1 */
+ {USB_DEVICE(0x2001, 0x3310)}, /* Dlink DWA-123 REV D1 */
++ {USB_DEVICE(0x0df6, 0x0076)}, /* Sitecom N150 v2 */
+ {} /* Terminating entry */
+ };
+
+diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
+index fbf6c5ad222f..ef2fb367d179 100644
+--- a/drivers/tty/serial/serial_core.c
++++ b/drivers/tty/serial/serial_core.c
+@@ -243,6 +243,9 @@ static void uart_shutdown(struct tty_struct *tty, struct uart_state *state)
+ /*
+ * Turn off DTR and RTS early.
+ */
++ if (uart_console(uport) && tty)
++ uport->cons->cflag = tty->termios.c_cflag;
++
+ if (!tty || (tty->termios.c_cflag & HUPCL))
+ uart_clear_mctrl(uport, TIOCM_DTR | TIOCM_RTS);
+
+diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
+index 257876ea03a1..0b59731c3021 100644
+--- a/drivers/usb/core/devio.c
++++ b/drivers/usb/core/devio.c
+@@ -1509,7 +1509,7 @@ static int proc_do_submiturb(struct usb_dev_state *ps, struct usbdevfs_urb *uurb
+ u = (is_in ? URB_DIR_IN : URB_DIR_OUT);
+ if (uurb->flags & USBDEVFS_URB_ISO_ASAP)
+ u |= URB_ISO_ASAP;
+- if (uurb->flags & USBDEVFS_URB_SHORT_NOT_OK)
++ if (uurb->flags & USBDEVFS_URB_SHORT_NOT_OK && is_in)
+ u |= URB_SHORT_NOT_OK;
+ if (uurb->flags & USBDEVFS_URB_NO_FSBR)
+ u |= URB_NO_FSBR;
+diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
+index 0e950ad8cb25..27f217107ef1 100644
+--- a/drivers/usb/core/hub.c
++++ b/drivers/usb/core/hub.c
+@@ -1728,8 +1728,14 @@ static int hub_probe(struct usb_interface *intf, const struct usb_device_id *id)
+ * - Change autosuspend delay of hub can avoid unnecessary auto
+ * suspend timer for hub, also may decrease power consumption
+ * of USB bus.
++ *
++ * - If user has indicated to prevent autosuspend by passing
++ * usbcore.autosuspend = -1 then keep autosuspend disabled.
+ */
+- pm_runtime_set_autosuspend_delay(&hdev->dev, 0);
++#ifdef CONFIG_PM_RUNTIME
++ if (hdev->dev.power.autosuspend_delay >= 0)
++ pm_runtime_set_autosuspend_delay(&hdev->dev, 0);
++#endif
+
+ /*
+ * Hubs have proper suspend/resume support, except for root hubs
+@@ -3264,6 +3270,43 @@ static int finish_port_resume(struct usb_device *udev)
+ }
+
+ /*
++ * There are some SS USB devices which take longer time for link training.
++ * XHCI specs 4.19.4 says that when Link training is successful, port
++ * sets CSC bit to 1. So if SW reads port status before successful link
++ * training, then it will not find device to be present.
++ * USB Analyzer log with such buggy devices show that in some cases
++ * device switch on the RX termination after long delay of host enabling
++ * the VBUS. In few other cases it has been seen that device fails to
++ * negotiate link training in first attempt. It has been
++ * reported till now that few devices take as long as 2000 ms to train
++ * the link after host enabling its VBUS and termination. Following
++ * routine implements a 2000 ms timeout for link training. If in a case
++ * link trains before timeout, loop will exit earlier.
++ *
++ * FIXME: If a device was connected before suspend, but was removed
++ * while system was asleep, then the loop in the following routine will
++ * only exit at timeout.
++ *
++ * This routine should only be called when persist is enabled for a SS
++ * device.
++ */
++static int wait_for_ss_port_enable(struct usb_device *udev,
++ struct usb_hub *hub, int *port1,
++ u16 *portchange, u16 *portstatus)
++{
++ int status = 0, delay_ms = 0;
++
++ while (delay_ms < 2000) {
++ if (status || *portstatus & USB_PORT_STAT_CONNECTION)
++ break;
++ msleep(20);
++ delay_ms += 20;
++ status = hub_port_status(hub, *port1, portstatus, portchange);
++ }
++ return status;
++}
++
++/*
+ * usb_port_resume - re-activate a suspended usb device's upstream port
+ * @udev: device to re-activate, not a root hub
+ * Context: must be able to sleep; device not locked; pm locks held
+@@ -3359,6 +3402,10 @@ int usb_port_resume(struct usb_device *udev, pm_message_t msg)
+ }
+ }
+
++ if (udev->persist_enabled && hub_is_superspeed(hub->hdev))
++ status = wait_for_ss_port_enable(udev, hub, &port1, &portchange,
++ &portstatus);
++
+ status = check_port_resume_type(udev,
+ hub, port1, status, portchange, portstatus);
+ if (status == 0)
+@@ -4550,6 +4597,7 @@ static void hub_port_connect(struct usb_hub *hub, int port1, u16 portstatus,
+ struct usb_hcd *hcd = bus_to_hcd(hdev->bus);
+ struct usb_port *port_dev = hub->ports[port1 - 1];
+ struct usb_device *udev = port_dev->child;
++ static int unreliable_port = -1;
+
+ /* Disconnect any existing devices under this port */
+ if (udev) {
+@@ -4570,10 +4618,12 @@ static void hub_port_connect(struct usb_hub *hub, int port1, u16 portstatus,
+ USB_PORT_STAT_C_ENABLE)) {
+ status = hub_port_debounce_be_stable(hub, port1);
+ if (status < 0) {
+- if (status != -ENODEV && printk_ratelimit())
+- dev_err(&port_dev->dev,
+- "connect-debounce failed\n");
++ if (status != -ENODEV &&
++ port1 != unreliable_port &&
++ printk_ratelimit())
++ dev_err(&port_dev->dev, "connect-debounce failed\n");
+ portstatus &= ~USB_PORT_STAT_CONNECTION;
++ unreliable_port = port1;
+ } else {
+ portstatus = status;
+ }
+diff --git a/drivers/usb/host/ehci-hub.c b/drivers/usb/host/ehci-hub.c
+index cc305c71ac3d..6130b7574908 100644
+--- a/drivers/usb/host/ehci-hub.c
++++ b/drivers/usb/host/ehci-hub.c
+@@ -1230,7 +1230,7 @@ int ehci_hub_control(
+ if (selector == EHSET_TEST_SINGLE_STEP_SET_FEATURE) {
+ spin_unlock_irqrestore(&ehci->lock, flags);
+ retval = ehset_single_step_set_feature(hcd,
+- wIndex);
++ wIndex + 1);
+ spin_lock_irqsave(&ehci->lock, flags);
+ break;
+ }
+diff --git a/drivers/usb/host/ehci-pci.c b/drivers/usb/host/ehci-pci.c
+index 3e86bf4371b3..ca7b964124af 100644
+--- a/drivers/usb/host/ehci-pci.c
++++ b/drivers/usb/host/ehci-pci.c
+@@ -35,6 +35,21 @@ static const char hcd_name[] = "ehci-pci";
+ #define PCI_DEVICE_ID_INTEL_CE4100_USB 0x2e70
+
+ /*-------------------------------------------------------------------------*/
++#define PCI_DEVICE_ID_INTEL_QUARK_X1000_SOC 0x0939
++static inline bool is_intel_quark_x1000(struct pci_dev *pdev)
++{
++ return pdev->vendor == PCI_VENDOR_ID_INTEL &&
++ pdev->device == PCI_DEVICE_ID_INTEL_QUARK_X1000_SOC;
++}
++
++/*
++ * 0x84 is the offset of in/out threshold register,
++ * and it is the same offset as the register of 'hostpc'.
++ */
++#define intel_quark_x1000_insnreg01 hostpc
++
++/* Maximum usable threshold value is 0x7f dwords for both IN and OUT */
++#define INTEL_QUARK_X1000_EHCI_MAX_THRESHOLD 0x007f007f
+
+ /* called after powerup, by probe or system-pm "wakeup" */
+ static int ehci_pci_reinit(struct ehci_hcd *ehci, struct pci_dev *pdev)
+@@ -50,6 +65,16 @@ static int ehci_pci_reinit(struct ehci_hcd *ehci, struct pci_dev *pdev)
+ if (!retval)
+ ehci_dbg(ehci, "MWI active\n");
+
++ /* Reset the threshold limit */
++ if (is_intel_quark_x1000(pdev)) {
++ /*
++ * For the Intel QUARK X1000, raise the I/O threshold to the
++ * maximum usable value in order to improve performance.
++ */
++ ehci_writel(ehci, INTEL_QUARK_X1000_EHCI_MAX_THRESHOLD,
++ ehci->regs->intel_quark_x1000_insnreg01);
++ }
++
+ return 0;
+ }
+
+diff --git a/drivers/usb/host/ohci-dbg.c b/drivers/usb/host/ohci-dbg.c
+index 45032e933e18..04f2186939d2 100644
+--- a/drivers/usb/host/ohci-dbg.c
++++ b/drivers/usb/host/ohci-dbg.c
+@@ -236,7 +236,7 @@ ohci_dump_roothub (
+ }
+ }
+
+-static void ohci_dump (struct ohci_hcd *controller, int verbose)
++static void ohci_dump(struct ohci_hcd *controller)
+ {
+ ohci_dbg (controller, "OHCI controller state\n");
+
+@@ -464,15 +464,16 @@ show_list (struct ohci_hcd *ohci, char *buf, size_t count, struct ed *ed)
+ static ssize_t fill_async_buffer(struct debug_buffer *buf)
+ {
+ struct ohci_hcd *ohci;
+- size_t temp;
++ size_t temp, size;
+ unsigned long flags;
+
+ ohci = buf->ohci;
++ size = PAGE_SIZE;
+
+ /* display control and bulk lists together, for simplicity */
+ spin_lock_irqsave (&ohci->lock, flags);
+- temp = show_list(ohci, buf->page, buf->count, ohci->ed_controltail);
+- temp += show_list(ohci, buf->page + temp, buf->count - temp,
++ temp = show_list(ohci, buf->page, size, ohci->ed_controltail);
++ temp += show_list(ohci, buf->page + temp, size - temp,
+ ohci->ed_bulktail);
+ spin_unlock_irqrestore (&ohci->lock, flags);
+
+diff --git a/drivers/usb/host/ohci-hcd.c b/drivers/usb/host/ohci-hcd.c
+index f98d03f3144c..a21a36500fd7 100644
+--- a/drivers/usb/host/ohci-hcd.c
++++ b/drivers/usb/host/ohci-hcd.c
+@@ -76,8 +76,8 @@ static const char hcd_name [] = "ohci_hcd";
+ #include "ohci.h"
+ #include "pci-quirks.h"
+
+-static void ohci_dump (struct ohci_hcd *ohci, int verbose);
+-static void ohci_stop (struct usb_hcd *hcd);
++static void ohci_dump(struct ohci_hcd *ohci);
++static void ohci_stop(struct usb_hcd *hcd);
+
+ #include "ohci-hub.c"
+ #include "ohci-dbg.c"
+@@ -744,7 +744,7 @@ retry:
+ ohci->ed_to_check = NULL;
+ }
+
+- ohci_dump (ohci, 1);
++ ohci_dump(ohci);
+
+ return 0;
+ }
+@@ -825,7 +825,7 @@ static irqreturn_t ohci_irq (struct usb_hcd *hcd)
+ usb_hc_died(hcd);
+ }
+
+- ohci_dump (ohci, 1);
++ ohci_dump(ohci);
+ ohci_usb_reset (ohci);
+ }
+
+@@ -925,7 +925,7 @@ static void ohci_stop (struct usb_hcd *hcd)
+ {
+ struct ohci_hcd *ohci = hcd_to_ohci (hcd);
+
+- ohci_dump (ohci, 1);
++ ohci_dump(ohci);
+
+ if (quirk_nec(ohci))
+ flush_work(&ohci->nec_work);
+diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
+index d4253e319428..a8bde5b8cbdd 100644
+--- a/drivers/usb/host/ohci-q.c
++++ b/drivers/usb/host/ohci-q.c
+@@ -311,8 +311,7 @@ static void periodic_unlink (struct ohci_hcd *ohci, struct ed *ed)
+ * - ED_OPER: when there's any request queued, the ED gets rescheduled
+ * immediately. HC should be working on them.
+ *
+- * - ED_IDLE: when there's no TD queue. there's no reason for the HC
+- * to care about this ED; safe to disable the endpoint.
++ * - ED_IDLE: when there's no TD queue or the HC isn't running.
+ *
+ * When finish_unlinks() runs later, after SOF interrupt, it will often
+ * complete one or more URB unlinks before making that state change.
+@@ -926,6 +925,10 @@ rescan_all:
+ int completed, modified;
+ __hc32 *prev;
+
++ /* Is this ED already invisible to the hardware? */
++ if (ed->state == ED_IDLE)
++ goto ed_idle;
++
+ /* only take off EDs that the HC isn't using, accounting for
+ * frame counter wraps and EDs with partially retired TDs
+ */
+@@ -955,12 +958,20 @@ skip_ed:
+ }
+ }
+
++ /* ED's now officially unlinked, hc doesn't see */
++ ed->state = ED_IDLE;
++ if (quirk_zfmicro(ohci) && ed->type == PIPE_INTERRUPT)
++ ohci->eds_scheduled--;
++ ed->hwHeadP &= ~cpu_to_hc32(ohci, ED_H);
++ ed->hwNextED = 0;
++ wmb();
++ ed->hwINFO &= ~cpu_to_hc32(ohci, ED_SKIP | ED_DEQUEUE);
++ed_idle:
++
+ /* reentrancy: if we drop the schedule lock, someone might
+ * have modified this list. normally it's just prepending
+ * entries (which we'd ignore), but paranoia won't hurt.
+ */
+- *last = ed->ed_next;
+- ed->ed_next = NULL;
+ modified = 0;
+
+ /* unlink urbs as requested, but rescan the list after
+@@ -1018,19 +1029,20 @@ rescan_this:
+ if (completed && !list_empty (&ed->td_list))
+ goto rescan_this;
+
+- /* ED's now officially unlinked, hc doesn't see */
+- ed->state = ED_IDLE;
+- if (quirk_zfmicro(ohci) && ed->type == PIPE_INTERRUPT)
+- ohci->eds_scheduled--;
+- ed->hwHeadP &= ~cpu_to_hc32(ohci, ED_H);
+- ed->hwNextED = 0;
+- wmb ();
+- ed->hwINFO &= ~cpu_to_hc32 (ohci, ED_SKIP | ED_DEQUEUE);
+-
+- /* but if there's work queued, reschedule */
+- if (!list_empty (&ed->td_list)) {
+- if (ohci->rh_state == OHCI_RH_RUNNING)
+- ed_schedule (ohci, ed);
++ /*
++ * If no TDs are queued, take ED off the ed_rm_list.
++ * Otherwise, if the HC is running, reschedule.
++ * If not, leave it on the list for further dequeues.
++ */
++ if (list_empty(&ed->td_list)) {
++ *last = ed->ed_next;
++ ed->ed_next = NULL;
++ } else if (ohci->rh_state == OHCI_RH_RUNNING) {
++ *last = ed->ed_next;
++ ed->ed_next = NULL;
++ ed_schedule(ohci, ed);
++ } else {
++ last = &ed->ed_next;
+ }
+
+ if (modified)
+diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
+index e20520f42753..994a36e582ca 100644
+--- a/drivers/usb/host/xhci-pci.c
++++ b/drivers/usb/host/xhci-pci.c
+@@ -101,6 +101,10 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
+ /* AMD PLL quirk */
+ if (pdev->vendor == PCI_VENDOR_ID_AMD && usb_amd_find_chipset_info())
+ xhci->quirks |= XHCI_AMD_PLL_FIX;
++
++ if (pdev->vendor == PCI_VENDOR_ID_AMD)
++ xhci->quirks |= XHCI_TRUST_TX_LENGTH;
++
+ if (pdev->vendor == PCI_VENDOR_ID_INTEL) {
+ xhci->quirks |= XHCI_LPM_SUPPORT;
+ xhci->quirks |= XHCI_INTEL_HOST;
+@@ -143,6 +147,7 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
+ pdev->device == PCI_DEVICE_ID_ASROCK_P67) {
+ xhci->quirks |= XHCI_RESET_ON_RESUME;
+ xhci->quirks |= XHCI_TRUST_TX_LENGTH;
++ xhci->quirks |= XHCI_BROKEN_STREAMS;
+ }
+ if (pdev->vendor == PCI_VENDOR_ID_RENESAS &&
+ pdev->device == 0x0015)
+@@ -150,6 +155,11 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
+ if (pdev->vendor == PCI_VENDOR_ID_VIA)
+ xhci->quirks |= XHCI_RESET_ON_RESUME;
+
++ /* See https://bugzilla.kernel.org/show_bug.cgi?id=79511 */
++ if (pdev->vendor == PCI_VENDOR_ID_VIA &&
++ pdev->device == 0x3432)
++ xhci->quirks |= XHCI_BROKEN_STREAMS;
++
+ if (xhci->quirks & XHCI_RESET_ON_RESUME)
+ xhci_dbg_trace(xhci, trace_xhci_dbg_quirks,
+ "QUIRK: Resetting on resume");
+@@ -230,7 +240,8 @@ static int xhci_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
+ goto put_usb3_hcd;
+ /* Roothub already marked as USB 3.0 speed */
+
+- if (HCC_MAX_PSA(xhci->hcc_params) >= 4)
++ if (!(xhci->quirks & XHCI_BROKEN_STREAMS) &&
++ HCC_MAX_PSA(xhci->hcc_params) >= 4)
+ xhci->shared_hcd->can_do_streams = 1;
+
+ /* USB-2 and USB-3 roothubs initialized, allow runtime pm suspend */
+diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
+index 749fc68eb5c1..28a929d45cfe 100644
+--- a/drivers/usb/host/xhci-ring.c
++++ b/drivers/usb/host/xhci-ring.c
+@@ -364,32 +364,6 @@ static void ring_doorbell_for_active_rings(struct xhci_hcd *xhci,
+ }
+ }
+
+-/*
+- * Find the segment that trb is in. Start searching in start_seg.
+- * If we must move past a segment that has a link TRB with a toggle cycle state
+- * bit set, then we will toggle the value pointed at by cycle_state.
+- */
+-static struct xhci_segment *find_trb_seg(
+- struct xhci_segment *start_seg,
+- union xhci_trb *trb, int *cycle_state)
+-{
+- struct xhci_segment *cur_seg = start_seg;
+- struct xhci_generic_trb *generic_trb;
+-
+- while (cur_seg->trbs > trb ||
+- &cur_seg->trbs[TRBS_PER_SEGMENT - 1] < trb) {
+- generic_trb = &cur_seg->trbs[TRBS_PER_SEGMENT - 1].generic;
+- if (generic_trb->field[3] & cpu_to_le32(LINK_TOGGLE))
+- *cycle_state ^= 0x1;
+- cur_seg = cur_seg->next;
+- if (cur_seg == start_seg)
+- /* Looped over the entire list. Oops! */
+- return NULL;
+- }
+- return cur_seg;
+-}
+-
+-
+ static struct xhci_ring *xhci_triad_to_transfer_ring(struct xhci_hcd *xhci,
+ unsigned int slot_id, unsigned int ep_index,
+ unsigned int stream_id)
+@@ -459,9 +433,12 @@ void xhci_find_new_dequeue_state(struct xhci_hcd *xhci,
+ struct xhci_virt_device *dev = xhci->devs[slot_id];
+ struct xhci_virt_ep *ep = &dev->eps[ep_index];
+ struct xhci_ring *ep_ring;
+- struct xhci_generic_trb *trb;
++ struct xhci_segment *new_seg;
++ union xhci_trb *new_deq;
+ dma_addr_t addr;
+ u64 hw_dequeue;
++ bool cycle_found = false;
++ bool td_last_trb_found = false;
+
+ ep_ring = xhci_triad_to_transfer_ring(xhci, slot_id,
+ ep_index, stream_id);
+@@ -486,45 +463,45 @@ void xhci_find_new_dequeue_state(struct xhci_hcd *xhci,
+ hw_dequeue = le64_to_cpu(ep_ctx->deq);
+ }
+
+- /* Find virtual address and segment of hardware dequeue pointer */
+- state->new_deq_seg = ep_ring->deq_seg;
+- state->new_deq_ptr = ep_ring->dequeue;
+- while (xhci_trb_virt_to_dma(state->new_deq_seg, state->new_deq_ptr)
+- != (dma_addr_t)(hw_dequeue & ~0xf)) {
+- next_trb(xhci, ep_ring, &state->new_deq_seg,
+- &state->new_deq_ptr);
+- if (state->new_deq_ptr == ep_ring->dequeue) {
+- WARN_ON(1);
+- return;
+- }
+- }
++ new_seg = ep_ring->deq_seg;
++ new_deq = ep_ring->dequeue;
++ state->new_cycle_state = hw_dequeue & 0x1;
++
+ /*
+- * Find cycle state for last_trb, starting at old cycle state of
+- * hw_dequeue. If there is only one segment ring, find_trb_seg() will
+- * return immediately and cannot toggle the cycle state if this search
+- * wraps around, so add one more toggle manually in that case.
++ * We want to find the pointer, segment and cycle state of the new trb
++ * (the one after current TD's last_trb). We know the cycle state at
++ * hw_dequeue, so walk the ring until both hw_dequeue and last_trb are
++ * found.
+ */
+- state->new_cycle_state = hw_dequeue & 0x1;
+- if (ep_ring->first_seg == ep_ring->first_seg->next &&
+- cur_td->last_trb < state->new_deq_ptr)
+- state->new_cycle_state ^= 0x1;
++ do {
++ if (!cycle_found && xhci_trb_virt_to_dma(new_seg, new_deq)
++ == (dma_addr_t)(hw_dequeue & ~0xf)) {
++ cycle_found = true;
++ if (td_last_trb_found)
++ break;
++ }
++ if (new_deq == cur_td->last_trb)
++ td_last_trb_found = true;
+
+- state->new_deq_ptr = cur_td->last_trb;
+- xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
+- "Finding segment containing last TRB in TD.");
+- state->new_deq_seg = find_trb_seg(state->new_deq_seg,
+- state->new_deq_ptr, &state->new_cycle_state);
+- if (!state->new_deq_seg) {
+- WARN_ON(1);
+- return;
+- }
++ if (cycle_found &&
++ TRB_TYPE_LINK_LE32(new_deq->generic.field[3]) &&
++ new_deq->generic.field[3] & cpu_to_le32(LINK_TOGGLE))
++ state->new_cycle_state ^= 0x1;
++
++ next_trb(xhci, ep_ring, &new_seg, &new_deq);
++
++ /* Search wrapped around, bail out */
++ if (new_deq == ep->ring->dequeue) {
++ xhci_err(xhci, "Error: Failed finding new dequeue state\n");
++ state->new_deq_seg = NULL;
++ state->new_deq_ptr = NULL;
++ return;
++ }
++
++ } while (!cycle_found || !td_last_trb_found);
+
+- /* Increment to find next TRB after last_trb. Cycle if appropriate. */
+- trb = &state->new_deq_ptr->generic;
+- if (TRB_TYPE_LINK_LE32(trb->field[3]) &&
+- (trb->field[3] & cpu_to_le32(LINK_TOGGLE)))
+- state->new_cycle_state ^= 0x1;
+- next_trb(xhci, ep_ring, &state->new_deq_seg, &state->new_deq_ptr);
++ state->new_deq_seg = new_seg;
++ state->new_deq_ptr = new_deq;
+
+ /* Don't update the ring cycle state for the producer (us). */
+ xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
+@@ -2483,7 +2460,8 @@ static int handle_tx_event(struct xhci_hcd *xhci,
+ * last TRB of the previous TD. The command completion handle
+ * will take care the rest.
+ */
+- if (!event_seg && trb_comp_code == COMP_STOP_INVAL) {
++ if (!event_seg && (trb_comp_code == COMP_STOP ||
++ trb_comp_code == COMP_STOP_INVAL)) {
+ ret = 0;
+ goto cleanup;
+ }
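The rewritten xhci-ring.c dequeue search above replaces find_trb_seg() with a single walk that tracks two flags: whether the hardware dequeue slot has been seen (so the cycle state is known to be valid) and whether the TD's last TRB has been passed. The control flow can be modeled outside the kernel as a small sketch; the ring, types, and helper name here are toy stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 8

/* Toy ring: toggle_link[i] marks a link TRB with LINK_TOGGLE set. */
struct toy_ring {
    bool toggle_link[RING_SIZE];
};

/*
 * Walk the ring from 'start' until both the hardware dequeue slot and
 * the TD's last TRB have been seen, mirroring the do/while loop the
 * patch adds to xhci_find_new_dequeue_state().  Returns the new
 * dequeue index, or -1 if the search wraps around (the bail-out case),
 * and updates *cycle as link TRBs with the toggle bit are crossed.
 */
static int find_new_dequeue(const struct toy_ring *ring, int start,
                            int hw_dequeue, int last_trb, int *cycle)
{
    bool cycle_found = false, td_last_trb_found = false;
    int i = start;

    do {
        if (!cycle_found && i == hw_dequeue) {
            cycle_found = true;
            if (td_last_trb_found)
                break;          /* hw dequeue already past last_trb */
        }
        if (i == last_trb)
            td_last_trb_found = true;

        /* Only toggle once the cycle state is known to be valid here. */
        if (cycle_found && ring->toggle_link[i])
            *cycle ^= 1;

        i = (i + 1) % RING_SIZE;
        if (i == start)
            return -1;          /* wrapped: no valid dequeue state */
    } while (!cycle_found || !td_last_trb_found);

    return i;
}
```

In the normal case the result is the slot just past last_trb; when the controller's dequeue pointer is already beyond last_trb, the break path returns the hardware position itself, which is why the caller (xhci_cleanup_stalled_ring) now checks for the NULL/failed outcome before issuing a Set TR Dequeue command.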
+diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
+index 7436d5f5e67a..e32cc6cf86dc 100644
+--- a/drivers/usb/host/xhci.c
++++ b/drivers/usb/host/xhci.c
+@@ -2891,6 +2891,9 @@ void xhci_cleanup_stalled_ring(struct xhci_hcd *xhci,
+ ep_index, ep->stopped_stream, ep->stopped_td,
+ &deq_state);
+
++ if (!deq_state.new_deq_ptr || !deq_state.new_deq_seg)
++ return;
++
+ /* HW with the reset endpoint quirk will use the saved dequeue state to
+ * issue a configure endpoint command later.
+ */
+@@ -3163,7 +3166,8 @@ int xhci_alloc_streams(struct usb_hcd *hcd, struct usb_device *udev,
+ num_streams);
+
+ /* MaxPSASize value 0 (2 streams) means streams are not supported */
+- if (HCC_MAX_PSA(xhci->hcc_params) < 4) {
++ if ((xhci->quirks & XHCI_BROKEN_STREAMS) ||
++ HCC_MAX_PSA(xhci->hcc_params) < 4) {
+ xhci_dbg(xhci, "xHCI controller does not support streams.\n");
+ return -ENOSYS;
+ }
+diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
+index 9ffecd56600d..dace5152e179 100644
+--- a/drivers/usb/host/xhci.h
++++ b/drivers/usb/host/xhci.h
+@@ -1558,6 +1558,8 @@ struct xhci_hcd {
+ #define XHCI_PLAT (1 << 16)
+ #define XHCI_SLOW_SUSPEND (1 << 17)
+ #define XHCI_SPURIOUS_WAKEUP (1 << 18)
++/* For controllers with a broken beyond repair streams implementation */
++#define XHCI_BROKEN_STREAMS (1 << 19)
+ unsigned int num_active_eps;
+ unsigned int limit_active_eps;
+ /* There are two roothubs to keep track of bus suspend info for */
+diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
+index 8a3813be1b28..8b0f517abb6b 100644
+--- a/drivers/usb/serial/ftdi_sio.c
++++ b/drivers/usb/serial/ftdi_sio.c
+@@ -151,6 +151,7 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, FTDI_AMC232_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_CANUSB_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_CANDAPTER_PID) },
++ { USB_DEVICE(FTDI_VID, FTDI_BM_ATOM_NANO_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_NXTCAM_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_EV3CON_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_SCS_DEVICE_0_PID) },
+@@ -673,6 +674,8 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, XSENS_CONVERTER_5_PID) },
+ { USB_DEVICE(FTDI_VID, XSENS_CONVERTER_6_PID) },
+ { USB_DEVICE(FTDI_VID, XSENS_CONVERTER_7_PID) },
++ { USB_DEVICE(XSENS_VID, XSENS_CONVERTER_PID) },
++ { USB_DEVICE(XSENS_VID, XSENS_MTW_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_OMNI1509) },
+ { USB_DEVICE(MOBILITY_VID, MOBILITY_USB_SERIAL_PID) },
+ { USB_DEVICE(FTDI_VID, FTDI_ACTIVE_ROBOTS_PID) },
+@@ -945,6 +948,8 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(BRAINBOXES_VID, BRAINBOXES_US_842_2_PID) },
+ { USB_DEVICE(BRAINBOXES_VID, BRAINBOXES_US_842_3_PID) },
+ { USB_DEVICE(BRAINBOXES_VID, BRAINBOXES_US_842_4_PID) },
++ /* ekey Devices */
++ { USB_DEVICE(FTDI_VID, FTDI_EKEY_CONV_USB_PID) },
+ /* Infineon Devices */
+ { USB_DEVICE_INTERFACE_NUMBER(INFINEON_VID, INFINEON_TRIBOARD_PID, 1) },
+ { } /* Terminating entry */
+diff --git a/drivers/usb/serial/ftdi_sio_ids.h b/drivers/usb/serial/ftdi_sio_ids.h
+index c4777bc6aee0..70b0b1d88ae9 100644
+--- a/drivers/usb/serial/ftdi_sio_ids.h
++++ b/drivers/usb/serial/ftdi_sio_ids.h
+@@ -42,6 +42,8 @@
+ /* www.candapter.com Ewert Energy Systems CANdapter device */
+ #define FTDI_CANDAPTER_PID 0x9F80 /* Product Id */
+
++#define FTDI_BM_ATOM_NANO_PID 0xa559 /* Basic Micro ATOM Nano USB2Serial */
++
+ /*
+ * Texas Instruments XDS100v2 JTAG / BeagleBone A3
+ * http://processors.wiki.ti.com/index.php/XDS100
+@@ -140,12 +142,15 @@
+ /*
+ * Xsens Technologies BV products (http://www.xsens.com).
+ */
+-#define XSENS_CONVERTER_0_PID 0xD388
+-#define XSENS_CONVERTER_1_PID 0xD389
++#define XSENS_VID 0x2639
++#define XSENS_CONVERTER_PID 0xD00D /* Xsens USB-serial converter */
++#define XSENS_MTW_PID 0x0200 /* Xsens MTw */
++#define XSENS_CONVERTER_0_PID 0xD388 /* Xsens USB converter */
++#define XSENS_CONVERTER_1_PID 0xD389 /* Xsens Wireless Receiver */
+ #define XSENS_CONVERTER_2_PID 0xD38A
+-#define XSENS_CONVERTER_3_PID 0xD38B
+-#define XSENS_CONVERTER_4_PID 0xD38C
+-#define XSENS_CONVERTER_5_PID 0xD38D
++#define XSENS_CONVERTER_3_PID 0xD38B /* Xsens USB-serial converter */
++#define XSENS_CONVERTER_4_PID 0xD38C /* Xsens Wireless Receiver */
++#define XSENS_CONVERTER_5_PID 0xD38D /* Xsens Awinda Station */
+ #define XSENS_CONVERTER_6_PID 0xD38E
+ #define XSENS_CONVERTER_7_PID 0xD38F
+
+@@ -1375,3 +1380,8 @@
+ #define BRAINBOXES_US_160_6_PID 0x9006 /* US-160 16xRS232 1Mbaud Port 11 and 12 */
+ #define BRAINBOXES_US_160_7_PID 0x9007 /* US-160 16xRS232 1Mbaud Port 13 and 14 */
+ #define BRAINBOXES_US_160_8_PID 0x9008 /* US-160 16xRS232 1Mbaud Port 15 and 16 */
++
++/*
++ * ekey biometric systems GmbH (http://ekey.net/)
++ */
++#define FTDI_EKEY_CONV_USB_PID 0xCB08 /* Converter USB */
+diff --git a/drivers/usb/serial/whiteheat.c b/drivers/usb/serial/whiteheat.c
+index e62f2dff8b7d..6c3734d2b45a 100644
+--- a/drivers/usb/serial/whiteheat.c
++++ b/drivers/usb/serial/whiteheat.c
+@@ -514,6 +514,10 @@ static void command_port_read_callback(struct urb *urb)
+ dev_dbg(&urb->dev->dev, "%s - command_info is NULL, exiting.\n", __func__);
+ return;
+ }
++ if (!urb->actual_length) {
++ dev_dbg(&urb->dev->dev, "%s - empty response, exiting.\n", __func__);
++ return;
++ }
+ if (status) {
+ dev_dbg(&urb->dev->dev, "%s - nonzero urb status: %d\n", __func__, status);
+ if (status != -ENOENT)
+@@ -534,7 +538,8 @@ static void command_port_read_callback(struct urb *urb)
+ /* These are unsolicited reports from the firmware, hence no
+ waiting command to wakeup */
+ dev_dbg(&urb->dev->dev, "%s - event received\n", __func__);
+- } else if (data[0] == WHITEHEAT_GET_DTR_RTS) {
++ } else if ((data[0] == WHITEHEAT_GET_DTR_RTS) &&
++ (urb->actual_length - 1 <= sizeof(command_info->result_buffer))) {
+ memcpy(command_info->result_buffer, &data[1],
+ urb->actual_length - 1);
+ command_info->command_finished = WHITEHEAT_CMD_COMPLETE;
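The whiteheat change above adds two guards: reject zero-length responses early, and only memcpy into the fixed result buffer when the payload (everything after the 1-byte opcode) actually fits. The shape of that check can be sketched as follows; the buffer size and function name are hypothetical, chosen only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RESULT_BUFFER_SIZE 64   /* hypothetical size, for illustration */

/*
 * Sketch of the validation the patch adds to
 * command_port_read_callback(): an empty URB carries nothing to parse
 * (and actual_length - 1 would underflow), and an oversized payload
 * must not be copied into the fixed-size result buffer.  Returns false
 * when the response is rejected.
 */
static bool copy_dtr_rts_result(unsigned char *result_buffer,
                                const unsigned char *data,
                                size_t actual_length)
{
    if (actual_length == 0)
        return false;           /* empty response: nothing to parse */
    if (actual_length - 1 > RESULT_BUFFER_SIZE)
        return false;           /* would overflow the result buffer */
    memcpy(result_buffer, &data[1], actual_length - 1);
    return true;
}
```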
+diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
+index 511b22953167..3f42785f653c 100644
+--- a/drivers/usb/storage/uas.c
++++ b/drivers/usb/storage/uas.c
+@@ -1026,7 +1026,7 @@ static int uas_configure_endpoints(struct uas_dev_info *devinfo)
+ usb_endpoint_num(&eps[3]->desc));
+
+ if (udev->speed != USB_SPEED_SUPER) {
+- devinfo->qdepth = 256;
++ devinfo->qdepth = 32;
+ devinfo->use_streams = 0;
+ } else {
+ devinfo->qdepth = usb_alloc_streams(devinfo->intf, eps + 1,
+diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
+index 84b4bfb84344..500713882ad5 100644
+--- a/drivers/xen/events/events_fifo.c
++++ b/drivers/xen/events/events_fifo.c
+@@ -67,10 +67,9 @@ static event_word_t *event_array[MAX_EVENT_ARRAY_PAGES] __read_mostly;
+ static unsigned event_array_pages __read_mostly;
+
+ /*
+- * sync_set_bit() and friends must be unsigned long aligned on non-x86
+- * platforms.
++ * sync_set_bit() and friends must be unsigned long aligned.
+ */
+-#if !defined(CONFIG_X86) && BITS_PER_LONG > 32
++#if BITS_PER_LONG > 32
+
+ #define BM(w) (unsigned long *)((unsigned long)w & ~0x7UL)
+ #define EVTCHN_FIFO_BIT(b, w) \
+diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
+index 5a201d81049c..fbd76ded9a34 100644
+--- a/fs/btrfs/async-thread.c
++++ b/fs/btrfs/async-thread.c
+@@ -22,7 +22,6 @@
+ #include <linux/list.h>
+ #include <linux/spinlock.h>
+ #include <linux/freezer.h>
+-#include <linux/workqueue.h>
+ #include "async-thread.h"
+ #include "ctree.h"
+
+@@ -55,8 +54,39 @@ struct btrfs_workqueue {
+ struct __btrfs_workqueue *high;
+ };
+
+-static inline struct __btrfs_workqueue
+-*__btrfs_alloc_workqueue(const char *name, int flags, int max_active,
++static void normal_work_helper(struct btrfs_work *work);
++
++#define BTRFS_WORK_HELPER(name) \
++void btrfs_##name(struct work_struct *arg) \
++{ \
++ struct btrfs_work *work = container_of(arg, struct btrfs_work, \
++ normal_work); \
++ normal_work_helper(work); \
++}
++
++BTRFS_WORK_HELPER(worker_helper);
++BTRFS_WORK_HELPER(delalloc_helper);
++BTRFS_WORK_HELPER(flush_delalloc_helper);
++BTRFS_WORK_HELPER(cache_helper);
++BTRFS_WORK_HELPER(submit_helper);
++BTRFS_WORK_HELPER(fixup_helper);
++BTRFS_WORK_HELPER(endio_helper);
++BTRFS_WORK_HELPER(endio_meta_helper);
++BTRFS_WORK_HELPER(endio_meta_write_helper);
++BTRFS_WORK_HELPER(endio_raid56_helper);
++BTRFS_WORK_HELPER(rmw_helper);
++BTRFS_WORK_HELPER(endio_write_helper);
++BTRFS_WORK_HELPER(freespace_write_helper);
++BTRFS_WORK_HELPER(delayed_meta_helper);
++BTRFS_WORK_HELPER(readahead_helper);
++BTRFS_WORK_HELPER(qgroup_rescan_helper);
++BTRFS_WORK_HELPER(extent_refs_helper);
++BTRFS_WORK_HELPER(scrub_helper);
++BTRFS_WORK_HELPER(scrubwrc_helper);
++BTRFS_WORK_HELPER(scrubnc_helper);
++
++static struct __btrfs_workqueue *
++__btrfs_alloc_workqueue(const char *name, int flags, int max_active,
+ int thresh)
+ {
+ struct __btrfs_workqueue *ret = kzalloc(sizeof(*ret), GFP_NOFS);
+@@ -232,13 +262,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq)
+ spin_unlock_irqrestore(lock, flags);
+ }
+
+-static void normal_work_helper(struct work_struct *arg)
++static void normal_work_helper(struct btrfs_work *work)
+ {
+- struct btrfs_work *work;
+ struct __btrfs_workqueue *wq;
+ int need_order = 0;
+
+- work = container_of(arg, struct btrfs_work, normal_work);
+ /*
+ * We should not touch things inside work in the following cases:
+ * 1) after work->func() if it has no ordered_free
+@@ -262,7 +290,7 @@ static void normal_work_helper(struct work_struct *arg)
+ trace_btrfs_all_work_done(work);
+ }
+
+-void btrfs_init_work(struct btrfs_work *work,
++void btrfs_init_work(struct btrfs_work *work, btrfs_work_func_t uniq_func,
+ btrfs_func_t func,
+ btrfs_func_t ordered_func,
+ btrfs_func_t ordered_free)
+@@ -270,7 +298,7 @@ void btrfs_init_work(struct btrfs_work *work,
+ work->func = func;
+ work->ordered_func = ordered_func;
+ work->ordered_free = ordered_free;
+- INIT_WORK(&work->normal_work, normal_work_helper);
++ INIT_WORK(&work->normal_work, uniq_func);
+ INIT_LIST_HEAD(&work->ordered_list);
+ work->flags = 0;
+ }
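The BTRFS_WORK_HELPER() macro introduced above stamps out one thin wrapper per work type, each recovering the btrfs_work from its embedded work_struct and forwarding to the shared normal_work_helper(). The point is that every work type now has a distinct work function address, so the workqueue machinery no longer treats unrelated btrfs work items as the same work. A minimal userspace model of the macro, with toy stand-ins for the kernel types:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures, for illustration only. */
struct work_struct { int dummy; };
struct btrfs_work {
    struct work_struct normal_work;
    int ran;
};

static void normal_work_helper(struct btrfs_work *work)
{
    work->ran = 1;
}

/* Classic container_of arithmetic: member pointer back to its parent. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Mirror of BTRFS_WORK_HELPER(): each expansion is a separate function,
 * so btrfs_endio_helper and btrfs_delalloc_helper have different
 * addresses even though both just dispatch to normal_work_helper().
 */
#define BTRFS_WORK_HELPER(name)                                        \
    static void btrfs_##name(struct work_struct *arg)                  \
    {                                                                  \
        struct btrfs_work *work =                                      \
            container_of(arg, struct btrfs_work, normal_work);         \
        normal_work_helper(work);                                      \
    }

BTRFS_WORK_HELPER(endio_helper)
BTRFS_WORK_HELPER(delalloc_helper)
```

This is why the later hunks thread a `btrfs_work_func_t uniq_func` through btrfs_init_work(): callers pick the wrapper matching the queue they submit to, and INIT_WORK records that unique function instead of the shared helper.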
+diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
+index 9c6b66d15fb0..e9e31c94758f 100644
+--- a/fs/btrfs/async-thread.h
++++ b/fs/btrfs/async-thread.h
+@@ -19,12 +19,14 @@
+
+ #ifndef __BTRFS_ASYNC_THREAD_
+ #define __BTRFS_ASYNC_THREAD_
++#include <linux/workqueue.h>
+
+ struct btrfs_workqueue;
+ /* Internal use only */
+ struct __btrfs_workqueue;
+ struct btrfs_work;
+ typedef void (*btrfs_func_t)(struct btrfs_work *arg);
++typedef void (*btrfs_work_func_t)(struct work_struct *arg);
+
+ struct btrfs_work {
+ btrfs_func_t func;
+@@ -38,11 +40,35 @@ struct btrfs_work {
+ unsigned long flags;
+ };
+
++#define BTRFS_WORK_HELPER_PROTO(name) \
++void btrfs_##name(struct work_struct *arg)
++
++BTRFS_WORK_HELPER_PROTO(worker_helper);
++BTRFS_WORK_HELPER_PROTO(delalloc_helper);
++BTRFS_WORK_HELPER_PROTO(flush_delalloc_helper);
++BTRFS_WORK_HELPER_PROTO(cache_helper);
++BTRFS_WORK_HELPER_PROTO(submit_helper);
++BTRFS_WORK_HELPER_PROTO(fixup_helper);
++BTRFS_WORK_HELPER_PROTO(endio_helper);
++BTRFS_WORK_HELPER_PROTO(endio_meta_helper);
++BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper);
++BTRFS_WORK_HELPER_PROTO(endio_raid56_helper);
++BTRFS_WORK_HELPER_PROTO(rmw_helper);
++BTRFS_WORK_HELPER_PROTO(endio_write_helper);
++BTRFS_WORK_HELPER_PROTO(freespace_write_helper);
++BTRFS_WORK_HELPER_PROTO(delayed_meta_helper);
++BTRFS_WORK_HELPER_PROTO(readahead_helper);
++BTRFS_WORK_HELPER_PROTO(qgroup_rescan_helper);
++BTRFS_WORK_HELPER_PROTO(extent_refs_helper);
++BTRFS_WORK_HELPER_PROTO(scrub_helper);
++BTRFS_WORK_HELPER_PROTO(scrubwrc_helper);
++BTRFS_WORK_HELPER_PROTO(scrubnc_helper);
++
+ struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
+ int flags,
+ int max_active,
+ int thresh);
+-void btrfs_init_work(struct btrfs_work *work,
++void btrfs_init_work(struct btrfs_work *work, btrfs_work_func_t helper,
+ btrfs_func_t func,
+ btrfs_func_t ordered_func,
+ btrfs_func_t ordered_free);
+diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
+index e25564bfcb46..54a201dac7f9 100644
+--- a/fs/btrfs/backref.c
++++ b/fs/btrfs/backref.c
+@@ -276,9 +276,8 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
+ }
+ if (ret > 0)
+ goto next;
+- ret = ulist_add_merge(parents, eb->start,
+- (uintptr_t)eie,
+- (u64 *)&old, GFP_NOFS);
++ ret = ulist_add_merge_ptr(parents, eb->start,
++ eie, (void **)&old, GFP_NOFS);
+ if (ret < 0)
+ break;
+ if (!ret && extent_item_pos) {
+@@ -1001,16 +1000,19 @@ again:
+ ret = -EIO;
+ goto out;
+ }
++ btrfs_tree_read_lock(eb);
++ btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+ ret = find_extent_in_eb(eb, bytenr,
+ *extent_item_pos, &eie);
++ btrfs_tree_read_unlock_blocking(eb);
+ free_extent_buffer(eb);
+ if (ret < 0)
+ goto out;
+ ref->inode_list = eie;
+ }
+- ret = ulist_add_merge(refs, ref->parent,
+- (uintptr_t)ref->inode_list,
+- (u64 *)&eie, GFP_NOFS);
++ ret = ulist_add_merge_ptr(refs, ref->parent,
++ ref->inode_list,
++ (void **)&eie, GFP_NOFS);
+ if (ret < 0)
+ goto out;
+ if (!ret && extent_item_pos) {
+diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
+index 4794923c410c..43527fd78825 100644
+--- a/fs/btrfs/btrfs_inode.h
++++ b/fs/btrfs/btrfs_inode.h
+@@ -84,12 +84,6 @@ struct btrfs_inode {
+ */
+ struct list_head delalloc_inodes;
+
+- /*
+- * list for tracking inodes that must be sent to disk before a
+- * rename or truncate commit
+- */
+- struct list_head ordered_operations;
+-
+ /* node for the red-black tree that links inodes in subvolume root */
+ struct rb_node rb_node;
+
+diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
+index da775bfdebc9..a2e90f855d7d 100644
+--- a/fs/btrfs/delayed-inode.c
++++ b/fs/btrfs/delayed-inode.c
+@@ -1395,8 +1395,8 @@ static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
+ return -ENOMEM;
+
+ async_work->delayed_root = delayed_root;
+- btrfs_init_work(&async_work->work, btrfs_async_run_delayed_root,
+- NULL, NULL);
++ btrfs_init_work(&async_work->work, btrfs_delayed_meta_helper,
++ btrfs_async_run_delayed_root, NULL, NULL);
+ async_work->nr = nr;
+
+ btrfs_queue_work(root->fs_info->delayed_workers, &async_work->work);
+diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
+index 08e65e9cf2aa..0229c3720b30 100644
+--- a/fs/btrfs/disk-io.c
++++ b/fs/btrfs/disk-io.c
+@@ -39,7 +39,6 @@
+ #include "btrfs_inode.h"
+ #include "volumes.h"
+ #include "print-tree.h"
+-#include "async-thread.h"
+ #include "locking.h"
+ #include "tree-log.h"
+ #include "free-space-cache.h"
+@@ -60,8 +59,6 @@ static void end_workqueue_fn(struct btrfs_work *work);
+ static void free_fs_root(struct btrfs_root *root);
+ static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info,
+ int read_only);
+-static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
+- struct btrfs_root *root);
+ static void btrfs_destroy_ordered_extents(struct btrfs_root *root);
+ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
+ struct btrfs_root *root);
+@@ -695,35 +692,41 @@ static void end_workqueue_bio(struct bio *bio, int err)
+ {
+ struct end_io_wq *end_io_wq = bio->bi_private;
+ struct btrfs_fs_info *fs_info;
++ struct btrfs_workqueue *wq;
++ btrfs_work_func_t func;
+
+ fs_info = end_io_wq->info;
+ end_io_wq->error = err;
+- btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
+
+ if (bio->bi_rw & REQ_WRITE) {
+- if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA)
+- btrfs_queue_work(fs_info->endio_meta_write_workers,
+- &end_io_wq->work);
+- else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_FREE_SPACE)
+- btrfs_queue_work(fs_info->endio_freespace_worker,
+- &end_io_wq->work);
+- else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
+- btrfs_queue_work(fs_info->endio_raid56_workers,
+- &end_io_wq->work);
+- else
+- btrfs_queue_work(fs_info->endio_write_workers,
+- &end_io_wq->work);
++ if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA) {
++ wq = fs_info->endio_meta_write_workers;
++ func = btrfs_endio_meta_write_helper;
++ } else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_FREE_SPACE) {
++ wq = fs_info->endio_freespace_worker;
++ func = btrfs_freespace_write_helper;
++ } else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
++ wq = fs_info->endio_raid56_workers;
++ func = btrfs_endio_raid56_helper;
++ } else {
++ wq = fs_info->endio_write_workers;
++ func = btrfs_endio_write_helper;
++ }
+ } else {
+- if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56)
+- btrfs_queue_work(fs_info->endio_raid56_workers,
+- &end_io_wq->work);
+- else if (end_io_wq->metadata)
+- btrfs_queue_work(fs_info->endio_meta_workers,
+- &end_io_wq->work);
+- else
+- btrfs_queue_work(fs_info->endio_workers,
+- &end_io_wq->work);
++ if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
++ wq = fs_info->endio_raid56_workers;
++ func = btrfs_endio_raid56_helper;
++ } else if (end_io_wq->metadata) {
++ wq = fs_info->endio_meta_workers;
++ func = btrfs_endio_meta_helper;
++ } else {
++ wq = fs_info->endio_workers;
++ func = btrfs_endio_helper;
++ }
+ }
++
++ btrfs_init_work(&end_io_wq->work, func, end_workqueue_fn, NULL, NULL);
++ btrfs_queue_work(wq, &end_io_wq->work);
+ }
+
+ /*
+@@ -830,7 +833,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct inode *inode,
+ async->submit_bio_start = submit_bio_start;
+ async->submit_bio_done = submit_bio_done;
+
+- btrfs_init_work(&async->work, run_one_async_start,
++ btrfs_init_work(&async->work, btrfs_worker_helper, run_one_async_start,
+ run_one_async_done, run_one_async_free);
+
+ async->bio_flags = bio_flags;
+@@ -3829,34 +3832,6 @@ static void btrfs_error_commit_super(struct btrfs_root *root)
+ btrfs_cleanup_transaction(root);
+ }
+
+-static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
+- struct btrfs_root *root)
+-{
+- struct btrfs_inode *btrfs_inode;
+- struct list_head splice;
+-
+- INIT_LIST_HEAD(&splice);
+-
+- mutex_lock(&root->fs_info->ordered_operations_mutex);
+- spin_lock(&root->fs_info->ordered_root_lock);
+-
+- list_splice_init(&t->ordered_operations, &splice);
+- while (!list_empty(&splice)) {
+- btrfs_inode = list_entry(splice.next, struct btrfs_inode,
+- ordered_operations);
+-
+- list_del_init(&btrfs_inode->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-
+- btrfs_invalidate_inodes(btrfs_inode->root);
+-
+- spin_lock(&root->fs_info->ordered_root_lock);
+- }
+-
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- mutex_unlock(&root->fs_info->ordered_operations_mutex);
+-}
+-
+ static void btrfs_destroy_ordered_extents(struct btrfs_root *root)
+ {
+ struct btrfs_ordered_extent *ordered;
+@@ -4093,8 +4068,6 @@ again:
+ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
+ struct btrfs_root *root)
+ {
+- btrfs_destroy_ordered_operations(cur_trans, root);
+-
+ btrfs_destroy_delayed_refs(cur_trans, root);
+
+ cur_trans->state = TRANS_STATE_COMMIT_START;
+diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
+index 813537f362f9..8edb9fcc38d5 100644
+--- a/fs/btrfs/extent-tree.c
++++ b/fs/btrfs/extent-tree.c
+@@ -552,7 +552,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
+ caching_ctl->block_group = cache;
+ caching_ctl->progress = cache->key.objectid;
+ atomic_set(&caching_ctl->count, 1);
+- btrfs_init_work(&caching_ctl->work, caching_thread, NULL, NULL);
++ btrfs_init_work(&caching_ctl->work, btrfs_cache_helper,
++ caching_thread, NULL, NULL);
+
+ spin_lock(&cache->lock);
+ /*
+@@ -2749,8 +2750,8 @@ int btrfs_async_run_delayed_refs(struct btrfs_root *root,
+ async->sync = 0;
+ init_completion(&async->wait);
+
+- btrfs_init_work(&async->work, delayed_ref_async_start,
+- NULL, NULL);
++ btrfs_init_work(&async->work, btrfs_extent_refs_helper,
++ delayed_ref_async_start, NULL, NULL);
+
+ btrfs_queue_work(root->fs_info->extent_workers, &async->work);
+
+diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
+index a389820d158b..09b4e3165e2c 100644
+--- a/fs/btrfs/extent_io.c
++++ b/fs/btrfs/extent_io.c
+@@ -2532,6 +2532,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
+ test_bit(BIO_UPTODATE, &bio->bi_flags);
+ if (err)
+ uptodate = 0;
++ offset += len;
+ continue;
+ }
+ }
+diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
+index f46cfe45d686..54c84daec9b5 100644
+--- a/fs/btrfs/file-item.c
++++ b/fs/btrfs/file-item.c
+@@ -756,7 +756,7 @@ again:
+ found_next = 1;
+ if (ret != 0)
+ goto insert;
+- slot = 0;
++ slot = path->slots[0];
+ }
+ btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
+ if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
+index 1f2b99cb55ea..ab1fd668020d 100644
+--- a/fs/btrfs/file.c
++++ b/fs/btrfs/file.c
+@@ -1838,6 +1838,8 @@ out:
+
+ int btrfs_release_file(struct inode *inode, struct file *filp)
+ {
++ if (filp->private_data)
++ btrfs_ioctl_trans_end(filp);
+ /*
+ * ordered_data_close is set by settattr when we are about to truncate
+ * a file from a non-zero size to a zero size. This tries to
+@@ -1845,26 +1847,8 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
+ * application were using truncate to replace a file in place.
+ */
+ if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+- &BTRFS_I(inode)->runtime_flags)) {
+- struct btrfs_trans_handle *trans;
+- struct btrfs_root *root = BTRFS_I(inode)->root;
+-
+- /*
+- * We need to block on a committing transaction to keep us from
+- * throwing a ordered operation on to the list and causing
+- * something like sync to deadlock trying to flush out this
+- * inode.
+- */
+- trans = btrfs_start_transaction(root, 0);
+- if (IS_ERR(trans))
+- return PTR_ERR(trans);
+- btrfs_add_ordered_operation(trans, BTRFS_I(inode)->root, inode);
+- btrfs_end_transaction(trans, root);
+- if (inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT)
++ &BTRFS_I(inode)->runtime_flags))
+ filemap_flush(inode->i_mapping);
+- }
+- if (filp->private_data)
+- btrfs_ioctl_trans_end(filp);
+ return 0;
+ }
+
+diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
+index 3668048e16f8..c6cd34e699d0 100644
+--- a/fs/btrfs/inode.c
++++ b/fs/btrfs/inode.c
+@@ -709,6 +709,18 @@ retry:
+ unlock_extent(io_tree, async_extent->start,
+ async_extent->start +
+ async_extent->ram_size - 1);
++
++ /*
++ * we need to redirty the pages if we decide to
++ * fallback to uncompressed IO, otherwise we
++ * will not submit these pages down to lower
++ * layers.
++ */
++ extent_range_redirty_for_io(inode,
++ async_extent->start,
++ async_extent->start +
++ async_extent->ram_size - 1);
++
+ goto retry;
+ }
+ goto out_free;
+@@ -1084,8 +1096,10 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
+ async_cow->end = cur_end;
+ INIT_LIST_HEAD(&async_cow->extents);
+
+- btrfs_init_work(&async_cow->work, async_cow_start,
+- async_cow_submit, async_cow_free);
++ btrfs_init_work(&async_cow->work,
++ btrfs_delalloc_helper,
++ async_cow_start, async_cow_submit,
++ async_cow_free);
+
+ nr_pages = (cur_end - start + PAGE_CACHE_SIZE) >>
+ PAGE_CACHE_SHIFT;
+@@ -1869,7 +1883,8 @@ static int btrfs_writepage_start_hook(struct page *page, u64 start, u64 end)
+
+ SetPageChecked(page);
+ page_cache_get(page);
+- btrfs_init_work(&fixup->work, btrfs_writepage_fixup_worker, NULL, NULL);
++ btrfs_init_work(&fixup->work, btrfs_fixup_helper,
++ btrfs_writepage_fixup_worker, NULL, NULL);
+ fixup->page = page;
+ btrfs_queue_work(root->fs_info->fixup_workers, &fixup->work);
+ return -EBUSY;
+@@ -2810,7 +2825,8 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+ struct inode *inode = page->mapping->host;
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct btrfs_ordered_extent *ordered_extent = NULL;
+- struct btrfs_workqueue *workers;
++ struct btrfs_workqueue *wq;
++ btrfs_work_func_t func;
+
+ trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
+
+@@ -2819,13 +2835,17 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+ end - start + 1, uptodate))
+ return 0;
+
+- btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
++ if (btrfs_is_free_space_inode(inode)) {
++ wq = root->fs_info->endio_freespace_worker;
++ func = btrfs_freespace_write_helper;
++ } else {
++ wq = root->fs_info->endio_write_workers;
++ func = btrfs_endio_write_helper;
++ }
+
+- if (btrfs_is_free_space_inode(inode))
+- workers = root->fs_info->endio_freespace_worker;
+- else
+- workers = root->fs_info->endio_write_workers;
+- btrfs_queue_work(workers, &ordered_extent->work);
++ btrfs_init_work(&ordered_extent->work, func, finish_ordered_fn, NULL,
++ NULL);
++ btrfs_queue_work(wq, &ordered_extent->work);
+
+ return 0;
+ }
+@@ -7146,7 +7166,8 @@ again:
+ if (!ret)
+ goto out_test;
+
+- btrfs_init_work(&ordered->work, finish_ordered_fn, NULL, NULL);
++ btrfs_init_work(&ordered->work, btrfs_endio_write_helper,
++ finish_ordered_fn, NULL, NULL);
+ btrfs_queue_work(root->fs_info->endio_write_workers,
+ &ordered->work);
+ out_test:
+@@ -7939,27 +7960,6 @@ static int btrfs_truncate(struct inode *inode)
+ BUG_ON(ret);
+
+ /*
+- * setattr is responsible for setting the ordered_data_close flag,
+- * but that is only tested during the last file release. That
+- * could happen well after the next commit, leaving a great big
+- * window where new writes may get lost if someone chooses to write
+- * to this file after truncating to zero
+- *
+- * The inode doesn't have any dirty data here, and so if we commit
+- * this is a noop. If someone immediately starts writing to the inode
+- * it is very likely we'll catch some of their writes in this
+- * transaction, and the commit will find this file on the ordered
+- * data list with good things to send down.
+- *
+- * This is a best effort solution, there is still a window where
+- * using truncate to replace the contents of the file will
+- * end up with a zero length file after a crash.
+- */
+- if (inode->i_size == 0 && test_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
+- &BTRFS_I(inode)->runtime_flags))
+- btrfs_add_ordered_operation(trans, root, inode);
+-
+- /*
+ * So if we truncate and then write and fsync we normally would just
+ * write the extents that changed, which is a problem if we need to
+ * first truncate that entire inode. So set this flag so we write out
+@@ -8106,7 +8106,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
+ mutex_init(&ei->delalloc_mutex);
+ btrfs_ordered_inode_tree_init(&ei->ordered_tree);
+ INIT_LIST_HEAD(&ei->delalloc_inodes);
+- INIT_LIST_HEAD(&ei->ordered_operations);
+ RB_CLEAR_NODE(&ei->rb_node);
+
+ return inode;
+@@ -8146,17 +8145,6 @@ void btrfs_destroy_inode(struct inode *inode)
+ if (!root)
+ goto free;
+
+- /*
+- * Make sure we're properly removed from the ordered operation
+- * lists.
+- */
+- smp_mb();
+- if (!list_empty(&BTRFS_I(inode)->ordered_operations)) {
+- spin_lock(&root->fs_info->ordered_root_lock);
+- list_del_init(&BTRFS_I(inode)->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- }
+-
+ if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
+ &BTRFS_I(inode)->runtime_flags)) {
+ btrfs_info(root->fs_info, "inode %llu still on the orphan list",
+@@ -8338,12 +8326,10 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ ret = 0;
+
+ /*
+- * we're using rename to replace one file with another.
+- * and the replacement file is large. Start IO on it now so
+- * we don't add too much work to the end of the transaction
++ * we're using rename to replace one file with another. Start IO on it
++ * now so we don't add too much work to the end of the transaction
+ */
+- if (new_inode && S_ISREG(old_inode->i_mode) && new_inode->i_size &&
+- old_inode->i_size > BTRFS_ORDERED_OPERATIONS_FLUSH_LIMIT)
++ if (new_inode && S_ISREG(old_inode->i_mode) && new_inode->i_size)
+ filemap_flush(old_inode->i_mapping);
+
+ /* close the racy window with snapshot create/destroy ioctl */
+@@ -8391,12 +8377,6 @@ static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+ */
+ btrfs_pin_log_trans(root);
+ }
+- /*
+- * make sure the inode gets flushed if it is replacing
+- * something.
+- */
+- if (new_inode && new_inode->i_size && S_ISREG(old_inode->i_mode))
+- btrfs_add_ordered_operation(trans, root, old_inode);
+
+ inode_inc_iversion(old_dir);
+ inode_inc_iversion(new_dir);
+@@ -8514,7 +8494,9 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
+ work->inode = inode;
+ work->wait = wait;
+ work->delay_iput = delay_iput;
+- btrfs_init_work(&work->work, btrfs_run_delalloc_work, NULL, NULL);
++ WARN_ON_ONCE(!inode);
++ btrfs_init_work(&work->work, btrfs_flush_delalloc_helper,
++ btrfs_run_delalloc_work, NULL, NULL);
+
+ return work;
+ }
+diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
+index 7187b14faa6c..ac734ec4cc20 100644
+--- a/fs/btrfs/ordered-data.c
++++ b/fs/btrfs/ordered-data.c
+@@ -571,18 +571,6 @@ void btrfs_remove_ordered_extent(struct inode *inode,
+
+ trace_btrfs_ordered_extent_remove(inode, entry);
+
+- /*
+- * we have no more ordered extents for this inode and
+- * no dirty pages. We can safely remove it from the
+- * list of ordered extents
+- */
+- if (RB_EMPTY_ROOT(&tree->tree) &&
+- !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+- spin_lock(&root->fs_info->ordered_root_lock);
+- list_del_init(&BTRFS_I(inode)->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- }
+-
+ if (!root->nr_ordered_extents) {
+ spin_lock(&root->fs_info->ordered_root_lock);
+ BUG_ON(list_empty(&root->ordered_root));
+@@ -627,6 +615,7 @@ int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr)
+ spin_unlock(&root->ordered_extent_lock);
+
+ btrfs_init_work(&ordered->flush_work,
++ btrfs_flush_delalloc_helper,
+ btrfs_run_ordered_extent_work, NULL, NULL);
+ list_add_tail(&ordered->work_list, &works);
+ btrfs_queue_work(root->fs_info->flush_workers,
+@@ -687,81 +676,6 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr)
+ }
+
+ /*
+- * this is used during transaction commit to write all the inodes
+- * added to the ordered operation list. These files must be fully on
+- * disk before the transaction commits.
+- *
+- * we have two modes here, one is to just start the IO via filemap_flush
+- * and the other is to wait for all the io. When we wait, we have an
+- * extra check to make sure the ordered operation list really is empty
+- * before we return
+- */
+-int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root, int wait)
+-{
+- struct btrfs_inode *btrfs_inode;
+- struct inode *inode;
+- struct btrfs_transaction *cur_trans = trans->transaction;
+- struct list_head splice;
+- struct list_head works;
+- struct btrfs_delalloc_work *work, *next;
+- int ret = 0;
+-
+- INIT_LIST_HEAD(&splice);
+- INIT_LIST_HEAD(&works);
+-
+- mutex_lock(&root->fs_info->ordered_extent_flush_mutex);
+- spin_lock(&root->fs_info->ordered_root_lock);
+- list_splice_init(&cur_trans->ordered_operations, &splice);
+- while (!list_empty(&splice)) {
+- btrfs_inode = list_entry(splice.next, struct btrfs_inode,
+- ordered_operations);
+- inode = &btrfs_inode->vfs_inode;
+-
+- list_del_init(&btrfs_inode->ordered_operations);
+-
+- /*
+- * the inode may be getting freed (in sys_unlink path).
+- */
+- inode = igrab(inode);
+- if (!inode)
+- continue;
+-
+- if (!wait)
+- list_add_tail(&BTRFS_I(inode)->ordered_operations,
+- &cur_trans->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-
+- work = btrfs_alloc_delalloc_work(inode, wait, 1);
+- if (!work) {
+- spin_lock(&root->fs_info->ordered_root_lock);
+- if (list_empty(&BTRFS_I(inode)->ordered_operations))
+- list_add_tail(&btrfs_inode->ordered_operations,
+- &splice);
+- list_splice_tail(&splice,
+- &cur_trans->ordered_operations);
+- spin_unlock(&root->fs_info->ordered_root_lock);
+- ret = -ENOMEM;
+- goto out;
+- }
+- list_add_tail(&work->list, &works);
+- btrfs_queue_work(root->fs_info->flush_workers,
+- &work->work);
+-
+- cond_resched();
+- spin_lock(&root->fs_info->ordered_root_lock);
+- }
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-out:
+- list_for_each_entry_safe(work, next, &works, list) {
+- list_del_init(&work->list);
+- btrfs_wait_and_free_delalloc_work(work);
+- }
+- mutex_unlock(&root->fs_info->ordered_extent_flush_mutex);
+- return ret;
+-}
+-
+-/*
+ * Used to start IO or wait for a given ordered extent to finish.
+ *
+ * If wait is one, this effectively waits on page writeback for all the pages
+@@ -1120,42 +1034,6 @@ out:
+ return index;
+ }
+
+-
+-/*
+- * add a given inode to the list of inodes that must be fully on
+- * disk before a transaction commit finishes.
+- *
+- * This basically gives us the ext3 style data=ordered mode, and it is mostly
+- * used to make sure renamed files are fully on disk.
+- *
+- * It is a noop if the inode is already fully on disk.
+- *
+- * If trans is not null, we'll do a friendly check for a transaction that
+- * is already flushing things and force the IO down ourselves.
+- */
+-void btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root, struct inode *inode)
+-{
+- struct btrfs_transaction *cur_trans = trans->transaction;
+- u64 last_mod;
+-
+- last_mod = max(BTRFS_I(inode)->generation, BTRFS_I(inode)->last_trans);
+-
+- /*
+- * if this file hasn't been changed since the last transaction
+- * commit, we can safely return without doing anything
+- */
+- if (last_mod <= root->fs_info->last_trans_committed)
+- return;
+-
+- spin_lock(&root->fs_info->ordered_root_lock);
+- if (list_empty(&BTRFS_I(inode)->ordered_operations)) {
+- list_add_tail(&BTRFS_I(inode)->ordered_operations,
+- &cur_trans->ordered_operations);
+- }
+- spin_unlock(&root->fs_info->ordered_root_lock);
+-}
+-
+ int __init ordered_data_init(void)
+ {
+ btrfs_ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent",
+diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
+index 246897058efb..d81a274d621e 100644
+--- a/fs/btrfs/ordered-data.h
++++ b/fs/btrfs/ordered-data.h
+@@ -190,11 +190,6 @@ int btrfs_ordered_update_i_size(struct inode *inode, u64 offset,
+ struct btrfs_ordered_extent *ordered);
+ int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr,
+ u32 *sum, int len);
+-int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root, int wait);
+-void btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root,
+- struct inode *inode);
+ int btrfs_wait_ordered_extents(struct btrfs_root *root, int nr);
+ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, int nr);
+ void btrfs_get_logged_extents(struct inode *inode,
+diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
+index 98cb6b2630f9..3eec914710b2 100644
+--- a/fs/btrfs/qgroup.c
++++ b/fs/btrfs/qgroup.c
+@@ -2551,6 +2551,7 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
+ memset(&fs_info->qgroup_rescan_work, 0,
+ sizeof(fs_info->qgroup_rescan_work));
+ btrfs_init_work(&fs_info->qgroup_rescan_work,
++ btrfs_qgroup_rescan_helper,
+ btrfs_qgroup_rescan_worker, NULL, NULL);
+
+ if (ret) {
+diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
+index 4a88f073fdd7..0a6b6e4bcbb9 100644
+--- a/fs/btrfs/raid56.c
++++ b/fs/btrfs/raid56.c
+@@ -1416,7 +1416,8 @@ cleanup:
+
+ static void async_rmw_stripe(struct btrfs_raid_bio *rbio)
+ {
+- btrfs_init_work(&rbio->work, rmw_work, NULL, NULL);
++ btrfs_init_work(&rbio->work, btrfs_rmw_helper,
++ rmw_work, NULL, NULL);
+
+ btrfs_queue_work(rbio->fs_info->rmw_workers,
+ &rbio->work);
+@@ -1424,7 +1425,8 @@ static void async_rmw_stripe(struct btrfs_raid_bio *rbio)
+
+ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
+ {
+- btrfs_init_work(&rbio->work, read_rebuild_work, NULL, NULL);
++ btrfs_init_work(&rbio->work, btrfs_rmw_helper,
++ read_rebuild_work, NULL, NULL);
+
+ btrfs_queue_work(rbio->fs_info->rmw_workers,
+ &rbio->work);
+@@ -1665,7 +1667,8 @@ static void btrfs_raid_unplug(struct blk_plug_cb *cb, bool from_schedule)
+ plug = container_of(cb, struct btrfs_plug_cb, cb);
+
+ if (from_schedule) {
+- btrfs_init_work(&plug->work, unplug_work, NULL, NULL);
++ btrfs_init_work(&plug->work, btrfs_rmw_helper,
++ unplug_work, NULL, NULL);
+ btrfs_queue_work(plug->info->rmw_workers,
+ &plug->work);
+ return;
+diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
+index 09230cf3a244..20408c6b665a 100644
+--- a/fs/btrfs/reada.c
++++ b/fs/btrfs/reada.c
+@@ -798,7 +798,8 @@ static void reada_start_machine(struct btrfs_fs_info *fs_info)
+ /* FIXME we cannot handle this properly right now */
+ BUG();
+ }
+- btrfs_init_work(&rmw->work, reada_start_machine_worker, NULL, NULL);
++ btrfs_init_work(&rmw->work, btrfs_readahead_helper,
++ reada_start_machine_worker, NULL, NULL);
+ rmw->fs_info = fs_info;
+
+ btrfs_queue_work(fs_info->readahead_workers, &rmw->work);
+diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
+index b6d198f5181e..8dddedcfa961 100644
+--- a/fs/btrfs/scrub.c
++++ b/fs/btrfs/scrub.c
+@@ -428,8 +428,8 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int is_dev_replace)
+ sbio->index = i;
+ sbio->sctx = sctx;
+ sbio->page_count = 0;
+- btrfs_init_work(&sbio->work, scrub_bio_end_io_worker,
+- NULL, NULL);
++ btrfs_init_work(&sbio->work, btrfs_scrub_helper,
++ scrub_bio_end_io_worker, NULL, NULL);
+
+ if (i != SCRUB_BIOS_PER_SCTX - 1)
+ sctx->bios[i]->next_free = i + 1;
+@@ -999,8 +999,8 @@ nodatasum_case:
+ fixup_nodatasum->root = fs_info->extent_root;
+ fixup_nodatasum->mirror_num = failed_mirror_index + 1;
+ scrub_pending_trans_workers_inc(sctx);
+- btrfs_init_work(&fixup_nodatasum->work, scrub_fixup_nodatasum,
+- NULL, NULL);
++ btrfs_init_work(&fixup_nodatasum->work, btrfs_scrub_helper,
++ scrub_fixup_nodatasum, NULL, NULL);
+ btrfs_queue_work(fs_info->scrub_workers,
+ &fixup_nodatasum->work);
+ goto out;
+@@ -1616,7 +1616,8 @@ static void scrub_wr_bio_end_io(struct bio *bio, int err)
+ sbio->err = err;
+ sbio->bio = bio;
+
+- btrfs_init_work(&sbio->work, scrub_wr_bio_end_io_worker, NULL, NULL);
++ btrfs_init_work(&sbio->work, btrfs_scrubwrc_helper,
++ scrub_wr_bio_end_io_worker, NULL, NULL);
+ btrfs_queue_work(fs_info->scrub_wr_completion_workers, &sbio->work);
+ }
+
+@@ -3203,7 +3204,8 @@ static int copy_nocow_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
+ nocow_ctx->len = len;
+ nocow_ctx->mirror_num = mirror_num;
+ nocow_ctx->physical_for_dev_replace = physical_for_dev_replace;
+- btrfs_init_work(&nocow_ctx->work, copy_nocow_pages_worker, NULL, NULL);
++ btrfs_init_work(&nocow_ctx->work, btrfs_scrubnc_helper,
++ copy_nocow_pages_worker, NULL, NULL);
+ INIT_LIST_HEAD(&nocow_ctx->inodes);
+ btrfs_queue_work(fs_info->scrub_nocow_workers,
+ &nocow_ctx->work);
+diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
+index 5f379affdf23..d89c6d3542ca 100644
+--- a/fs/btrfs/transaction.c
++++ b/fs/btrfs/transaction.c
+@@ -218,7 +218,6 @@ loop:
+ spin_lock_init(&cur_trans->delayed_refs.lock);
+
+ INIT_LIST_HEAD(&cur_trans->pending_snapshots);
+- INIT_LIST_HEAD(&cur_trans->ordered_operations);
+ INIT_LIST_HEAD(&cur_trans->pending_chunks);
+ INIT_LIST_HEAD(&cur_trans->switch_commits);
+ list_add_tail(&cur_trans->list, &fs_info->trans_list);
+@@ -1612,27 +1611,6 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
+ kmem_cache_free(btrfs_trans_handle_cachep, trans);
+ }
+
+-static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
+- struct btrfs_root *root)
+-{
+- int ret;
+-
+- ret = btrfs_run_delayed_items(trans, root);
+- if (ret)
+- return ret;
+-
+- /*
+- * rename don't use btrfs_join_transaction, so, once we
+- * set the transaction to blocked above, we aren't going
+- * to get any new ordered operations. We can safely run
+- * it here and no for sure that nothing new will be added
+- * to the list
+- */
+- ret = btrfs_run_ordered_operations(trans, root, 1);
+-
+- return ret;
+-}
+-
+ static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
+ {
+ if (btrfs_test_opt(fs_info->tree_root, FLUSHONCOMMIT))
+@@ -1653,13 +1631,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+ struct btrfs_transaction *prev_trans = NULL;
+ int ret;
+
+- ret = btrfs_run_ordered_operations(trans, root, 0);
+- if (ret) {
+- btrfs_abort_transaction(trans, root, ret);
+- btrfs_end_transaction(trans, root);
+- return ret;
+- }
+-
+ /* Stop the commit early if ->aborted is set */
+ if (unlikely(ACCESS_ONCE(cur_trans->aborted))) {
+ ret = cur_trans->aborted;
+@@ -1740,7 +1711,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+ if (ret)
+ goto cleanup_transaction;
+
+- ret = btrfs_flush_all_pending_stuffs(trans, root);
++ ret = btrfs_run_delayed_items(trans, root);
+ if (ret)
+ goto cleanup_transaction;
+
+@@ -1748,7 +1719,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+ extwriter_counter_read(cur_trans) == 0);
+
+ /* some pending stuffs might be added after the previous flush. */
+- ret = btrfs_flush_all_pending_stuffs(trans, root);
++ ret = btrfs_run_delayed_items(trans, root);
+ if (ret)
+ goto cleanup_transaction;
+
+diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
+index 7dd558ed0716..579be51b27e5 100644
+--- a/fs/btrfs/transaction.h
++++ b/fs/btrfs/transaction.h
+@@ -55,7 +55,6 @@ struct btrfs_transaction {
+ wait_queue_head_t writer_wait;
+ wait_queue_head_t commit_wait;
+ struct list_head pending_snapshots;
+- struct list_head ordered_operations;
+ struct list_head pending_chunks;
+ struct list_head switch_commits;
+ struct btrfs_delayed_ref_root delayed_refs;
+diff --git a/fs/btrfs/ulist.h b/fs/btrfs/ulist.h
+index 7f78cbf5cf41..4c29db604bbe 100644
+--- a/fs/btrfs/ulist.h
++++ b/fs/btrfs/ulist.h
+@@ -57,6 +57,21 @@ void ulist_free(struct ulist *ulist);
+ int ulist_add(struct ulist *ulist, u64 val, u64 aux, gfp_t gfp_mask);
+ int ulist_add_merge(struct ulist *ulist, u64 val, u64 aux,
+ u64 *old_aux, gfp_t gfp_mask);
++
++/* just like ulist_add_merge() but take a pointer for the aux data */
++static inline int ulist_add_merge_ptr(struct ulist *ulist, u64 val, void *aux,
++ void **old_aux, gfp_t gfp_mask)
++{
++#if BITS_PER_LONG == 32
++ u64 old64 = (uintptr_t)*old_aux;
++ int ret = ulist_add_merge(ulist, val, (uintptr_t)aux, &old64, gfp_mask);
++ *old_aux = (void *)((uintptr_t)old64);
++ return ret;
++#else
++ return ulist_add_merge(ulist, val, (u64)aux, (u64 *)old_aux, gfp_mask);
++#endif
++}
++
+ struct ulist_node *ulist_next(struct ulist *ulist,
+ struct ulist_iterator *uiter);
+
+diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
+index 6cb82f62cb7c..81bec9fd8f19 100644
+--- a/fs/btrfs/volumes.c
++++ b/fs/btrfs/volumes.c
+@@ -5800,7 +5800,8 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
+ else
+ generate_random_uuid(dev->uuid);
+
+- btrfs_init_work(&dev->work, pending_bios_fn, NULL, NULL);
++ btrfs_init_work(&dev->work, btrfs_submit_helper,
++ pending_bios_fn, NULL, NULL);
+
+ return dev;
+ }
+diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
+index 8c41b52da358..16a46b6a6fee 100644
+--- a/fs/debugfs/inode.c
++++ b/fs/debugfs/inode.c
+@@ -534,7 +534,7 @@ EXPORT_SYMBOL_GPL(debugfs_remove);
+ */
+ void debugfs_remove_recursive(struct dentry *dentry)
+ {
+- struct dentry *child, *next, *parent;
++ struct dentry *child, *parent;
+
+ if (IS_ERR_OR_NULL(dentry))
+ return;
+@@ -546,30 +546,49 @@ void debugfs_remove_recursive(struct dentry *dentry)
+ parent = dentry;
+ down:
+ mutex_lock(&parent->d_inode->i_mutex);
+- list_for_each_entry_safe(child, next, &parent->d_subdirs, d_u.d_child) {
++ loop:
++ /*
++ * The parent->d_subdirs is protected by the d_lock. Outside that
++ * lock, the child can be unlinked and set to be freed which can
++ * use the d_u.d_child as the rcu head and corrupt this list.
++ */
++ spin_lock(&parent->d_lock);
++ list_for_each_entry(child, &parent->d_subdirs, d_u.d_child) {
+ if (!debugfs_positive(child))
+ continue;
+
+ /* perhaps simple_empty(child) makes more sense */
+ if (!list_empty(&child->d_subdirs)) {
++ spin_unlock(&parent->d_lock);
+ mutex_unlock(&parent->d_inode->i_mutex);
+ parent = child;
+ goto down;
+ }
+- up:
++
++ spin_unlock(&parent->d_lock);
++
+ if (!__debugfs_remove(child, parent))
+ simple_release_fs(&debugfs_mount, &debugfs_mount_count);
++
++ /*
++ * The parent->d_lock protects agaist child from unlinking
++ * from d_subdirs. When releasing the parent->d_lock we can
++ * no longer trust that the next pointer is valid.
++ * Restart the loop. We'll skip this one with the
++ * debugfs_positive() check.
++ */
++ goto loop;
+ }
++ spin_unlock(&parent->d_lock);
+
+ mutex_unlock(&parent->d_inode->i_mutex);
+ child = parent;
+ parent = parent->d_parent;
+ mutex_lock(&parent->d_inode->i_mutex);
+
+- if (child != dentry) {
+- next = list_next_entry(child, d_u.d_child);
+- goto up;
+- }
++ if (child != dentry)
++ /* go up */
++ goto loop;
+
+ if (!__debugfs_remove(child, parent))
+ simple_release_fs(&debugfs_mount, &debugfs_mount_count);
+diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
+index 7cc5a0e23688..1bbe7c315138 100644
+--- a/fs/ext4/ext4.h
++++ b/fs/ext4/ext4.h
+@@ -2144,8 +2144,8 @@ extern ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
+ extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
+ extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
+ extern void ext4_ind_truncate(handle_t *, struct inode *inode);
+-extern int ext4_free_hole_blocks(handle_t *handle, struct inode *inode,
+- ext4_lblk_t first, ext4_lblk_t stop);
++extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
++ ext4_lblk_t start, ext4_lblk_t end);
+
+ /* ioctl.c */
+ extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
+@@ -2453,6 +2453,22 @@ static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
+ up_write(&EXT4_I(inode)->i_data_sem);
+ }
+
++/* Update i_size, i_disksize. Requires i_mutex to avoid races with truncate */
++static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
++{
++ int changed = 0;
++
++ if (newsize > inode->i_size) {
++ i_size_write(inode, newsize);
++ changed = 1;
++ }
++ if (newsize > EXT4_I(inode)->i_disksize) {
++ ext4_update_i_disksize(inode, newsize);
++ changed |= 2;
++ }
++ return changed;
++}
++
+ struct ext4_group_info {
+ unsigned long bb_state;
+ struct rb_root bb_free_root;
+diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
+index 4da228a0e6d0..7dfd6300e1c2 100644
+--- a/fs/ext4/extents.c
++++ b/fs/ext4/extents.c
+@@ -4664,7 +4664,8 @@ retry:
+ }
+
+ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+- ext4_lblk_t len, int flags, int mode)
++ ext4_lblk_t len, loff_t new_size,
++ int flags, int mode)
+ {
+ struct inode *inode = file_inode(file);
+ handle_t *handle;
+@@ -4673,8 +4674,10 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+ int retries = 0;
+ struct ext4_map_blocks map;
+ unsigned int credits;
++ loff_t epos;
+
+ map.m_lblk = offset;
++ map.m_len = len;
+ /*
+ * Don't normalize the request if it can fit in one extent so
+ * that it doesn't get unnecessarily split into multiple
+@@ -4689,9 +4692,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
+ credits = ext4_chunk_trans_blocks(inode, len);
+
+ retry:
+- while (ret >= 0 && ret < len) {
+- map.m_lblk = map.m_lblk + ret;
+- map.m_len = len = len - ret;
++ while (ret >= 0 && len) {
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
+ credits);
+ if (IS_ERR(handle)) {
+@@ -4708,6 +4709,21 @@ retry:
+ ret2 = ext4_journal_stop(handle);
+ break;
+ }
++ map.m_lblk += ret;
++ map.m_len = len = len - ret;
++ epos = (loff_t)map.m_lblk << inode->i_blkbits;
++ inode->i_ctime = ext4_current_time(inode);
++ if (new_size) {
++ if (epos > new_size)
++ epos = new_size;
++ if (ext4_update_inode_size(inode, epos) & 0x1)
++ inode->i_mtime = inode->i_ctime;
++ } else {
++ if (epos > inode->i_size)
++ ext4_set_inode_flag(inode,
++ EXT4_INODE_EOFBLOCKS);
++ }
++ ext4_mark_inode_dirty(handle, inode);
+ ret2 = ext4_journal_stop(handle);
+ if (ret2)
+ break;
+@@ -4730,7 +4746,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ loff_t new_size = 0;
+ int ret = 0;
+ int flags;
+- int partial;
++ int credits;
++ int partial_begin, partial_end;
+ loff_t start, end;
+ ext4_lblk_t lblk;
+ struct address_space *mapping = inode->i_mapping;
+@@ -4770,7 +4787,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+
+ if (start < offset || end > offset + len)
+ return -EINVAL;
+- partial = (offset + len) & ((1 << blkbits) - 1);
++ partial_begin = offset & ((1 << blkbits) - 1);
++ partial_end = (offset + len) & ((1 << blkbits) - 1);
+
+ lblk = start >> blkbits;
+ max_blocks = (end >> blkbits);
+@@ -4804,7 +4822,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ * If we have a partial block after EOF we have to allocate
+ * the entire block.
+ */
+- if (partial)
++ if (partial_end)
+ max_blocks += 1;
+ }
+
+@@ -4812,6 +4830,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+
+ /* Now release the pages and zero block aligned part of pages*/
+ truncate_pagecache_range(inode, start, end - 1);
++ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+@@ -4824,13 +4843,22 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ if (ret)
+ goto out_dio;
+
+- ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags,
+- mode);
++ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
++ flags, mode);
+ if (ret)
+ goto out_dio;
+ }
++ if (!partial_begin && !partial_end)
++ goto out_dio;
+
+- handle = ext4_journal_start(inode, EXT4_HT_MISC, 4);
++ /*
++ * In worst case we have to writeout two nonadjacent unwritten
++ * blocks and update the inode
++ */
++ credits = (2 * ext4_ext_index_trans_blocks(inode, 2)) + 1;
++ if (ext4_should_journal_data(inode))
++ credits += 2;
++ handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ ext4_std_error(inode->i_sb, ret);
+@@ -4838,12 +4866,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ }
+
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+-
+ if (new_size) {
+- if (new_size > i_size_read(inode))
+- i_size_write(inode, new_size);
+- if (new_size > EXT4_I(inode)->i_disksize)
+- ext4_update_i_disksize(inode, new_size);
++ ext4_update_inode_size(inode, new_size);
+ } else {
+ /*
+ * Mark that we allocate beyond EOF so the subsequent truncate
+@@ -4852,7 +4876,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
+ if ((offset + len) > i_size_read(inode))
+ ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
+ }
+-
+ ext4_mark_inode_dirty(handle, inode);
+
+ /* Zero out partial block at the edges of the range */
+@@ -4879,13 +4902,11 @@ out_mutex:
+ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+ {
+ struct inode *inode = file_inode(file);
+- handle_t *handle;
+ loff_t new_size = 0;
+ unsigned int max_blocks;
+ int ret = 0;
+ int flags;
+ ext4_lblk_t lblk;
+- struct timespec tv;
+ unsigned int blkbits = inode->i_blkbits;
+
+ /* Return error if mode is not supported */
+@@ -4936,36 +4957,15 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+ goto out;
+ }
+
+- ret = ext4_alloc_file_blocks(file, lblk, max_blocks, flags, mode);
++ ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
++ flags, mode);
+ if (ret)
+ goto out;
+
+- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+- if (IS_ERR(handle))
+- goto out;
+-
+- tv = inode->i_ctime = ext4_current_time(inode);
+-
+- if (new_size) {
+- if (new_size > i_size_read(inode)) {
+- i_size_write(inode, new_size);
+- inode->i_mtime = tv;
+- }
+- if (new_size > EXT4_I(inode)->i_disksize)
+- ext4_update_i_disksize(inode, new_size);
+- } else {
+- /*
+- * Mark that we allocate beyond EOF so the subsequent truncate
+- * can proceed even if the new size is the same as i_size.
+- */
+- if ((offset + len) > i_size_read(inode))
+- ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
++ if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
++ ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
++ EXT4_I(inode)->i_sync_tid);
+ }
+- ext4_mark_inode_dirty(handle, inode);
+- if (file->f_flags & O_SYNC)
+- ext4_handle_sync(handle);
+-
+- ext4_journal_stop(handle);
+ out:
+ mutex_unlock(&inode->i_mutex);
+ trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
+diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
+index fd69da194826..e75f840000a0 100644
+--- a/fs/ext4/indirect.c
++++ b/fs/ext4/indirect.c
+@@ -1295,97 +1295,220 @@ do_indirects:
+ }
+ }
+
+-static int free_hole_blocks(handle_t *handle, struct inode *inode,
+- struct buffer_head *parent_bh, __le32 *i_data,
+- int level, ext4_lblk_t first,
+- ext4_lblk_t count, int max)
++/**
++ * ext4_ind_remove_space - remove space from the range
++ * @handle: JBD handle for this transaction
++ * @inode: inode we are dealing with
++ * @start: First block to remove
++ * @end: One block after the last block to remove (exclusive)
++ *
++ * Free the blocks in the defined range (end is exclusive endpoint of
++ * range). This is used by ext4_punch_hole().
++ */
++int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
++ ext4_lblk_t start, ext4_lblk_t end)
+ {
+- struct buffer_head *bh = NULL;
++ struct ext4_inode_info *ei = EXT4_I(inode);
++ __le32 *i_data = ei->i_data;
+ int addr_per_block = EXT4_ADDR_PER_BLOCK(inode->i_sb);
+- int ret = 0;
+- int i, inc;
+- ext4_lblk_t offset;
+- __le32 blk;
+-
+- inc = 1 << ((EXT4_BLOCK_SIZE_BITS(inode->i_sb) - 2) * level);
+- for (i = 0, offset = 0; i < max; i++, i_data++, offset += inc) {
+- if (offset >= count + first)
+- break;
+- if (*i_data == 0 || (offset + inc) <= first)
+- continue;
+- blk = *i_data;
+- if (level > 0) {
+- ext4_lblk_t first2;
+- ext4_lblk_t count2;
++ ext4_lblk_t offsets[4], offsets2[4];
++ Indirect chain[4], chain2[4];
++ Indirect *partial, *partial2;
++ ext4_lblk_t max_block;
++ __le32 nr = 0, nr2 = 0;
++ int n = 0, n2 = 0;
++ unsigned blocksize = inode->i_sb->s_blocksize;
+
+- bh = sb_bread(inode->i_sb, le32_to_cpu(blk));
+- if (!bh) {
+- EXT4_ERROR_INODE_BLOCK(inode, le32_to_cpu(blk),
+- "Read failure");
+- return -EIO;
+- }
+- if (first > offset) {
+- first2 = first - offset;
+- count2 = count;
++ max_block = (EXT4_SB(inode->i_sb)->s_bitmap_maxbytes + blocksize-1)
++ >> EXT4_BLOCK_SIZE_BITS(inode->i_sb);
++ if (end >= max_block)
++ end = max_block;
++ if ((start >= end) || (start > max_block))
++ return 0;
++
++ n = ext4_block_to_path(inode, start, offsets, NULL);
++ n2 = ext4_block_to_path(inode, end, offsets2, NULL);
++
++ BUG_ON(n > n2);
++
++ if ((n == 1) && (n == n2)) {
++ /* We're punching only within direct block range */
++ ext4_free_data(handle, inode, NULL, i_data + offsets[0],
++ i_data + offsets2[0]);
++ return 0;
++ } else if (n2 > n) {
++ /*
++ * Start and end are on a different levels so we're going to
++ * free partial block at start, and partial block at end of
++ * the range. If there are some levels in between then
++ * do_indirects label will take care of that.
++ */
++
++ if (n == 1) {
++ /*
++ * Start is at the direct block level, free
++ * everything to the end of the level.
++ */
++ ext4_free_data(handle, inode, NULL, i_data + offsets[0],
++ i_data + EXT4_NDIR_BLOCKS);
++ goto end_range;
++ }
++
++
++ partial = ext4_find_shared(inode, n, offsets, chain, &nr);
++ if (nr) {
++ if (partial == chain) {
++ /* Shared branch grows from the inode */
++ ext4_free_branches(handle, inode, NULL,
++ &nr, &nr+1, (chain+n-1) - partial);
++ *partial->p = 0;
+ } else {
+- first2 = 0;
+- count2 = count - (offset - first);
++ /* Shared branch grows from an indirect block */
++ BUFFER_TRACE(partial->bh, "get_write_access");
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p,
++ partial->p+1, (chain+n-1) - partial);
+ }
+- ret = free_hole_blocks(handle, inode, bh,
+- (__le32 *)bh->b_data, level - 1,
+- first2, count2,
+- inode->i_sb->s_blocksize >> 2);
+- if (ret) {
+- brelse(bh);
+- goto err;
++ }
++
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the start of the range
++ */
++ while (partial > chain) {
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p + 1,
++ (__le32 *)partial->bh->b_data+addr_per_block,
++ (chain+n-1) - partial);
++ BUFFER_TRACE(partial->bh, "call brelse");
++ brelse(partial->bh);
++ partial--;
++ }
++
++end_range:
++ partial2 = ext4_find_shared(inode, n2, offsets2, chain2, &nr2);
++ if (nr2) {
++ if (partial2 == chain2) {
++ /*
++ * Remember, end is exclusive so here we're at
++ * the start of the next level we're not going
++ * to free. Everything was covered by the start
++ * of the range.
++ */
++ return 0;
++ } else {
++ /* Shared branch grows from an indirect block */
++ partial2--;
+ }
++ } else {
++ /*
++ * ext4_find_shared returns Indirect structure which
++ * points to the last element which should not be
++ * removed by truncate. But this is end of the range
++ * in punch_hole so we need to point to the next element
++ */
++ partial2->p++;
+ }
+- if (level == 0 ||
+- (bh && all_zeroes((__le32 *)bh->b_data,
+- (__le32 *)bh->b_data + addr_per_block))) {
+- ext4_free_data(handle, inode, parent_bh,
+- i_data, i_data + 1);
++
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the end of the range
++ */
++ while (partial2 > chain2) {
++ ext4_free_branches(handle, inode, partial2->bh,
++ (__le32 *)partial2->bh->b_data,
++ partial2->p,
++ (chain2+n2-1) - partial2);
++ BUFFER_TRACE(partial2->bh, "call brelse");
++ brelse(partial2->bh);
++ partial2--;
+ }
+- brelse(bh);
+- bh = NULL;
++ goto do_indirects;
+ }
+
+-err:
+- return ret;
+-}
+-
+-int ext4_free_hole_blocks(handle_t *handle, struct inode *inode,
+- ext4_lblk_t first, ext4_lblk_t stop)
+-{
+- int addr_per_block = EXT4_ADDR_PER_BLOCK(inode->i_sb);
+- int level, ret = 0;
+- int num = EXT4_NDIR_BLOCKS;
+- ext4_lblk_t count, max = EXT4_NDIR_BLOCKS;
+- __le32 *i_data = EXT4_I(inode)->i_data;
+-
+- count = stop - first;
+- for (level = 0; level < 4; level++, max *= addr_per_block) {
+- if (first < max) {
+- ret = free_hole_blocks(handle, inode, NULL, i_data,
+- level, first, count, num);
+- if (ret)
+- goto err;
+- if (count > max - first)
+- count -= max - first;
+- else
+- break;
+- first = 0;
+- } else {
+- first -= max;
++ /* Punch happened within the same level (n == n2) */
++ partial = ext4_find_shared(inode, n, offsets, chain, &nr);
++ partial2 = ext4_find_shared(inode, n2, offsets2, chain2, &nr2);
++ /*
++ * ext4_find_shared returns Indirect structure which
++ * points to the last element which should not be
++ * removed by truncate. But this is end of the range
++ * in punch_hole so we need to point to the next element
++ */
++ partial2->p++;
++ while ((partial > chain) || (partial2 > chain2)) {
++ /* We're at the same block, so we're almost finished */
++ if ((partial->bh && partial2->bh) &&
++ (partial->bh->b_blocknr == partial2->bh->b_blocknr)) {
++ if ((partial > chain) && (partial2 > chain2)) {
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p + 1,
++ partial2->p,
++ (chain+n-1) - partial);
++ BUFFER_TRACE(partial->bh, "call brelse");
++ brelse(partial->bh);
++ BUFFER_TRACE(partial2->bh, "call brelse");
++ brelse(partial2->bh);
++ }
++ return 0;
+ }
+- i_data += num;
+- if (level == 0) {
+- num = 1;
+- max = 1;
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the start of the range
++ */
++ if (partial > chain) {
++ ext4_free_branches(handle, inode, partial->bh,
++ partial->p + 1,
++ (__le32 *)partial->bh->b_data+addr_per_block,
++ (chain+n-1) - partial);
++ BUFFER_TRACE(partial->bh, "call brelse");
++ brelse(partial->bh);
++ partial--;
++ }
++ /*
++ * Clear the ends of indirect blocks on the shared branch
++ * at the end of the range
++ */
++ if (partial2 > chain2) {
++ ext4_free_branches(handle, inode, partial2->bh,
++ (__le32 *)partial2->bh->b_data,
++ partial2->p,
++ (chain2+n-1) - partial2);
++ BUFFER_TRACE(partial2->bh, "call brelse");
++ brelse(partial2->bh);
++ partial2--;
+ }
+ }
+
+-err:
+- return ret;
++do_indirects:
++ /* Kill the remaining (whole) subtrees */
++ switch (offsets[0]) {
++ default:
++ if (++n >= n2)
++ return 0;
++ nr = i_data[EXT4_IND_BLOCK];
++ if (nr) {
++ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 1);
++ i_data[EXT4_IND_BLOCK] = 0;
++ }
++ case EXT4_IND_BLOCK:
++ if (++n >= n2)
++ return 0;
++ nr = i_data[EXT4_DIND_BLOCK];
++ if (nr) {
++ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 2);
++ i_data[EXT4_DIND_BLOCK] = 0;
++ }
++ case EXT4_DIND_BLOCK:
++ if (++n >= n2)
++ return 0;
++ nr = i_data[EXT4_TIND_BLOCK];
++ if (nr) {
++ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 3);
++ i_data[EXT4_TIND_BLOCK] = 0;
++ }
++ case EXT4_TIND_BLOCK:
++ ;
++ }
++ return 0;
+ }
+-
+diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
+index 8a064734e6eb..e9c9b5bd906a 100644
+--- a/fs/ext4/inode.c
++++ b/fs/ext4/inode.c
+@@ -1092,27 +1092,11 @@ static int ext4_write_end(struct file *file,
+ } else
+ copied = block_write_end(file, mapping, pos,
+ len, copied, page, fsdata);
+-
+ /*
+- * No need to use i_size_read() here, the i_size
+- * cannot change under us because we hole i_mutex.
+- *
+- * But it's important to update i_size while still holding page lock:
++ * it's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+- if (pos + copied > inode->i_size) {
+- i_size_write(inode, pos + copied);
+- i_size_changed = 1;
+- }
+-
+- if (pos + copied > EXT4_I(inode)->i_disksize) {
+- /* We need to mark inode dirty even if
+- * new_i_size is less that inode->i_size
+- * but greater than i_disksize. (hint delalloc)
+- */
+- ext4_update_i_disksize(inode, (pos + copied));
+- i_size_changed = 1;
+- }
++ i_size_changed = ext4_update_inode_size(inode, pos + copied);
+ unlock_page(page);
+ page_cache_release(page);
+
+@@ -1160,7 +1144,7 @@ static int ext4_journalled_write_end(struct file *file,
+ int ret = 0, ret2;
+ int partial = 0;
+ unsigned from, to;
+- loff_t new_i_size;
++ int size_changed = 0;
+
+ trace_ext4_journalled_write_end(inode, pos, len, copied);
+ from = pos & (PAGE_CACHE_SIZE - 1);
+@@ -1183,20 +1167,18 @@ static int ext4_journalled_write_end(struct file *file,
+ if (!partial)
+ SetPageUptodate(page);
+ }
+- new_i_size = pos + copied;
+- if (new_i_size > inode->i_size)
+- i_size_write(inode, pos+copied);
++ size_changed = ext4_update_inode_size(inode, pos + copied);
+ ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+ EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
+- if (new_i_size > EXT4_I(inode)->i_disksize) {
+- ext4_update_i_disksize(inode, new_i_size);
++ unlock_page(page);
++ page_cache_release(page);
++
++ if (size_changed) {
+ ret2 = ext4_mark_inode_dirty(handle, inode);
+ if (!ret)
+ ret = ret2;
+ }
+
+- unlock_page(page);
+- page_cache_release(page);
+ if (pos + len > inode->i_size && ext4_can_truncate(inode))
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+@@ -2212,6 +2194,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
+ struct ext4_map_blocks *map = &mpd->map;
+ int err;
+ loff_t disksize;
++ int progress = 0;
+
+ mpd->io_submit.io_end->offset =
+ ((loff_t)map->m_lblk) << inode->i_blkbits;
+@@ -2228,8 +2211,11 @@ static int mpage_map_and_submit_extent(handle_t *handle,
+ * is non-zero, a commit should free up blocks.
+ */
+ if ((err == -ENOMEM) ||
+- (err == -ENOSPC && ext4_count_free_clusters(sb)))
++ (err == -ENOSPC && ext4_count_free_clusters(sb))) {
++ if (progress)
++ goto update_disksize;
+ return err;
++ }
+ ext4_msg(sb, KERN_CRIT,
+ "Delayed block allocation failed for "
+ "inode %lu at logical offset %llu with"
+@@ -2246,15 +2232,17 @@ static int mpage_map_and_submit_extent(handle_t *handle,
+ *give_up_on_write = true;
+ return err;
+ }
++ progress = 1;
+ /*
+ * Update buffer state, submit mapped pages, and get us new
+ * extent to map
+ */
+ err = mpage_map_and_submit_buffers(mpd);
+ if (err < 0)
+- return err;
++ goto update_disksize;
+ } while (map->m_len);
+
++update_disksize:
+ /*
+ * Update on-disk size after IO is submitted. Races with
+ * truncate are avoided by checking i_size under i_data_sem.
+@@ -3624,7 +3612,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
+ ret = ext4_ext_remove_space(inode, first_block,
+ stop_block - 1);
+ else
+- ret = ext4_free_hole_blocks(handle, inode, first_block,
++ ret = ext4_ind_remove_space(handle, inode, first_block,
+ stop_block);
+
+ up_write(&EXT4_I(inode)->i_data_sem);
+diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
+index 2dcb936be90e..c3e7418a6811 100644
+--- a/fs/ext4/mballoc.c
++++ b/fs/ext4/mballoc.c
+@@ -1412,6 +1412,8 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
+ int last = first + count - 1;
+ struct super_block *sb = e4b->bd_sb;
+
++ if (WARN_ON(count == 0))
++ return;
+ BUG_ON(last >= (sb->s_blocksize << 3));
+ assert_spin_locked(ext4_group_lock_ptr(sb, e4b->bd_group));
+ /* Don't bother if the block group is corrupt. */
+@@ -3216,8 +3218,30 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
+ static void ext4_discard_allocated_blocks(struct ext4_allocation_context *ac)
+ {
+ struct ext4_prealloc_space *pa = ac->ac_pa;
++ struct ext4_buddy e4b;
++ int err;
+
+- if (pa && pa->pa_type == MB_INODE_PA)
++ if (pa == NULL) {
++ if (ac->ac_f_ex.fe_len == 0)
++ return;
++ err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
++ if (err) {
++ /*
++ * This should never happen since we pin the
++ * pages in the ext4_allocation_context so
++ * ext4_mb_load_buddy() should never fail.
++ */
++ WARN(1, "mb_load_buddy failed (%d)", err);
++ return;
++ }
++ ext4_lock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
++ mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
++ ac->ac_f_ex.fe_len);
++ ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
++ ext4_mb_unload_buddy(&e4b);
++ return;
++ }
++ if (pa->pa_type == MB_INODE_PA)
+ pa->pa_free += ac->ac_b_ex.fe_len;
+ }
+
+diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
+index 3520ab8a6639..9e6eced1605b 100644
+--- a/fs/ext4/namei.c
++++ b/fs/ext4/namei.c
+@@ -3128,7 +3128,8 @@ static int ext4_find_delete_entry(handle_t *handle, struct inode *dir,
+ return retval;
+ }
+
+-static void ext4_rename_delete(handle_t *handle, struct ext4_renament *ent)
++static void ext4_rename_delete(handle_t *handle, struct ext4_renament *ent,
++ int force_reread)
+ {
+ int retval;
+ /*
+@@ -3140,7 +3141,8 @@ static void ext4_rename_delete(handle_t *handle, struct ext4_renament *ent)
+ if (le32_to_cpu(ent->de->inode) != ent->inode->i_ino ||
+ ent->de->name_len != ent->dentry->d_name.len ||
+ strncmp(ent->de->name, ent->dentry->d_name.name,
+- ent->de->name_len)) {
++ ent->de->name_len) ||
++ force_reread) {
+ retval = ext4_find_delete_entry(handle, ent->dir,
+ &ent->dentry->d_name);
+ } else {
+@@ -3191,6 +3193,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ .dentry = new_dentry,
+ .inode = new_dentry->d_inode,
+ };
++ int force_reread;
+ int retval;
+
+ dquot_initialize(old.dir);
+@@ -3246,6 +3249,15 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ if (retval)
+ goto end_rename;
+ }
++ /*
++ * If we're renaming a file within an inline_data dir and adding or
++ * setting the new dirent causes a conversion from inline_data to
++ * extents/blockmap, we need to force the dirent delete code to
++ * re-read the directory, or else we end up trying to delete a dirent
++ * from what is now the extent tree root (or a block map).
++ */
++ force_reread = (new.dir->i_ino == old.dir->i_ino &&
++ ext4_test_inode_flag(new.dir, EXT4_INODE_INLINE_DATA));
+ if (!new.bh) {
+ retval = ext4_add_entry(handle, new.dentry, old.inode);
+ if (retval)
+@@ -3256,6 +3268,9 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ if (retval)
+ goto end_rename;
+ }
++ if (force_reread)
++ force_reread = !ext4_test_inode_flag(new.dir,
++ EXT4_INODE_INLINE_DATA);
+
+ /*
+ * Like most other Unix systems, set the ctime for inodes on a
+@@ -3267,7 +3282,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ /*
+ * ok, that's it
+ */
+- ext4_rename_delete(handle, &old);
++ ext4_rename_delete(handle, &old, force_reread);
+
+ if (new.inode) {
+ ext4_dec_count(handle, new.inode);
+diff --git a/fs/ext4/super.c b/fs/ext4/super.c
+index 6df7bc611dbd..beeb5c4e1f9d 100644
+--- a/fs/ext4/super.c
++++ b/fs/ext4/super.c
+@@ -3185,9 +3185,9 @@ static int set_journal_csum_feature_set(struct super_block *sb)
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
+- /* journal checksum v2 */
++ /* journal checksum v3 */
+ compat = 0;
+- incompat = JBD2_FEATURE_INCOMPAT_CSUM_V2;
++ incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3;
+ } else {
+ /* journal checksum v1 */
+ compat = JBD2_FEATURE_COMPAT_CHECKSUM;
+@@ -3209,6 +3209,7 @@ static int set_journal_csum_feature_set(struct super_block *sb)
+ jbd2_journal_clear_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
++ JBD2_FEATURE_INCOMPAT_CSUM_V3 |
+ JBD2_FEATURE_INCOMPAT_CSUM_V2);
+ }
+
+diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
+index 4556ce1af5b0..5ddaf8625d3b 100644
+--- a/fs/isofs/inode.c
++++ b/fs/isofs/inode.c
+@@ -61,7 +61,7 @@ static void isofs_put_super(struct super_block *sb)
+ return;
+ }
+
+-static int isofs_read_inode(struct inode *);
++static int isofs_read_inode(struct inode *, int relocated);
+ static int isofs_statfs (struct dentry *, struct kstatfs *);
+
+ static struct kmem_cache *isofs_inode_cachep;
+@@ -1259,7 +1259,7 @@ out_toomany:
+ goto out;
+ }
+
+-static int isofs_read_inode(struct inode *inode)
++static int isofs_read_inode(struct inode *inode, int relocated)
+ {
+ struct super_block *sb = inode->i_sb;
+ struct isofs_sb_info *sbi = ISOFS_SB(sb);
+@@ -1404,7 +1404,7 @@ static int isofs_read_inode(struct inode *inode)
+ */
+
+ if (!high_sierra) {
+- parse_rock_ridge_inode(de, inode);
++ parse_rock_ridge_inode(de, inode, relocated);
+ /* if we want uid/gid set, override the rock ridge setting */
+ if (sbi->s_uid_set)
+ inode->i_uid = sbi->s_uid;
+@@ -1483,9 +1483,10 @@ static int isofs_iget5_set(struct inode *ino, void *data)
+ * offset that point to the underlying meta-data for the inode. The
+ * code below is otherwise similar to the iget() code in
+ * include/linux/fs.h */
+-struct inode *isofs_iget(struct super_block *sb,
+- unsigned long block,
+- unsigned long offset)
++struct inode *__isofs_iget(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset,
++ int relocated)
+ {
+ unsigned long hashval;
+ struct inode *inode;
+@@ -1507,7 +1508,7 @@ struct inode *isofs_iget(struct super_block *sb,
+ return ERR_PTR(-ENOMEM);
+
+ if (inode->i_state & I_NEW) {
+- ret = isofs_read_inode(inode);
++ ret = isofs_read_inode(inode, relocated);
+ if (ret < 0) {
+ iget_failed(inode);
+ inode = ERR_PTR(ret);
+diff --git a/fs/isofs/isofs.h b/fs/isofs/isofs.h
+index 99167238518d..0ac4c1f73fbd 100644
+--- a/fs/isofs/isofs.h
++++ b/fs/isofs/isofs.h
+@@ -107,7 +107,7 @@ extern int iso_date(char *, int);
+
+ struct inode; /* To make gcc happy */
+
+-extern int parse_rock_ridge_inode(struct iso_directory_record *, struct inode *);
++extern int parse_rock_ridge_inode(struct iso_directory_record *, struct inode *, int relocated);
+ extern int get_rock_ridge_filename(struct iso_directory_record *, char *, struct inode *);
+ extern int isofs_name_translate(struct iso_directory_record *, char *, struct inode *);
+
+@@ -118,9 +118,24 @@ extern struct dentry *isofs_lookup(struct inode *, struct dentry *, unsigned int
+ extern struct buffer_head *isofs_bread(struct inode *, sector_t);
+ extern int isofs_get_blocks(struct inode *, sector_t, struct buffer_head **, unsigned long);
+
+-extern struct inode *isofs_iget(struct super_block *sb,
+- unsigned long block,
+- unsigned long offset);
++struct inode *__isofs_iget(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset,
++ int relocated);
++
++static inline struct inode *isofs_iget(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset)
++{
++ return __isofs_iget(sb, block, offset, 0);
++}
++
++static inline struct inode *isofs_iget_reloc(struct super_block *sb,
++ unsigned long block,
++ unsigned long offset)
++{
++ return __isofs_iget(sb, block, offset, 1);
++}
+
+ /* Because the inode number is no longer relevant to finding the
+ * underlying meta-data for an inode, we are free to choose a more
+diff --git a/fs/isofs/rock.c b/fs/isofs/rock.c
+index c0bf42472e40..f488bbae541a 100644
+--- a/fs/isofs/rock.c
++++ b/fs/isofs/rock.c
+@@ -288,12 +288,16 @@ eio:
+ goto out;
+ }
+
++#define RR_REGARD_XA 1
++#define RR_RELOC_DE 2
++
+ static int
+ parse_rock_ridge_inode_internal(struct iso_directory_record *de,
+- struct inode *inode, int regard_xa)
++ struct inode *inode, int flags)
+ {
+ int symlink_len = 0;
+ int cnt, sig;
++ unsigned int reloc_block;
+ struct inode *reloc;
+ struct rock_ridge *rr;
+ int rootflag;
+@@ -305,7 +309,7 @@ parse_rock_ridge_inode_internal(struct iso_directory_record *de,
+
+ init_rock_state(&rs, inode);
+ setup_rock_ridge(de, inode, &rs);
+- if (regard_xa) {
++ if (flags & RR_REGARD_XA) {
+ rs.chr += 14;
+ rs.len -= 14;
+ if (rs.len < 0)
+@@ -485,12 +489,22 @@ repeat:
+ "relocated directory\n");
+ goto out;
+ case SIG('C', 'L'):
+- ISOFS_I(inode)->i_first_extent =
+- isonum_733(rr->u.CL.location);
+- reloc =
+- isofs_iget(inode->i_sb,
+- ISOFS_I(inode)->i_first_extent,
+- 0);
++ if (flags & RR_RELOC_DE) {
++ printk(KERN_ERR
++ "ISOFS: Recursive directory relocation "
++ "is not supported\n");
++ goto eio;
++ }
++ reloc_block = isonum_733(rr->u.CL.location);
++ if (reloc_block == ISOFS_I(inode)->i_iget5_block &&
++ ISOFS_I(inode)->i_iget5_offset == 0) {
++ printk(KERN_ERR
++ "ISOFS: Directory relocation points to "
++ "itself\n");
++ goto eio;
++ }
++ ISOFS_I(inode)->i_first_extent = reloc_block;
++ reloc = isofs_iget_reloc(inode->i_sb, reloc_block, 0);
+ if (IS_ERR(reloc)) {
+ ret = PTR_ERR(reloc);
+ goto out;
+@@ -637,9 +651,11 @@ static char *get_symlink_chunk(char *rpnt, struct rock_ridge *rr, char *plimit)
+ return rpnt;
+ }
+
+-int parse_rock_ridge_inode(struct iso_directory_record *de, struct inode *inode)
++int parse_rock_ridge_inode(struct iso_directory_record *de, struct inode *inode,
++ int relocated)
+ {
+- int result = parse_rock_ridge_inode_internal(de, inode, 0);
++ int flags = relocated ? RR_RELOC_DE : 0;
++ int result = parse_rock_ridge_inode_internal(de, inode, flags);
+
+ /*
+ * if rockridge flag was reset and we didn't look for attributes
+@@ -647,7 +663,8 @@ int parse_rock_ridge_inode(struct iso_directory_record *de, struct inode *inode)
+ */
+ if ((ISOFS_SB(inode->i_sb)->s_rock_offset == -1)
+ && (ISOFS_SB(inode->i_sb)->s_rock == 2)) {
+- result = parse_rock_ridge_inode_internal(de, inode, 14);
++ result = parse_rock_ridge_inode_internal(de, inode,
++ flags | RR_REGARD_XA);
+ }
+ return result;
+ }
+diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
+index 6fac74349856..b73e0215baa7 100644
+--- a/fs/jbd2/commit.c
++++ b/fs/jbd2/commit.c
+@@ -97,7 +97,7 @@ static void jbd2_commit_block_csum_set(journal_t *j, struct buffer_head *bh)
+ struct commit_header *h;
+ __u32 csum;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ h = (struct commit_header *)(bh->b_data);
+@@ -313,11 +313,11 @@ static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
+ return checksum;
+ }
+
+-static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
++static void write_tag_block(journal_t *j, journal_block_tag_t *tag,
+ unsigned long long block)
+ {
+ tag->t_blocknr = cpu_to_be32(block & (u32)~0);
+- if (tag_bytes > JBD2_TAG_SIZE32)
++ if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_64BIT))
+ tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
+ }
+
+@@ -327,7 +327,7 @@ static void jbd2_descr_block_csum_set(journal_t *j,
+ struct jbd2_journal_block_tail *tail;
+ __u32 csum;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ tail = (struct jbd2_journal_block_tail *)(bh->b_data + j->j_blocksize -
+@@ -340,12 +340,13 @@ static void jbd2_descr_block_csum_set(journal_t *j,
+ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
+ struct buffer_head *bh, __u32 sequence)
+ {
++ journal_block_tag3_t *tag3 = (journal_block_tag3_t *)tag;
+ struct page *page = bh->b_page;
+ __u8 *addr;
+ __u32 csum32;
+ __be32 seq;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ seq = cpu_to_be32(sequence);
+@@ -355,8 +356,10 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
+ bh->b_size);
+ kunmap_atomic(addr);
+
+- /* We only have space to store the lower 16 bits of the crc32c. */
+- tag->t_checksum = cpu_to_be16(csum32);
++ if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ tag3->t_checksum = cpu_to_be32(csum32);
++ else
++ tag->t_checksum = cpu_to_be16(csum32);
+ }
+ /*
+ * jbd2_journal_commit_transaction
+@@ -396,7 +399,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
+ LIST_HEAD(io_bufs);
+ LIST_HEAD(log_bufs);
+
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ csum_size = sizeof(struct jbd2_journal_block_tail);
+
+ /*
+@@ -690,7 +693,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
+ tag_flag |= JBD2_FLAG_SAME_UUID;
+
+ tag = (journal_block_tag_t *) tagp;
+- write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
++ write_tag_block(journal, tag, jh2bh(jh)->b_blocknr);
+ tag->t_flags = cpu_to_be16(tag_flag);
+ jbd2_block_tag_csum_set(journal, tag, wbuf[bufs],
+ commit_transaction->t_tid);
+diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
+index 67b8e303946c..19d74d86d99c 100644
+--- a/fs/jbd2/journal.c
++++ b/fs/jbd2/journal.c
+@@ -124,7 +124,7 @@ EXPORT_SYMBOL(__jbd2_debug);
+ /* Checksumming functions */
+ static int jbd2_verify_csum_type(journal_t *j, journal_superblock_t *sb)
+ {
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ return sb->s_checksum_type == JBD2_CRC32C_CHKSUM;
+@@ -145,7 +145,7 @@ static __be32 jbd2_superblock_csum(journal_t *j, journal_superblock_t *sb)
+
+ static int jbd2_superblock_csum_verify(journal_t *j, journal_superblock_t *sb)
+ {
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ return sb->s_checksum == jbd2_superblock_csum(j, sb);
+@@ -153,7 +153,7 @@ static int jbd2_superblock_csum_verify(journal_t *j, journal_superblock_t *sb)
+
+ static void jbd2_superblock_csum_set(journal_t *j, journal_superblock_t *sb)
+ {
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ sb->s_checksum = jbd2_superblock_csum(j, sb);
+@@ -1522,21 +1522,29 @@ static int journal_get_superblock(journal_t *journal)
+ goto out;
+ }
+
+- if (JBD2_HAS_COMPAT_FEATURE(journal, JBD2_FEATURE_COMPAT_CHECKSUM) &&
+- JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
++ if (jbd2_journal_has_csum_v2or3(journal) &&
++ JBD2_HAS_COMPAT_FEATURE(journal, JBD2_FEATURE_COMPAT_CHECKSUM)) {
+ /* Can't have checksum v1 and v2 on at the same time! */
+ printk(KERN_ERR "JBD2: Can't enable checksumming v1 and v2 "
+ "at the same time!\n");
+ goto out;
+ }
+
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2) &&
++ JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V3)) {
++ /* Can't have checksum v2 and v3 at the same time! */
++ printk(KERN_ERR "JBD2: Can't enable checksumming v2 and v3 "
++ "at the same time!\n");
++ goto out;
++ }
++
+ if (!jbd2_verify_csum_type(journal, sb)) {
+ printk(KERN_ERR "JBD2: Unknown checksum type\n");
+ goto out;
+ }
+
+ /* Load the checksum driver */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
++ if (jbd2_journal_has_csum_v2or3(journal)) {
+ journal->j_chksum_driver = crypto_alloc_shash("crc32c", 0, 0);
+ if (IS_ERR(journal->j_chksum_driver)) {
+ printk(KERN_ERR "JBD2: Cannot load crc32c driver.\n");
+@@ -1553,7 +1561,7 @@ static int journal_get_superblock(journal_t *journal)
+ }
+
+ /* Precompute checksum seed for all metadata */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ journal->j_csum_seed = jbd2_chksum(journal, ~0, sb->s_uuid,
+ sizeof(sb->s_uuid));
+
+@@ -1813,8 +1821,14 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+ if (!jbd2_journal_check_available_features(journal, compat, ro, incompat))
+ return 0;
+
+- /* Asking for checksumming v2 and v1? Only give them v2. */
+- if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V2 &&
++ /* If enabling v2 checksums, turn on v3 instead */
++ if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V2) {
++ incompat &= ~JBD2_FEATURE_INCOMPAT_CSUM_V2;
++ incompat |= JBD2_FEATURE_INCOMPAT_CSUM_V3;
++ }
++
++ /* Asking for checksumming v3 and v1? Only give them v3. */
++ if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V3 &&
+ compat & JBD2_FEATURE_COMPAT_CHECKSUM)
+ compat &= ~JBD2_FEATURE_COMPAT_CHECKSUM;
+
+@@ -1823,8 +1837,8 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+
+ sb = journal->j_superblock;
+
+- /* If enabling v2 checksums, update superblock */
+- if (INCOMPAT_FEATURE_ON(JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
++ /* If enabling v3 checksums, update superblock */
++ if (INCOMPAT_FEATURE_ON(JBD2_FEATURE_INCOMPAT_CSUM_V3)) {
+ sb->s_checksum_type = JBD2_CRC32C_CHKSUM;
+ sb->s_feature_compat &=
+ ~cpu_to_be32(JBD2_FEATURE_COMPAT_CHECKSUM);
+@@ -1842,8 +1856,7 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+ }
+
+ /* Precompute checksum seed for all metadata */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+- JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ journal->j_csum_seed = jbd2_chksum(journal, ~0,
+ sb->s_uuid,
+ sizeof(sb->s_uuid));
+@@ -1852,7 +1865,8 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
+ /* If enabling v1 checksums, downgrade superblock */
+ if (COMPAT_FEATURE_ON(JBD2_FEATURE_COMPAT_CHECKSUM))
+ sb->s_feature_incompat &=
+- ~cpu_to_be32(JBD2_FEATURE_INCOMPAT_CSUM_V2);
++ ~cpu_to_be32(JBD2_FEATURE_INCOMPAT_CSUM_V2 |
++ JBD2_FEATURE_INCOMPAT_CSUM_V3);
+
+ sb->s_feature_compat |= cpu_to_be32(compat);
+ sb->s_feature_ro_compat |= cpu_to_be32(ro);
+@@ -2165,16 +2179,20 @@ int jbd2_journal_blocks_per_page(struct inode *inode)
+ */
+ size_t journal_tag_bytes(journal_t *journal)
+ {
+- journal_block_tag_t tag;
+- size_t x = 0;
++ size_t sz;
++
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ return sizeof(journal_block_tag3_t);
++
++ sz = sizeof(journal_block_tag_t);
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+- x += sizeof(tag.t_checksum);
++ sz += sizeof(__u16);
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
+- return x + JBD2_TAG_SIZE64;
++ return sz;
+ else
+- return x + JBD2_TAG_SIZE32;
++ return sz - sizeof(__u32);
+ }
+
+ /*
+diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
+index 3b6bb19d60b1..9b329b55ffe3 100644
+--- a/fs/jbd2/recovery.c
++++ b/fs/jbd2/recovery.c
+@@ -181,7 +181,7 @@ static int jbd2_descr_block_csum_verify(journal_t *j,
+ __be32 provided;
+ __u32 calculated;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ tail = (struct jbd2_journal_block_tail *)(buf + j->j_blocksize -
+@@ -205,7 +205,7 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
+ int nr = 0, size = journal->j_blocksize;
+ int tag_bytes = journal_tag_bytes(journal);
+
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ size -= sizeof(struct jbd2_journal_block_tail);
+
+ tagp = &bh->b_data[sizeof(journal_header_t)];
+@@ -338,10 +338,11 @@ int jbd2_journal_skip_recovery(journal_t *journal)
+ return err;
+ }
+
+-static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
++static inline unsigned long long read_tag_block(journal_t *journal,
++ journal_block_tag_t *tag)
+ {
+ unsigned long long block = be32_to_cpu(tag->t_blocknr);
+- if (tag_bytes > JBD2_TAG_SIZE32)
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
+ block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32;
+ return block;
+ }
+@@ -384,7 +385,7 @@ static int jbd2_commit_block_csum_verify(journal_t *j, void *buf)
+ __be32 provided;
+ __u32 calculated;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ h = buf;
+@@ -399,17 +400,21 @@ static int jbd2_commit_block_csum_verify(journal_t *j, void *buf)
+ static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag,
+ void *buf, __u32 sequence)
+ {
++ journal_block_tag3_t *tag3 = (journal_block_tag3_t *)tag;
+ __u32 csum32;
+ __be32 seq;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ seq = cpu_to_be32(sequence);
+ csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&seq, sizeof(seq));
+ csum32 = jbd2_chksum(j, csum32, buf, j->j_blocksize);
+
+- return tag->t_checksum == cpu_to_be16(csum32);
++ if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ return tag3->t_checksum == cpu_to_be32(csum32);
++ else
++ return tag->t_checksum == cpu_to_be16(csum32);
+ }
+
+ static int do_one_pass(journal_t *journal,
+@@ -426,6 +431,7 @@ static int do_one_pass(journal_t *journal,
+ int tag_bytes = journal_tag_bytes(journal);
+ __u32 crc32_sum = ~0; /* Transactional Checksums */
+ int descr_csum_size = 0;
++ int block_error = 0;
+
+ /*
+ * First thing is to establish what we expect to find in the log
+@@ -512,8 +518,7 @@ static int do_one_pass(journal_t *journal,
+ switch(blocktype) {
+ case JBD2_DESCRIPTOR_BLOCK:
+ /* Verify checksum first */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+- JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ descr_csum_size =
+ sizeof(struct jbd2_journal_block_tail);
+ if (descr_csum_size > 0 &&
+@@ -574,7 +579,7 @@ static int do_one_pass(journal_t *journal,
+ unsigned long long blocknr;
+
+ J_ASSERT(obh != NULL);
+- blocknr = read_tag_block(tag_bytes,
++ blocknr = read_tag_block(journal,
+ tag);
+
+ /* If the block has been
+@@ -598,7 +603,8 @@ static int do_one_pass(journal_t *journal,
+ "checksum recovering "
+ "block %llu in log\n",
+ blocknr);
+- continue;
++ block_error = 1;
++ goto skip_write;
+ }
+
+ /* Find a buffer for the new
+@@ -797,7 +803,8 @@ static int do_one_pass(journal_t *journal,
+ success = -EIO;
+ }
+ }
+-
++ if (block_error && success == 0)
++ success = -EIO;
+ return success;
+
+ failed:
+@@ -811,7 +818,7 @@ static int jbd2_revoke_block_csum_verify(journal_t *j,
+ __be32 provided;
+ __u32 calculated;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return 1;
+
+ tail = (struct jbd2_journal_revoke_tail *)(buf + j->j_blocksize -
+diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
+index 198c9c10276d..d5e95a175c92 100644
+--- a/fs/jbd2/revoke.c
++++ b/fs/jbd2/revoke.c
+@@ -91,8 +91,8 @@
+ #include <linux/list.h>
+ #include <linux/init.h>
+ #include <linux/bio.h>
+-#endif
+ #include <linux/log2.h>
++#endif
+
+ static struct kmem_cache *jbd2_revoke_record_cache;
+ static struct kmem_cache *jbd2_revoke_table_cache;
+@@ -597,7 +597,7 @@ static void write_one_revoke_record(journal_t *journal,
+ offset = *offsetp;
+
+ /* Do we need to leave space at the end for a checksum? */
+- if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (jbd2_journal_has_csum_v2or3(journal))
+ csum_size = sizeof(struct jbd2_journal_revoke_tail);
+
+ /* Make sure we have a descriptor with space left for the record */
+@@ -644,7 +644,7 @@ static void jbd2_revoke_csum_set(journal_t *j, struct buffer_head *bh)
+ struct jbd2_journal_revoke_tail *tail;
+ __u32 csum;
+
+- if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
++ if (!jbd2_journal_has_csum_v2or3(j))
+ return;
+
+ tail = (struct jbd2_journal_revoke_tail *)(bh->b_data + j->j_blocksize -
+diff --git a/fs/nfs/nfs3acl.c b/fs/nfs/nfs3acl.c
+index 8f854dde4150..24c6898159cc 100644
+--- a/fs/nfs/nfs3acl.c
++++ b/fs/nfs/nfs3acl.c
+@@ -129,7 +129,10 @@ static int __nfs3_proc_setacls(struct inode *inode, struct posix_acl *acl,
+ .rpc_argp = &args,
+ .rpc_resp = &fattr,
+ };
+- int status;
++ int status = 0;
++
++ if (acl == NULL && (!S_ISDIR(inode->i_mode) || dfacl == NULL))
++ goto out;
+
+ status = -EOPNOTSUPP;
+ if (!nfs_server_capable(inode, NFS_CAP_ACLS))
+@@ -256,7 +259,7 @@ nfs3_list_one_acl(struct inode *inode, int type, const char *name, void *data,
+ char *p = data + *result;
+
+ acl = get_acl(inode, type);
+- if (!acl)
++ if (IS_ERR_OR_NULL(acl))
+ return 0;
+
+ posix_acl_release(acl);
+diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
+index 4bf3d97cc5a0..dac979866f83 100644
+--- a/fs/nfs/nfs4proc.c
++++ b/fs/nfs/nfs4proc.c
+@@ -2545,6 +2545,7 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
+ struct nfs4_closedata *calldata = data;
+ struct nfs4_state *state = calldata->state;
+ struct nfs_server *server = NFS_SERVER(calldata->inode);
++ nfs4_stateid *res_stateid = NULL;
+
+ dprintk("%s: begin!\n", __func__);
+ if (!nfs4_sequence_done(task, &calldata->res.seq_res))
+@@ -2555,12 +2556,12 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
+ */
+ switch (task->tk_status) {
+ case 0:
+- if (calldata->roc)
++ res_stateid = &calldata->res.stateid;
++ if (calldata->arg.fmode == 0 && calldata->roc)
+ pnfs_roc_set_barrier(state->inode,
+ calldata->roc_barrier);
+- nfs_clear_open_stateid(state, &calldata->res.stateid, 0);
+ renew_lease(server, calldata->timestamp);
+- goto out_release;
++ break;
+ case -NFS4ERR_ADMIN_REVOKED:
+ case -NFS4ERR_STALE_STATEID:
+ case -NFS4ERR_OLD_STATEID:
+@@ -2574,7 +2575,7 @@ static void nfs4_close_done(struct rpc_task *task, void *data)
+ goto out_release;
+ }
+ }
+- nfs_clear_open_stateid(state, NULL, calldata->arg.fmode);
++ nfs_clear_open_stateid(state, res_stateid, calldata->arg.fmode);
+ out_release:
+ nfs_release_seqid(calldata->arg.seqid);
+ nfs_refresh_inode(calldata->inode, calldata->res.fattr);
+@@ -2586,6 +2587,7 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
+ struct nfs4_closedata *calldata = data;
+ struct nfs4_state *state = calldata->state;
+ struct inode *inode = calldata->inode;
++ bool is_rdonly, is_wronly, is_rdwr;
+ int call_close = 0;
+
+ dprintk("%s: begin!\n", __func__);
+@@ -2593,18 +2595,24 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
+ goto out_wait;
+
+ task->tk_msg.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_OPEN_DOWNGRADE];
+- calldata->arg.fmode = FMODE_READ|FMODE_WRITE;
+ spin_lock(&state->owner->so_lock);
++ is_rdwr = test_bit(NFS_O_RDWR_STATE, &state->flags);
++ is_rdonly = test_bit(NFS_O_RDONLY_STATE, &state->flags);
++ is_wronly = test_bit(NFS_O_WRONLY_STATE, &state->flags);
++ /* Calculate the current open share mode */
++ calldata->arg.fmode = 0;
++ if (is_rdonly || is_rdwr)
++ calldata->arg.fmode |= FMODE_READ;
++ if (is_wronly || is_rdwr)
++ calldata->arg.fmode |= FMODE_WRITE;
+ /* Calculate the change in open mode */
+ if (state->n_rdwr == 0) {
+ if (state->n_rdonly == 0) {
+- call_close |= test_bit(NFS_O_RDONLY_STATE, &state->flags);
+- call_close |= test_bit(NFS_O_RDWR_STATE, &state->flags);
++ call_close |= is_rdonly || is_rdwr;
+ calldata->arg.fmode &= ~FMODE_READ;
+ }
+ if (state->n_wronly == 0) {
+- call_close |= test_bit(NFS_O_WRONLY_STATE, &state->flags);
+- call_close |= test_bit(NFS_O_RDWR_STATE, &state->flags);
++ call_close |= is_wronly || is_rdwr;
+ calldata->arg.fmode &= ~FMODE_WRITE;
+ }
+ }
+diff --git a/fs/nfs/super.c b/fs/nfs/super.c
+index 084af1060d79..3fd83327bbad 100644
+--- a/fs/nfs/super.c
++++ b/fs/nfs/super.c
+@@ -2180,7 +2180,7 @@ out_no_address:
+ return -EINVAL;
+ }
+
+-#define NFS_MOUNT_CMP_FLAGMASK ~(NFS_MOUNT_INTR \
++#define NFS_REMOUNT_CMP_FLAGMASK ~(NFS_MOUNT_INTR \
+ | NFS_MOUNT_SECURE \
+ | NFS_MOUNT_TCP \
+ | NFS_MOUNT_VER3 \
+@@ -2188,15 +2188,16 @@ out_no_address:
+ | NFS_MOUNT_NONLM \
+ | NFS_MOUNT_BROKEN_SUID \
+ | NFS_MOUNT_STRICTLOCK \
+- | NFS_MOUNT_UNSHARED \
+- | NFS_MOUNT_NORESVPORT \
+ | NFS_MOUNT_LEGACY_INTERFACE)
+
++#define NFS_MOUNT_CMP_FLAGMASK (NFS_REMOUNT_CMP_FLAGMASK & \
++ ~(NFS_MOUNT_UNSHARED | NFS_MOUNT_NORESVPORT))
++
+ static int
+ nfs_compare_remount_data(struct nfs_server *nfss,
+ struct nfs_parsed_mount_data *data)
+ {
+- if ((data->flags ^ nfss->flags) & NFS_MOUNT_CMP_FLAGMASK ||
++ if ((data->flags ^ nfss->flags) & NFS_REMOUNT_CMP_FLAGMASK ||
+ data->rsize != nfss->rsize ||
+ data->wsize != nfss->wsize ||
+ data->version != nfss->nfs_client->rpc_ops->version ||
+diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
+index 2c73cae9899d..0f23ad005826 100644
+--- a/fs/nfsd/nfs4callback.c
++++ b/fs/nfsd/nfs4callback.c
+@@ -689,7 +689,8 @@ static int setup_callback_client(struct nfs4_client *clp, struct nfs4_cb_conn *c
+ clp->cl_cb_session = ses;
+ args.bc_xprt = conn->cb_xprt;
+ args.prognumber = clp->cl_cb_session->se_cb_prog;
+- args.protocol = XPRT_TRANSPORT_BC_TCP;
++ args.protocol = conn->cb_xprt->xpt_class->xcl_ident |
++ XPRT_TRANSPORT_BC;
+ args.authflavor = ses->se_cb_sec.flavor;
+ }
+ /* Create RPC client */
+diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
+index 1879e43f2868..2f2edbb2a4a3 100644
+--- a/fs/nfsd/nfssvc.c
++++ b/fs/nfsd/nfssvc.c
+@@ -221,7 +221,8 @@ static int nfsd_startup_generic(int nrservs)
+ */
+ ret = nfsd_racache_init(2*nrservs);
+ if (ret)
+- return ret;
++ goto dec_users;
++
+ ret = nfs4_state_start();
+ if (ret)
+ goto out_racache;
+@@ -229,6 +230,8 @@ static int nfsd_startup_generic(int nrservs)
+
+ out_racache:
+ nfsd_racache_shutdown();
++dec_users:
++ nfsd_users--;
+ return ret;
+ }
+
+diff --git a/include/drm/drm_pciids.h b/include/drm/drm_pciids.h
+index 6dfd64b3a604..e973540cd15b 100644
+--- a/include/drm/drm_pciids.h
++++ b/include/drm/drm_pciids.h
+@@ -17,6 +17,7 @@
+ {0x1002, 0x1315, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x1316, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x1317, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
++ {0x1002, 0x1318, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x131B, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x131C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+ {0x1002, 0x131D, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_KAVERI|RADEON_NEW_MEMMAP|RADEON_IS_IGP}, \
+@@ -164,8 +165,11 @@
+ {0x1002, 0x6601, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6602, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6603, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6604, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6605, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6606, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6607, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6608, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6610, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6611, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6613, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+@@ -175,6 +179,8 @@
+ {0x1002, 0x6631, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_OLAND|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6640, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6641, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6646, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x6647, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6649, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6650, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6651, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_BONAIRE|RADEON_NEW_MEMMAP}, \
+@@ -297,6 +303,7 @@
+ {0x1002, 0x6829, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682A, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682B, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
++ {0x1002, 0x682C, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682D, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x682F, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+ {0x1002, 0x6830, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_VERDE|RADEON_IS_MOBILITY|RADEON_NEW_MEMMAP}, \
+diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
+index d5b50a19463c..0dae71e9971c 100644
+--- a/include/linux/jbd2.h
++++ b/include/linux/jbd2.h
+@@ -159,7 +159,11 @@ typedef struct journal_header_s
+ * journal_block_tag (in the descriptor). The other h_chksum* fields are
+ * not used.
+ *
+- * Checksum v1 and v2 are mutually exclusive features.
++ * If FEATURE_INCOMPAT_CSUM_V3 is set, the descriptor block uses
++ * journal_block_tag3_t to store a full 32-bit checksum. Everything else
++ * is the same as v2.
++ *
++ * Checksum v1, v2, and v3 are mutually exclusive features.
+ */
+ struct commit_header {
+ __be32 h_magic;
+@@ -179,6 +183,14 @@ struct commit_header {
+ * raw struct shouldn't be used for pointer math or sizeof() - use
+ * journal_tag_bytes(journal) instead to compute this.
+ */
++typedef struct journal_block_tag3_s
++{
++ __be32 t_blocknr; /* The on-disk block number */
++ __be32 t_flags; /* See below */
++ __be32 t_blocknr_high; /* most-significant high 32bits. */
++ __be32 t_checksum; /* crc32c(uuid+seq+block) */
++} journal_block_tag3_t;
++
+ typedef struct journal_block_tag_s
+ {
+ __be32 t_blocknr; /* The on-disk block number */
+@@ -187,9 +199,6 @@ typedef struct journal_block_tag_s
+ __be32 t_blocknr_high; /* most-significant high 32bits. */
+ } journal_block_tag_t;
+
+-#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+-#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))
+-
+ /* Tail of descriptor block, for checksumming */
+ struct jbd2_journal_block_tail {
+ __be32 t_checksum; /* crc32c(uuid+descr_block) */
+@@ -284,6 +293,7 @@ typedef struct journal_superblock_s
+ #define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
+ #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004
+ #define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008
++#define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x00000010
+
+ /* Features known to this kernel version: */
+ #define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM
+@@ -291,7 +301,8 @@ typedef struct journal_superblock_s
+ #define JBD2_KNOWN_INCOMPAT_FEATURES (JBD2_FEATURE_INCOMPAT_REVOKE | \
+ JBD2_FEATURE_INCOMPAT_64BIT | \
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
+- JBD2_FEATURE_INCOMPAT_CSUM_V2)
++ JBD2_FEATURE_INCOMPAT_CSUM_V2 | \
++ JBD2_FEATURE_INCOMPAT_CSUM_V3)
+
+ #ifdef __KERNEL__
+
+@@ -1296,6 +1307,15 @@ static inline int tid_geq(tid_t x, tid_t y)
+ extern int jbd2_journal_blocks_per_page(struct inode *inode);
+ extern size_t journal_tag_bytes(journal_t *journal);
+
++static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
++{
++ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2) ||
++ JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V3))
++ return 1;
++
++ return 0;
++}
++
+ /*
+ * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
+ * transaction control blocks.
+diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
+index 7235040a19b2..5d9d6f84b382 100644
+--- a/include/linux/sunrpc/svc_xprt.h
++++ b/include/linux/sunrpc/svc_xprt.h
+@@ -33,6 +33,7 @@ struct svc_xprt_class {
+ struct svc_xprt_ops *xcl_ops;
+ struct list_head xcl_list;
+ u32 xcl_max_payload;
++ int xcl_ident;
+ };
+
+ /*
+diff --git a/kernel/sched/core.c b/kernel/sched/core.c
+index bc1638b33449..0acf96b790c5 100644
+--- a/kernel/sched/core.c
++++ b/kernel/sched/core.c
+@@ -3558,9 +3558,10 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
+ };
+
+ /*
+- * Fixup the legacy SCHED_RESET_ON_FORK hack
++ * Fixup the legacy SCHED_RESET_ON_FORK hack, except if
++ * the policy=-1 was passed by sched_setparam().
+ */
+- if (policy & SCHED_RESET_ON_FORK) {
++ if ((policy != -1) && (policy & SCHED_RESET_ON_FORK)) {
+ attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
+ policy &= ~SCHED_RESET_ON_FORK;
+ attr.sched_policy = policy;
+diff --git a/mm/memory.c b/mm/memory.c
+index 8b44f765b645..0a21f3d162ae 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -751,7 +751,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn = pte_pfn(pte);
+
+ if (HAVE_PTE_SPECIAL) {
+- if (likely(!pte_special(pte) || pte_numa(pte)))
++ if (likely(!pte_special(pte)))
+ goto check_pfn;
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+@@ -777,15 +777,14 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ }
+ }
+
++ if (is_zero_pfn(pfn))
++ return NULL;
+ check_pfn:
+ if (unlikely(pfn > highest_memmap_pfn)) {
+ print_bad_pte(vma, addr, pte, NULL);
+ return NULL;
+ }
+
+- if (is_zero_pfn(pfn))
+- return NULL;
+-
+ /*
+ * NOTE! We still have PageReserved() pages in the page tables.
+ * eg. VDSO mappings can cause them to exist.
+diff --git a/mm/util.c b/mm/util.c
+index d5ea733c5082..33e9f4455800 100644
+--- a/mm/util.c
++++ b/mm/util.c
+@@ -277,17 +277,14 @@ pid_t vm_is_stack(struct task_struct *task,
+
+ if (in_group) {
+ struct task_struct *t;
+- rcu_read_lock();
+- if (!pid_alive(task))
+- goto done;
+
+- t = task;
+- do {
++ rcu_read_lock();
++ for_each_thread(task, t) {
+ if (vm_is_stack_for_task(t, vma)) {
+ ret = t->pid;
+ goto done;
+ }
+- } while_each_thread(task, t);
++ }
+ done:
+ rcu_read_unlock();
+ }
+diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
+index b507cd327d9b..b2437ee93657 100644
+--- a/net/sunrpc/svcsock.c
++++ b/net/sunrpc/svcsock.c
+@@ -692,6 +692,7 @@ static struct svc_xprt_class svc_udp_class = {
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_udp_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD_UDP,
++ .xcl_ident = XPRT_TRANSPORT_UDP,
+ };
+
+ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
+@@ -1292,6 +1293,7 @@ static struct svc_xprt_class svc_tcp_class = {
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_tcp_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
++ .xcl_ident = XPRT_TRANSPORT_TCP,
+ };
+
+ void svc_init_xprt_sock(void)
+diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
+index c3b2b3369e52..51c63165073c 100644
+--- a/net/sunrpc/xprt.c
++++ b/net/sunrpc/xprt.c
+@@ -1306,7 +1306,7 @@ struct rpc_xprt *xprt_create_transport(struct xprt_create *args)
+ }
+ }
+ spin_unlock(&xprt_list_lock);
+- printk(KERN_ERR "RPC: transport (%d) not supported\n", args->ident);
++ dprintk("RPC: transport (%d) not supported\n", args->ident);
+ return ERR_PTR(-EIO);
+
+ found:
+diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
+index e7323fbbd348..06a5d9235107 100644
+--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
++++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
+@@ -92,6 +92,7 @@ struct svc_xprt_class svc_rdma_class = {
+ .xcl_owner = THIS_MODULE,
+ .xcl_ops = &svc_rdma_ops,
+ .xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
++ .xcl_ident = XPRT_TRANSPORT_RDMA,
+ };
+
+ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+diff --git a/sound/pci/Kconfig b/sound/pci/Kconfig
+index 3a3a3a71088b..50dd0086cfb1 100644
+--- a/sound/pci/Kconfig
++++ b/sound/pci/Kconfig
+@@ -858,8 +858,8 @@ config SND_VIRTUOSO
+ select SND_JACK if INPUT=y || INPUT=SND
+ help
+ Say Y here to include support for sound cards based on the
+- Asus AV66/AV100/AV200 chips, i.e., Xonar D1, DX, D2, D2X, DS,
+- Essence ST (Deluxe), and Essence STX.
++ Asus AV66/AV100/AV200 chips, i.e., Xonar D1, DX, D2, D2X, DS, DSX,
++ Essence ST (Deluxe), and Essence STX (II).
+ Support for the HDAV1.3 (Deluxe) and HDAV1.3 Slim is experimental;
+ for the Xense, missing.
+
+diff --git a/sound/pci/hda/patch_ca0132.c b/sound/pci/hda/patch_ca0132.c
+index 092f2bd030bd..b686aca7f000 100644
+--- a/sound/pci/hda/patch_ca0132.c
++++ b/sound/pci/hda/patch_ca0132.c
+@@ -4376,6 +4376,9 @@ static void ca0132_download_dsp(struct hda_codec *codec)
+ return; /* NOP */
+ #endif
+
++ if (spec->dsp_state == DSP_DOWNLOAD_FAILED)
++ return; /* don't retry failures */
++
+ chipio_enable_clocks(codec);
+ spec->dsp_state = DSP_DOWNLOADING;
+ if (!ca0132_download_dsp_images(codec))
+@@ -4552,7 +4555,8 @@ static int ca0132_init(struct hda_codec *codec)
+ struct auto_pin_cfg *cfg = &spec->autocfg;
+ int i;
+
+- spec->dsp_state = DSP_DOWNLOAD_INIT;
++ if (spec->dsp_state != DSP_DOWNLOAD_FAILED)
++ spec->dsp_state = DSP_DOWNLOAD_INIT;
+ spec->curr_chip_addx = INVALID_CHIP_ADDRESS;
+
+ snd_hda_power_up(codec);
+@@ -4663,6 +4667,7 @@ static int patch_ca0132(struct hda_codec *codec)
+ codec->spec = spec;
+ spec->codec = codec;
+
++ spec->dsp_state = DSP_DOWNLOAD_INIT;
+ spec->num_mixers = 1;
+ spec->mixers[0] = ca0132_mixer;
+
+diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
+index b60824e90408..25728aaacc26 100644
+--- a/sound/pci/hda/patch_realtek.c
++++ b/sound/pci/hda/patch_realtek.c
+@@ -180,6 +180,8 @@ static void alc_fix_pll(struct hda_codec *codec)
+ spec->pll_coef_idx);
+ val = snd_hda_codec_read(codec, spec->pll_nid, 0,
+ AC_VERB_GET_PROC_COEF, 0);
++ if (val == -1)
++ return;
+ snd_hda_codec_write(codec, spec->pll_nid, 0, AC_VERB_SET_COEF_INDEX,
+ spec->pll_coef_idx);
+ snd_hda_codec_write(codec, spec->pll_nid, 0, AC_VERB_SET_PROC_COEF,
+@@ -2784,6 +2786,8 @@ static int alc269_parse_auto_config(struct hda_codec *codec)
+ static void alc269vb_toggle_power_output(struct hda_codec *codec, int power_up)
+ {
+ int val = alc_read_coef_idx(codec, 0x04);
++ if (val == -1)
++ return;
+ if (power_up)
+ val |= 1 << 11;
+ else
+@@ -3242,6 +3246,15 @@ static int alc269_resume(struct hda_codec *codec)
+ snd_hda_codec_resume_cache(codec);
+ alc_inv_dmic_sync(codec, true);
+ hda_call_check_power_status(codec, 0x01);
++
++ /* on some machine, the BIOS will clear the codec gpio data when enter
++ * suspend, and won't restore the data after resume, so we restore it
++ * in the driver.
++ */
++ if (spec->gpio_led)
++ snd_hda_codec_write(codec, codec->afg, 0, AC_VERB_SET_GPIO_DATA,
++ spec->gpio_led);
++
+ if (spec->has_alc5505_dsp)
+ alc5505_dsp_resume(codec);
+
+@@ -4782,6 +4795,8 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = {
+ SND_PCI_QUIRK(0x103c, 0x1983, "HP Pavilion", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x218b, "HP", ALC269_FIXUP_LIMIT_INT_MIC_BOOST_MUTE_LED),
+ /* ALC282 */
++ SND_PCI_QUIRK(0x103c, 0x2191, "HP Touchsmart 14", ALC269_FIXUP_HP_MUTE_LED_MIC1),
++ SND_PCI_QUIRK(0x103c, 0x2192, "HP Touchsmart 15", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x220d, "HP", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x220e, "HP", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+ SND_PCI_QUIRK(0x103c, 0x220f, "HP", ALC269_FIXUP_HP_MUTE_LED_MIC1),
+@@ -5122,27 +5137,30 @@ static void alc269_fill_coef(struct hda_codec *codec)
+ if ((alc_get_coef0(codec) & 0x00ff) == 0x017) {
+ val = alc_read_coef_idx(codec, 0x04);
+ /* Power up output pin */
+- alc_write_coef_idx(codec, 0x04, val | (1<<11));
++ if (val != -1)
++ alc_write_coef_idx(codec, 0x04, val | (1<<11));
+ }
+
+ if ((alc_get_coef0(codec) & 0x00ff) == 0x018) {
+ val = alc_read_coef_idx(codec, 0xd);
+- if ((val & 0x0c00) >> 10 != 0x1) {
++ if (val != -1 && (val & 0x0c00) >> 10 != 0x1) {
+ /* Capless ramp up clock control */
+ alc_write_coef_idx(codec, 0xd, val | (1<<10));
+ }
+ val = alc_read_coef_idx(codec, 0x17);
+- if ((val & 0x01c0) >> 6 != 0x4) {
++ if (val != -1 && (val & 0x01c0) >> 6 != 0x4) {
+ /* Class D power on reset */
+ alc_write_coef_idx(codec, 0x17, val | (1<<7));
+ }
+ }
+
+ val = alc_read_coef_idx(codec, 0xd); /* Class D */
+- alc_write_coef_idx(codec, 0xd, val | (1<<14));
++ if (val != -1)
++ alc_write_coef_idx(codec, 0xd, val | (1<<14));
+
+ val = alc_read_coef_idx(codec, 0x4); /* HP */
+- alc_write_coef_idx(codec, 0x4, val | (1<<11));
++ if (val != -1)
++ alc_write_coef_idx(codec, 0x4, val | (1<<11));
+ }
+
+ /*
+diff --git a/sound/pci/hda/patch_sigmatel.c b/sound/pci/hda/patch_sigmatel.c
+index 3744ea4e843d..4d3a3b932690 100644
+--- a/sound/pci/hda/patch_sigmatel.c
++++ b/sound/pci/hda/patch_sigmatel.c
+@@ -84,6 +84,7 @@ enum {
+ STAC_DELL_EQ,
+ STAC_ALIENWARE_M17X,
+ STAC_92HD89XX_HP_FRONT_JACK,
++ STAC_92HD89XX_HP_Z1_G2_RIGHT_MIC_JACK,
+ STAC_92HD73XX_MODELS
+ };
+
+@@ -1809,6 +1810,11 @@ static const struct hda_pintbl stac92hd89xx_hp_front_jack_pin_configs[] = {
+ {}
+ };
+
++static const struct hda_pintbl stac92hd89xx_hp_z1_g2_right_mic_jack_pin_configs[] = {
++ { 0x0e, 0x400000f0 },
++ {}
++};
++
+ static void stac92hd73xx_fixup_ref(struct hda_codec *codec,
+ const struct hda_fixup *fix, int action)
+ {
+@@ -1931,6 +1937,10 @@ static const struct hda_fixup stac92hd73xx_fixups[] = {
+ [STAC_92HD89XX_HP_FRONT_JACK] = {
+ .type = HDA_FIXUP_PINS,
+ .v.pins = stac92hd89xx_hp_front_jack_pin_configs,
++ },
++ [STAC_92HD89XX_HP_Z1_G2_RIGHT_MIC_JACK] = {
++ .type = HDA_FIXUP_PINS,
++ .v.pins = stac92hd89xx_hp_z1_g2_right_mic_jack_pin_configs,
+ }
+ };
+
+@@ -1991,6 +2001,8 @@ static const struct snd_pci_quirk stac92hd73xx_fixup_tbl[] = {
+ "Alienware M17x", STAC_ALIENWARE_M17X),
+ SND_PCI_QUIRK(PCI_VENDOR_ID_DELL, 0x0490,
+ "Alienware M17x R3", STAC_DELL_EQ),
++ SND_PCI_QUIRK(PCI_VENDOR_ID_HP, 0x1927,
++ "HP Z1 G2", STAC_92HD89XX_HP_Z1_G2_RIGHT_MIC_JACK),
+ SND_PCI_QUIRK(PCI_VENDOR_ID_HP, 0x2b17,
+ "unknown HP", STAC_92HD89XX_HP_FRONT_JACK),
+ {} /* terminator */
+diff --git a/sound/pci/oxygen/virtuoso.c b/sound/pci/oxygen/virtuoso.c
+index 64b9fda5f04a..dbbbacfd535e 100644
+--- a/sound/pci/oxygen/virtuoso.c
++++ b/sound/pci/oxygen/virtuoso.c
+@@ -53,6 +53,7 @@ static DEFINE_PCI_DEVICE_TABLE(xonar_ids) = {
+ { OXYGEN_PCI_SUBID(0x1043, 0x835e) },
+ { OXYGEN_PCI_SUBID(0x1043, 0x838e) },
+ { OXYGEN_PCI_SUBID(0x1043, 0x8522) },
++ { OXYGEN_PCI_SUBID(0x1043, 0x85f4) },
+ { OXYGEN_PCI_SUBID_BROKEN_EEPROM },
+ { }
+ };
+diff --git a/sound/pci/oxygen/xonar_pcm179x.c b/sound/pci/oxygen/xonar_pcm179x.c
+index c8c7f2c9b355..e02605931669 100644
+--- a/sound/pci/oxygen/xonar_pcm179x.c
++++ b/sound/pci/oxygen/xonar_pcm179x.c
+@@ -100,8 +100,8 @@
+ */
+
+ /*
+- * Xonar Essence ST (Deluxe)/STX
+- * -----------------------------
++ * Xonar Essence ST (Deluxe)/STX (II)
++ * ----------------------------------
+ *
+ * CMI8788:
+ *
+@@ -1138,6 +1138,14 @@ int get_xonar_pcm179x_model(struct oxygen *chip,
+ chip->model.resume = xonar_stx_resume;
+ chip->model.set_dac_params = set_pcm1796_params;
+ break;
++ case 0x85f4:
++ chip->model = model_xonar_st;
++ /* TODO: daughterboard support */
++ chip->model.shortname = "Xonar STX II";
++ chip->model.init = xonar_stx_init;
++ chip->model.resume = xonar_stx_resume;
++ chip->model.set_dac_params = set_pcm1796_params;
++ break;
+ default:
+ return -EINVAL;
+ }
+diff --git a/sound/usb/quirks-table.h b/sound/usb/quirks-table.h
+index f652b10ce905..223c47b33ba3 100644
+--- a/sound/usb/quirks-table.h
++++ b/sound/usb/quirks-table.h
+@@ -1581,6 +1581,35 @@ YAMAHA_DEVICE(0x7010, "UB99"),
+ }
+ },
+ {
++ /* BOSS ME-25 */
++ USB_DEVICE(0x0582, 0x0113),
++ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
++ .ifnum = QUIRK_ANY_INTERFACE,
++ .type = QUIRK_COMPOSITE,
++ .data = (const struct snd_usb_audio_quirk[]) {
++ {
++ .ifnum = 0,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 1,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 2,
++ .type = QUIRK_MIDI_FIXED_ENDPOINT,
++ .data = & (const struct snd_usb_midi_endpoint_info) {
++ .out_cables = 0x0001,
++ .in_cables = 0x0001
++ }
++ },
++ {
++ .ifnum = -1
++ }
++ }
++ }
++},
++{
+ /* only 44.1 kHz works at the moment */
+ USB_DEVICE(0x0582, 0x0120),
+ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
+diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c
+index 7c57f2268dd7..19a921eb75f1 100644
+--- a/sound/usb/quirks.c
++++ b/sound/usb/quirks.c
+@@ -670,7 +670,7 @@ static int snd_usb_gamecon780_boot_quirk(struct usb_device *dev)
+ /* set the initial volume and don't change; other values are either
+ * too loud or silent due to firmware bug (bko#65251)
+ */
+- u8 buf[2] = { 0x74, 0xdc };
++ u8 buf[2] = { 0x74, 0xe3 };
+ return snd_usb_ctl_msg(dev, usb_sndctrlpipe(dev, 0), UAC_SET_CUR,
+ USB_RECIP_INTERFACE | USB_TYPE_CLASS | USB_DIR_OUT,
+ UAC_FU_VOLUME << 8, 9 << 8, buf, 2);
+diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
+index 2458a1dc2ba9..e8ce34c9db32 100644
+--- a/virt/kvm/ioapic.c
++++ b/virt/kvm/ioapic.c
+@@ -254,10 +254,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
+ spin_lock(&ioapic->lock);
+ for (index = 0; index < IOAPIC_NUM_PINS; index++) {
+ e = &ioapic->redirtbl[index];
+- if (!e->fields.mask &&
+- (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
+- kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
+- index) || index == RTC_GSI)) {
++ if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
++ kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
++ index == RTC_GSI) {
+ if (kvm_apic_match_dest(vcpu, NULL, 0,
+ e->fields.dest_id, e->fields.dest_mode)) {
+ __set_bit(e->fields.vector,
+diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
+index 0df7d4b34dfe..714b94932312 100644
+--- a/virt/kvm/iommu.c
++++ b/virt/kvm/iommu.c
+@@ -61,6 +61,14 @@ static pfn_t kvm_pin_pages(struct kvm_memory_slot *slot, gfn_t gfn,
+ return pfn;
+ }
+
++static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
++{
++ unsigned long i;
++
++ for (i = 0; i < npages; ++i)
++ kvm_release_pfn_clean(pfn + i);
++}
++
+ int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+ {
+ gfn_t gfn, end_gfn;
+@@ -123,6 +131,7 @@ int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+ if (r) {
+ printk(KERN_ERR "kvm_iommu_map_address:"
+ "iommu failed to map pfn=%llx\n", pfn);
++ kvm_unpin_pages(kvm, pfn, page_size);
+ goto unmap_pages;
+ }
+
+@@ -134,7 +143,7 @@ int kvm_iommu_map_pages(struct kvm *kvm, struct kvm_memory_slot *slot)
+ return 0;
+
+ unmap_pages:
+- kvm_iommu_put_pages(kvm, slot->base_gfn, gfn);
++ kvm_iommu_put_pages(kvm, slot->base_gfn, gfn - slot->base_gfn);
+ return r;
+ }
+
+@@ -266,14 +275,6 @@ out_unlock:
+ return r;
+ }
+
+-static void kvm_unpin_pages(struct kvm *kvm, pfn_t pfn, unsigned long npages)
+-{
+- unsigned long i;
+-
+- for (i = 0; i < npages; ++i)
+- kvm_release_pfn_clean(pfn + i);
+-}
+-
+ static void kvm_iommu_put_pages(struct kvm *kvm,
+ gfn_t base_gfn, unsigned long npages)
+ {
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-17 22:19 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-09-17 22:19 UTC (permalink / raw
To: gentoo-commits
commit: e086bd08b11f58e9c3bedb30b3d52f2ca6fdcf7d
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Wed Sep 17 22:22:02 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Wed Sep 17 22:22:02 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=e086bd08
Linux patch 3.16.3
---
0000_README | 4 +
1002_linux-3.16.3.patch | 7142 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 7146 insertions(+)
diff --git a/0000_README b/0000_README
index 1ecfc95..706e53e 100644
--- a/0000_README
+++ b/0000_README
@@ -50,6 +50,10 @@ Patch: 1001_linux-3.16.2.patch
From: http://www.kernel.org
Desc: Linux 3.16.2
+Patch: 1002_linux-3.16.3.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.3
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1002_linux-3.16.3.patch b/1002_linux-3.16.3.patch
new file mode 100644
index 0000000..987f475
--- /dev/null
+++ b/1002_linux-3.16.3.patch
@@ -0,0 +1,7142 @@
+diff --git a/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt b/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt
+index 46f344965313..4eb7997674a0 100644
+--- a/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt
++++ b/Documentation/devicetree/bindings/sound/adi,axi-spdif-tx.txt
+@@ -1,7 +1,7 @@
+ ADI AXI-SPDIF controller
+
+ Required properties:
+- - compatible : Must be "adi,axi-spdif-1.00.a"
++ - compatible : Must be "adi,axi-spdif-tx-1.00.a"
+ - reg : Must contain SPDIF core's registers location and length
+ - clocks : Pairs of phandle and specifier referencing the controller's clocks.
+ The controller expects two clocks, the clock used for the AXI interface and
+diff --git a/Makefile b/Makefile
+index c2617526e605..9b25a830a9d7 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 2
++SUBLEVEL = 3
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/omap3-n900.dts b/arch/arm/boot/dts/omap3-n900.dts
+index b15f1a77d684..1fe45d1f75ec 100644
+--- a/arch/arm/boot/dts/omap3-n900.dts
++++ b/arch/arm/boot/dts/omap3-n900.dts
+@@ -353,7 +353,7 @@
+ };
+
+ twl_power: power {
+- compatible = "ti,twl4030-power-n900";
++ compatible = "ti,twl4030-power-n900", "ti,twl4030-power-idle-osc-off";
+ ti,use_poweroff;
+ };
+ };
+diff --git a/arch/mips/cavium-octeon/setup.c b/arch/mips/cavium-octeon/setup.c
+index 008e9c8b8eac..c9d9c627e244 100644
+--- a/arch/mips/cavium-octeon/setup.c
++++ b/arch/mips/cavium-octeon/setup.c
+@@ -458,6 +458,18 @@ static void octeon_halt(void)
+ octeon_kill_core(NULL);
+ }
+
++static char __read_mostly octeon_system_type[80];
++
++static int __init init_octeon_system_type(void)
++{
++ snprintf(octeon_system_type, sizeof(octeon_system_type), "%s (%s)",
++ cvmx_board_type_to_string(octeon_bootinfo->board_type),
++ octeon_model_get_string(read_c0_prid()));
++
++ return 0;
++}
++early_initcall(init_octeon_system_type);
++
+ /**
+ * Return a string representing the system type
+ *
+@@ -465,11 +477,7 @@ static void octeon_halt(void)
+ */
+ const char *octeon_board_type_string(void)
+ {
+- static char name[80];
+- sprintf(name, "%s (%s)",
+- cvmx_board_type_to_string(octeon_bootinfo->board_type),
+- octeon_model_get_string(read_c0_prid()));
+- return name;
++ return octeon_system_type;
+ }
+
+ const char *get_system_type(void)
+diff --git a/arch/mips/include/asm/eva.h b/arch/mips/include/asm/eva.h
+new file mode 100644
+index 000000000000..a3d1807f227c
+--- /dev/null
++++ b/arch/mips/include/asm/eva.h
+@@ -0,0 +1,43 @@
++/*
++ * This file is subject to the terms and conditions of the GNU General Public
++ * License. See the file "COPYING" in the main directory of this archive
++ * for more details.
++ *
++ * Copyright (C) 2014, Imagination Technologies Ltd.
++ *
++ * EVA functions for generic code
++ */
++
++#ifndef _ASM_EVA_H
++#define _ASM_EVA_H
++
++#include <kernel-entry-init.h>
++
++#ifdef __ASSEMBLY__
++
++#ifdef CONFIG_EVA
++
++/*
++ * EVA early init code
++ *
++ * Platforms must define their own 'platform_eva_init' macro in
++ * their kernel-entry-init.h header. This macro usually does the
++ * platform specific configuration of the segmentation registers,
++ * and it is normally called from assembly code.
++ *
++ */
++
++.macro eva_init
++platform_eva_init
++.endm
++
++#else
++
++.macro eva_init
++.endm
++
++#endif /* CONFIG_EVA */
++
++#endif /* __ASSEMBLY__ */
++
++#endif
+diff --git a/arch/mips/include/asm/mach-malta/kernel-entry-init.h b/arch/mips/include/asm/mach-malta/kernel-entry-init.h
+index 77eeda77e73c..0cf8622db27f 100644
+--- a/arch/mips/include/asm/mach-malta/kernel-entry-init.h
++++ b/arch/mips/include/asm/mach-malta/kernel-entry-init.h
+@@ -10,14 +10,15 @@
+ #ifndef __ASM_MACH_MIPS_KERNEL_ENTRY_INIT_H
+ #define __ASM_MACH_MIPS_KERNEL_ENTRY_INIT_H
+
++#include <asm/regdef.h>
++#include <asm/mipsregs.h>
++
+ /*
+ * Prepare segments for EVA boot:
+ *
+ * This is in case the processor boots in legacy configuration
+ * (SI_EVAReset is de-asserted and CONFIG5.K == 0)
+ *
+- * On entry, t1 is loaded with CP0_CONFIG
+- *
+ * ========================= Mappings =============================
+ * Virtual memory Physical memory Mapping
+ * 0x00000000 - 0x7fffffff 0x80000000 - 0xfffffffff MUSUK (kuseg)
+@@ -30,12 +31,20 @@
+ *
+ *
+ * Lowmem is expanded to 2GB
++ *
++ * The following code uses the t0, t1, t2 and ra registers without
++ * previously preserving them.
++ *
+ */
+- .macro eva_entry
++ .macro platform_eva_init
++
++ .set push
++ .set reorder
+ /*
+ * Get Config.K0 value and use it to program
+ * the segmentation registers
+ */
++ mfc0 t1, CP0_CONFIG
+ andi t1, 0x7 /* CCA */
+ move t2, t1
+ ins t2, t1, 16, 3
+@@ -77,6 +86,8 @@
+ mtc0 t0, $16, 5
+ sync
+ jal mips_ihb
++
++ .set pop
+ .endm
+
+ .macro kernel_entry_setup
+@@ -95,7 +106,7 @@
+ sll t0, t0, 6 /* SC bit */
+ bgez t0, 9f
+
+- eva_entry
++ platform_eva_init
+ b 0f
+ 9:
+ /* Assume we came from YAMON... */
+@@ -127,8 +138,7 @@ nonsc_processor:
+ #ifdef CONFIG_EVA
+ sync
+ ehb
+- mfc0 t1, CP0_CONFIG
+- eva_entry
++ platform_eva_init
+ #endif
+ .endm
+
+diff --git a/arch/mips/include/asm/ptrace.h b/arch/mips/include/asm/ptrace.h
+index 7e6e682aece3..c301fa9b139f 100644
+--- a/arch/mips/include/asm/ptrace.h
++++ b/arch/mips/include/asm/ptrace.h
+@@ -23,7 +23,7 @@
+ struct pt_regs {
+ #ifdef CONFIG_32BIT
+ /* Pad bytes for argument save space on the stack. */
+- unsigned long pad0[6];
++ unsigned long pad0[8];
+ #endif
+
+ /* Saved main processor registers. */
+diff --git a/arch/mips/include/asm/reg.h b/arch/mips/include/asm/reg.h
+index 910e71a12466..b8343ccbc989 100644
+--- a/arch/mips/include/asm/reg.h
++++ b/arch/mips/include/asm/reg.h
+@@ -12,116 +12,194 @@
+ #ifndef __ASM_MIPS_REG_H
+ #define __ASM_MIPS_REG_H
+
+-
+-#if defined(CONFIG_32BIT) || defined(WANT_COMPAT_REG_H)
+-
+-#define EF_R0 6
+-#define EF_R1 7
+-#define EF_R2 8
+-#define EF_R3 9
+-#define EF_R4 10
+-#define EF_R5 11
+-#define EF_R6 12
+-#define EF_R7 13
+-#define EF_R8 14
+-#define EF_R9 15
+-#define EF_R10 16
+-#define EF_R11 17
+-#define EF_R12 18
+-#define EF_R13 19
+-#define EF_R14 20
+-#define EF_R15 21
+-#define EF_R16 22
+-#define EF_R17 23
+-#define EF_R18 24
+-#define EF_R19 25
+-#define EF_R20 26
+-#define EF_R21 27
+-#define EF_R22 28
+-#define EF_R23 29
+-#define EF_R24 30
+-#define EF_R25 31
++#define MIPS32_EF_R0 6
++#define MIPS32_EF_R1 7
++#define MIPS32_EF_R2 8
++#define MIPS32_EF_R3 9
++#define MIPS32_EF_R4 10
++#define MIPS32_EF_R5 11
++#define MIPS32_EF_R6 12
++#define MIPS32_EF_R7 13
++#define MIPS32_EF_R8 14
++#define MIPS32_EF_R9 15
++#define MIPS32_EF_R10 16
++#define MIPS32_EF_R11 17
++#define MIPS32_EF_R12 18
++#define MIPS32_EF_R13 19
++#define MIPS32_EF_R14 20
++#define MIPS32_EF_R15 21
++#define MIPS32_EF_R16 22
++#define MIPS32_EF_R17 23
++#define MIPS32_EF_R18 24
++#define MIPS32_EF_R19 25
++#define MIPS32_EF_R20 26
++#define MIPS32_EF_R21 27
++#define MIPS32_EF_R22 28
++#define MIPS32_EF_R23 29
++#define MIPS32_EF_R24 30
++#define MIPS32_EF_R25 31
+
+ /*
+ * k0/k1 unsaved
+ */
+-#define EF_R26 32
+-#define EF_R27 33
++#define MIPS32_EF_R26 32
++#define MIPS32_EF_R27 33
+
+-#define EF_R28 34
+-#define EF_R29 35
+-#define EF_R30 36
+-#define EF_R31 37
++#define MIPS32_EF_R28 34
++#define MIPS32_EF_R29 35
++#define MIPS32_EF_R30 36
++#define MIPS32_EF_R31 37
+
+ /*
+ * Saved special registers
+ */
+-#define EF_LO 38
+-#define EF_HI 39
+-
+-#define EF_CP0_EPC 40
+-#define EF_CP0_BADVADDR 41
+-#define EF_CP0_STATUS 42
+-#define EF_CP0_CAUSE 43
+-#define EF_UNUSED0 44
+-
+-#define EF_SIZE 180
+-
+-#endif
+-
+-#if defined(CONFIG_64BIT) && !defined(WANT_COMPAT_REG_H)
+-
+-#define EF_R0 0
+-#define EF_R1 1
+-#define EF_R2 2
+-#define EF_R3 3
+-#define EF_R4 4
+-#define EF_R5 5
+-#define EF_R6 6
+-#define EF_R7 7
+-#define EF_R8 8
+-#define EF_R9 9
+-#define EF_R10 10
+-#define EF_R11 11
+-#define EF_R12 12
+-#define EF_R13 13
+-#define EF_R14 14
+-#define EF_R15 15
+-#define EF_R16 16
+-#define EF_R17 17
+-#define EF_R18 18
+-#define EF_R19 19
+-#define EF_R20 20
+-#define EF_R21 21
+-#define EF_R22 22
+-#define EF_R23 23
+-#define EF_R24 24
+-#define EF_R25 25
++#define MIPS32_EF_LO 38
++#define MIPS32_EF_HI 39
++
++#define MIPS32_EF_CP0_EPC 40
++#define MIPS32_EF_CP0_BADVADDR 41
++#define MIPS32_EF_CP0_STATUS 42
++#define MIPS32_EF_CP0_CAUSE 43
++#define MIPS32_EF_UNUSED0 44
++
++#define MIPS32_EF_SIZE 180
++
++#define MIPS64_EF_R0 0
++#define MIPS64_EF_R1 1
++#define MIPS64_EF_R2 2
++#define MIPS64_EF_R3 3
++#define MIPS64_EF_R4 4
++#define MIPS64_EF_R5 5
++#define MIPS64_EF_R6 6
++#define MIPS64_EF_R7 7
++#define MIPS64_EF_R8 8
++#define MIPS64_EF_R9 9
++#define MIPS64_EF_R10 10
++#define MIPS64_EF_R11 11
++#define MIPS64_EF_R12 12
++#define MIPS64_EF_R13 13
++#define MIPS64_EF_R14 14
++#define MIPS64_EF_R15 15
++#define MIPS64_EF_R16 16
++#define MIPS64_EF_R17 17
++#define MIPS64_EF_R18 18
++#define MIPS64_EF_R19 19
++#define MIPS64_EF_R20 20
++#define MIPS64_EF_R21 21
++#define MIPS64_EF_R22 22
++#define MIPS64_EF_R23 23
++#define MIPS64_EF_R24 24
++#define MIPS64_EF_R25 25
+
+ /*
+ * k0/k1 unsaved
+ */
+-#define EF_R26 26
+-#define EF_R27 27
++#define MIPS64_EF_R26 26
++#define MIPS64_EF_R27 27
+
+
+-#define EF_R28 28
+-#define EF_R29 29
+-#define EF_R30 30
+-#define EF_R31 31
++#define MIPS64_EF_R28 28
++#define MIPS64_EF_R29 29
++#define MIPS64_EF_R30 30
++#define MIPS64_EF_R31 31
+
+ /*
+ * Saved special registers
+ */
+-#define EF_LO 32
+-#define EF_HI 33
+-
+-#define EF_CP0_EPC 34
+-#define EF_CP0_BADVADDR 35
+-#define EF_CP0_STATUS 36
+-#define EF_CP0_CAUSE 37
+-
+-#define EF_SIZE 304 /* size in bytes */
++#define MIPS64_EF_LO 32
++#define MIPS64_EF_HI 33
++
++#define MIPS64_EF_CP0_EPC 34
++#define MIPS64_EF_CP0_BADVADDR 35
++#define MIPS64_EF_CP0_STATUS 36
++#define MIPS64_EF_CP0_CAUSE 37
++
++#define MIPS64_EF_SIZE 304 /* size in bytes */
++
++#if defined(CONFIG_32BIT)
++
++#define EF_R0 MIPS32_EF_R0
++#define EF_R1 MIPS32_EF_R1
++#define EF_R2 MIPS32_EF_R2
++#define EF_R3 MIPS32_EF_R3
++#define EF_R4 MIPS32_EF_R4
++#define EF_R5 MIPS32_EF_R5
++#define EF_R6 MIPS32_EF_R6
++#define EF_R7 MIPS32_EF_R7
++#define EF_R8 MIPS32_EF_R8
++#define EF_R9 MIPS32_EF_R9
++#define EF_R10 MIPS32_EF_R10
++#define EF_R11 MIPS32_EF_R11
++#define EF_R12 MIPS32_EF_R12
++#define EF_R13 MIPS32_EF_R13
++#define EF_R14 MIPS32_EF_R14
++#define EF_R15 MIPS32_EF_R15
++#define EF_R16 MIPS32_EF_R16
++#define EF_R17 MIPS32_EF_R17
++#define EF_R18 MIPS32_EF_R18
++#define EF_R19 MIPS32_EF_R19
++#define EF_R20 MIPS32_EF_R20
++#define EF_R21 MIPS32_EF_R21
++#define EF_R22 MIPS32_EF_R22
++#define EF_R23 MIPS32_EF_R23
++#define EF_R24 MIPS32_EF_R24
++#define EF_R25 MIPS32_EF_R25
++#define EF_R26 MIPS32_EF_R26
++#define EF_R27 MIPS32_EF_R27
++#define EF_R28 MIPS32_EF_R28
++#define EF_R29 MIPS32_EF_R29
++#define EF_R30 MIPS32_EF_R30
++#define EF_R31 MIPS32_EF_R31
++#define EF_LO MIPS32_EF_LO
++#define EF_HI MIPS32_EF_HI
++#define EF_CP0_EPC MIPS32_EF_CP0_EPC
++#define EF_CP0_BADVADDR MIPS32_EF_CP0_BADVADDR
++#define EF_CP0_STATUS MIPS32_EF_CP0_STATUS
++#define EF_CP0_CAUSE MIPS32_EF_CP0_CAUSE
++#define EF_UNUSED0 MIPS32_EF_UNUSED0
++#define EF_SIZE MIPS32_EF_SIZE
++
++#elif defined(CONFIG_64BIT)
++
++#define EF_R0 MIPS64_EF_R0
++#define EF_R1 MIPS64_EF_R1
++#define EF_R2 MIPS64_EF_R2
++#define EF_R3 MIPS64_EF_R3
++#define EF_R4 MIPS64_EF_R4
++#define EF_R5 MIPS64_EF_R5
++#define EF_R6 MIPS64_EF_R6
++#define EF_R7 MIPS64_EF_R7
++#define EF_R8 MIPS64_EF_R8
++#define EF_R9 MIPS64_EF_R9
++#define EF_R10 MIPS64_EF_R10
++#define EF_R11 MIPS64_EF_R11
++#define EF_R12 MIPS64_EF_R12
++#define EF_R13 MIPS64_EF_R13
++#define EF_R14 MIPS64_EF_R14
++#define EF_R15 MIPS64_EF_R15
++#define EF_R16 MIPS64_EF_R16
++#define EF_R17 MIPS64_EF_R17
++#define EF_R18 MIPS64_EF_R18
++#define EF_R19 MIPS64_EF_R19
++#define EF_R20 MIPS64_EF_R20
++#define EF_R21 MIPS64_EF_R21
++#define EF_R22 MIPS64_EF_R22
++#define EF_R23 MIPS64_EF_R23
++#define EF_R24 MIPS64_EF_R24
++#define EF_R25 MIPS64_EF_R25
++#define EF_R26 MIPS64_EF_R26
++#define EF_R27 MIPS64_EF_R27
++#define EF_R28 MIPS64_EF_R28
++#define EF_R29 MIPS64_EF_R29
++#define EF_R30 MIPS64_EF_R30
++#define EF_R31 MIPS64_EF_R31
++#define EF_LO MIPS64_EF_LO
++#define EF_HI MIPS64_EF_HI
++#define EF_CP0_EPC MIPS64_EF_CP0_EPC
++#define EF_CP0_BADVADDR MIPS64_EF_CP0_BADVADDR
++#define EF_CP0_STATUS MIPS64_EF_CP0_STATUS
++#define EF_CP0_CAUSE MIPS64_EF_CP0_CAUSE
++#define EF_SIZE MIPS64_EF_SIZE
+
+ #endif /* CONFIG_64BIT */
+
+diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
+index 17960fe7a8ce..cdf68b33bd65 100644
+--- a/arch/mips/include/asm/syscall.h
++++ b/arch/mips/include/asm/syscall.h
+@@ -131,10 +131,12 @@ static inline int syscall_get_arch(void)
+ {
+ int arch = EM_MIPS;
+ #ifdef CONFIG_64BIT
+- if (!test_thread_flag(TIF_32BIT_REGS))
++ if (!test_thread_flag(TIF_32BIT_REGS)) {
+ arch |= __AUDIT_ARCH_64BIT;
+- if (test_thread_flag(TIF_32BIT_ADDR))
+- arch |= __AUDIT_ARCH_CONVENTION_MIPS64_N32;
++ /* N32 sets only TIF_32BIT_ADDR */
++ if (test_thread_flag(TIF_32BIT_ADDR))
++ arch |= __AUDIT_ARCH_CONVENTION_MIPS64_N32;
++ }
+ #endif
+ #if defined(__LITTLE_ENDIAN)
+ arch |= __AUDIT_ARCH_LE;
+diff --git a/arch/mips/kernel/binfmt_elfo32.c b/arch/mips/kernel/binfmt_elfo32.c
+index 7faf5f2bee25..71df942fb77c 100644
+--- a/arch/mips/kernel/binfmt_elfo32.c
++++ b/arch/mips/kernel/binfmt_elfo32.c
+@@ -72,12 +72,6 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
+
+ #include <asm/processor.h>
+
+-/*
+- * When this file is selected, we are definitely running a 64bit kernel.
+- * So using the right regs define in asm/reg.h
+- */
+-#define WANT_COMPAT_REG_H
+-
+ /* These MUST be defined before elf.h gets included */
+ extern void elf32_core_copy_regs(elf_gregset_t grp, struct pt_regs *regs);
+ #define ELF_CORE_COPY_REGS(_dest, _regs) elf32_core_copy_regs(_dest, _regs);
+@@ -149,21 +143,21 @@ void elf32_core_copy_regs(elf_gregset_t grp, struct pt_regs *regs)
+ {
+ int i;
+
+- for (i = 0; i < EF_R0; i++)
++ for (i = 0; i < MIPS32_EF_R0; i++)
+ grp[i] = 0;
+- grp[EF_R0] = 0;
++ grp[MIPS32_EF_R0] = 0;
+ for (i = 1; i <= 31; i++)
+- grp[EF_R0 + i] = (elf_greg_t) regs->regs[i];
+- grp[EF_R26] = 0;
+- grp[EF_R27] = 0;
+- grp[EF_LO] = (elf_greg_t) regs->lo;
+- grp[EF_HI] = (elf_greg_t) regs->hi;
+- grp[EF_CP0_EPC] = (elf_greg_t) regs->cp0_epc;
+- grp[EF_CP0_BADVADDR] = (elf_greg_t) regs->cp0_badvaddr;
+- grp[EF_CP0_STATUS] = (elf_greg_t) regs->cp0_status;
+- grp[EF_CP0_CAUSE] = (elf_greg_t) regs->cp0_cause;
+-#ifdef EF_UNUSED0
+- grp[EF_UNUSED0] = 0;
++ grp[MIPS32_EF_R0 + i] = (elf_greg_t) regs->regs[i];
++ grp[MIPS32_EF_R26] = 0;
++ grp[MIPS32_EF_R27] = 0;
++ grp[MIPS32_EF_LO] = (elf_greg_t) regs->lo;
++ grp[MIPS32_EF_HI] = (elf_greg_t) regs->hi;
++ grp[MIPS32_EF_CP0_EPC] = (elf_greg_t) regs->cp0_epc;
++ grp[MIPS32_EF_CP0_BADVADDR] = (elf_greg_t) regs->cp0_badvaddr;
++ grp[MIPS32_EF_CP0_STATUS] = (elf_greg_t) regs->cp0_status;
++ grp[MIPS32_EF_CP0_CAUSE] = (elf_greg_t) regs->cp0_cause;
++#ifdef MIPS32_EF_UNUSED0
++ grp[MIPS32_EF_UNUSED0] = 0;
+ #endif
+ }
+
+diff --git a/arch/mips/kernel/cps-vec.S b/arch/mips/kernel/cps-vec.S
+index 6f4f739dad96..e6e97d2a5c9e 100644
+--- a/arch/mips/kernel/cps-vec.S
++++ b/arch/mips/kernel/cps-vec.S
+@@ -13,6 +13,7 @@
+ #include <asm/asm-offsets.h>
+ #include <asm/asmmacro.h>
+ #include <asm/cacheops.h>
++#include <asm/eva.h>
+ #include <asm/mipsregs.h>
+ #include <asm/mipsmtregs.h>
+ #include <asm/pm.h>
+@@ -166,6 +167,9 @@ dcache_done:
+ 1: jal mips_cps_core_init
+ nop
+
++ /* Do any EVA initialization if necessary */
++ eva_init
++
+ /*
+ * Boot any other VPEs within this core that should be online, and
+ * deactivate this VPE if it should be offline.
+diff --git a/arch/mips/kernel/irq-gic.c b/arch/mips/kernel/irq-gic.c
+index 88e4c323382c..d5e59b8f4863 100644
+--- a/arch/mips/kernel/irq-gic.c
++++ b/arch/mips/kernel/irq-gic.c
+@@ -269,11 +269,13 @@ static void __init gic_setup_intr(unsigned int intr, unsigned int cpu,
+
+ /* Setup Intr to Pin mapping */
+ if (pin & GIC_MAP_TO_NMI_MSK) {
++ int i;
++
+ GICWRITE(GIC_REG_ADDR(SHARED, GIC_SH_MAP_TO_PIN(intr)), pin);
+ /* FIXME: hack to route NMI to all cpu's */
+- for (cpu = 0; cpu < NR_CPUS; cpu += 32) {
++ for (i = 0; i < NR_CPUS; i += 32) {
+ GICWRITE(GIC_REG_ADDR(SHARED,
+- GIC_SH_MAP_TO_VPE_REG_OFF(intr, cpu)),
++ GIC_SH_MAP_TO_VPE_REG_OFF(intr, i)),
+ 0xffffffff);
+ }
+ } else {
+diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
+index f639ccd5060c..aae71198b515 100644
+--- a/arch/mips/kernel/ptrace.c
++++ b/arch/mips/kernel/ptrace.c
+@@ -129,7 +129,7 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
+ }
+
+ __put_user(child->thread.fpu.fcr31, data + 64);
+- __put_user(current_cpu_data.fpu_id, data + 65);
++ __put_user(boot_cpu_data.fpu_id, data + 65);
+
+ return 0;
+ }
+@@ -151,6 +151,7 @@ int ptrace_setfpregs(struct task_struct *child, __u32 __user *data)
+ }
+
+ __get_user(child->thread.fpu.fcr31, data + 64);
++ child->thread.fpu.fcr31 &= ~FPU_CSR_ALL_X;
+
+ /* FIR may not be written. */
+
+@@ -246,36 +247,160 @@ int ptrace_set_watch_regs(struct task_struct *child,
+
+ /* regset get/set implementations */
+
+-static int gpr_get(struct task_struct *target,
+- const struct user_regset *regset,
+- unsigned int pos, unsigned int count,
+- void *kbuf, void __user *ubuf)
++#if defined(CONFIG_32BIT) || defined(CONFIG_MIPS32_O32)
++
++static int gpr32_get(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ void *kbuf, void __user *ubuf)
+ {
+ struct pt_regs *regs = task_pt_regs(target);
++ u32 uregs[ELF_NGREG] = {};
++ unsigned i;
++
++ for (i = MIPS32_EF_R1; i <= MIPS32_EF_R31; i++) {
++ /* k0/k1 are copied as zero. */
++ if (i == MIPS32_EF_R26 || i == MIPS32_EF_R27)
++ continue;
++
++ uregs[i] = regs->regs[i - MIPS32_EF_R0];
++ }
+
+- return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+- regs, 0, sizeof(*regs));
++ uregs[MIPS32_EF_LO] = regs->lo;
++ uregs[MIPS32_EF_HI] = regs->hi;
++ uregs[MIPS32_EF_CP0_EPC] = regs->cp0_epc;
++ uregs[MIPS32_EF_CP0_BADVADDR] = regs->cp0_badvaddr;
++ uregs[MIPS32_EF_CP0_STATUS] = regs->cp0_status;
++ uregs[MIPS32_EF_CP0_CAUSE] = regs->cp0_cause;
++
++ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
+ }
+
+-static int gpr_set(struct task_struct *target,
+- const struct user_regset *regset,
+- unsigned int pos, unsigned int count,
+- const void *kbuf, const void __user *ubuf)
++static int gpr32_set(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ const void *kbuf, const void __user *ubuf)
+ {
+- struct pt_regs newregs;
+- int ret;
++ struct pt_regs *regs = task_pt_regs(target);
++ u32 uregs[ELF_NGREG];
++ unsigned start, num_regs, i;
++ int err;
++
++ start = pos / sizeof(u32);
++ num_regs = count / sizeof(u32);
++
++ if (start + num_regs > ELF_NGREG)
++ return -EIO;
++
++ err = user_regset_copyin(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
++ if (err)
++ return err;
++
++ for (i = start; i < num_regs; i++) {
++ /*
++ * Cast all values to signed here so that if this is a 64-bit
++ * kernel, the supplied 32-bit values will be sign extended.
++ */
++ switch (i) {
++ case MIPS32_EF_R1 ... MIPS32_EF_R25:
++ /* k0/k1 are ignored. */
++ case MIPS32_EF_R28 ... MIPS32_EF_R31:
++ regs->regs[i - MIPS32_EF_R0] = (s32)uregs[i];
++ break;
++ case MIPS32_EF_LO:
++ regs->lo = (s32)uregs[i];
++ break;
++ case MIPS32_EF_HI:
++ regs->hi = (s32)uregs[i];
++ break;
++ case MIPS32_EF_CP0_EPC:
++ regs->cp0_epc = (s32)uregs[i];
++ break;
++ }
++ }
++
++ return 0;
++}
++
++#endif /* CONFIG_32BIT || CONFIG_MIPS32_O32 */
++
++#ifdef CONFIG_64BIT
++
++static int gpr64_get(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ void *kbuf, void __user *ubuf)
++{
++ struct pt_regs *regs = task_pt_regs(target);
++ u64 uregs[ELF_NGREG] = {};
++ unsigned i;
++
++ for (i = MIPS64_EF_R1; i <= MIPS64_EF_R31; i++) {
++ /* k0/k1 are copied as zero. */
++ if (i == MIPS64_EF_R26 || i == MIPS64_EF_R27)
++ continue;
++
++ uregs[i] = regs->regs[i - MIPS64_EF_R0];
++ }
++
++ uregs[MIPS64_EF_LO] = regs->lo;
++ uregs[MIPS64_EF_HI] = regs->hi;
++ uregs[MIPS64_EF_CP0_EPC] = regs->cp0_epc;
++ uregs[MIPS64_EF_CP0_BADVADDR] = regs->cp0_badvaddr;
++ uregs[MIPS64_EF_CP0_STATUS] = regs->cp0_status;
++ uregs[MIPS64_EF_CP0_CAUSE] = regs->cp0_cause;
++
++ return user_regset_copyout(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
++}
+
+- ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
+- &newregs,
+- 0, sizeof(newregs));
+- if (ret)
+- return ret;
++static int gpr64_set(struct task_struct *target,
++ const struct user_regset *regset,
++ unsigned int pos, unsigned int count,
++ const void *kbuf, const void __user *ubuf)
++{
++ struct pt_regs *regs = task_pt_regs(target);
++ u64 uregs[ELF_NGREG];
++ unsigned start, num_regs, i;
++ int err;
++
++ start = pos / sizeof(u64);
++ num_regs = count / sizeof(u64);
+
+- *task_pt_regs(target) = newregs;
++ if (start + num_regs > ELF_NGREG)
++ return -EIO;
++
++ err = user_regset_copyin(&pos, &count, &kbuf, &ubuf, uregs, 0,
++ sizeof(uregs));
++ if (err)
++ return err;
++
++ for (i = start; i < num_regs; i++) {
++ switch (i) {
++ case MIPS64_EF_R1 ... MIPS64_EF_R25:
++ /* k0/k1 are ignored. */
++ case MIPS64_EF_R28 ... MIPS64_EF_R31:
++ regs->regs[i - MIPS64_EF_R0] = uregs[i];
++ break;
++ case MIPS64_EF_LO:
++ regs->lo = uregs[i];
++ break;
++ case MIPS64_EF_HI:
++ regs->hi = uregs[i];
++ break;
++ case MIPS64_EF_CP0_EPC:
++ regs->cp0_epc = uregs[i];
++ break;
++ }
++ }
+
+ return 0;
+ }
+
++#endif /* CONFIG_64BIT */
++
+ static int fpr_get(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+@@ -337,14 +462,16 @@ enum mips_regset {
+ REGSET_FPR,
+ };
+
++#if defined(CONFIG_32BIT) || defined(CONFIG_MIPS32_O32)
++
+ static const struct user_regset mips_regsets[] = {
+ [REGSET_GPR] = {
+ .core_note_type = NT_PRSTATUS,
+ .n = ELF_NGREG,
+ .size = sizeof(unsigned int),
+ .align = sizeof(unsigned int),
+- .get = gpr_get,
+- .set = gpr_set,
++ .get = gpr32_get,
++ .set = gpr32_set,
+ },
+ [REGSET_FPR] = {
+ .core_note_type = NT_PRFPREG,
+@@ -364,14 +491,18 @@ static const struct user_regset_view user_mips_view = {
+ .n = ARRAY_SIZE(mips_regsets),
+ };
+
++#endif /* CONFIG_32BIT || CONFIG_MIPS32_O32 */
++
++#ifdef CONFIG_64BIT
++
+ static const struct user_regset mips64_regsets[] = {
+ [REGSET_GPR] = {
+ .core_note_type = NT_PRSTATUS,
+ .n = ELF_NGREG,
+ .size = sizeof(unsigned long),
+ .align = sizeof(unsigned long),
+- .get = gpr_get,
+- .set = gpr_set,
++ .get = gpr64_get,
++ .set = gpr64_set,
+ },
+ [REGSET_FPR] = {
+ .core_note_type = NT_PRFPREG,
+@@ -384,25 +515,26 @@ static const struct user_regset mips64_regsets[] = {
+ };
+
+ static const struct user_regset_view user_mips64_view = {
+- .name = "mips",
++ .name = "mips64",
+ .e_machine = ELF_ARCH,
+ .ei_osabi = ELF_OSABI,
+ .regsets = mips64_regsets,
+- .n = ARRAY_SIZE(mips_regsets),
++ .n = ARRAY_SIZE(mips64_regsets),
+ };
+
++#endif /* CONFIG_64BIT */
++
+ const struct user_regset_view *task_user_regset_view(struct task_struct *task)
+ {
+ #ifdef CONFIG_32BIT
+ return &user_mips_view;
+-#endif
+-
++#else
+ #ifdef CONFIG_MIPS32_O32
+- if (test_thread_flag(TIF_32BIT_REGS))
+- return &user_mips_view;
++ if (test_tsk_thread_flag(task, TIF_32BIT_REGS))
++ return &user_mips_view;
+ #endif
+-
+ return &user_mips64_view;
++#endif
+ }
+
+ long arch_ptrace(struct task_struct *child, long request,
+@@ -480,7 +612,7 @@ long arch_ptrace(struct task_struct *child, long request,
+ break;
+ case FPC_EIR:
+ /* implementation / version register */
+- tmp = current_cpu_data.fpu_id;
++ tmp = boot_cpu_data.fpu_id;
+ break;
+ case DSP_BASE ... DSP_BASE + 5: {
+ dspreg_t *dregs;
+@@ -565,7 +697,7 @@ long arch_ptrace(struct task_struct *child, long request,
+ break;
+ #endif
+ case FPC_CSR:
+- child->thread.fpu.fcr31 = data;
++ child->thread.fpu.fcr31 = data & ~FPU_CSR_ALL_X;
+ break;
+ case DSP_BASE ... DSP_BASE + 5: {
+ dspreg_t *dregs;
+diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
+index b40c3ca60ee5..a83fb730b387 100644
+--- a/arch/mips/kernel/ptrace32.c
++++ b/arch/mips/kernel/ptrace32.c
+@@ -129,7 +129,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
+ break;
+ case FPC_EIR:
+ /* implementation / version register */
+- tmp = current_cpu_data.fpu_id;
++ tmp = boot_cpu_data.fpu_id;
+ break;
+ case DSP_BASE ... DSP_BASE + 5: {
+ dspreg_t *dregs;
+diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
+index f1343ccd7ed7..7f5feb25ae04 100644
+--- a/arch/mips/kernel/scall64-o32.S
++++ b/arch/mips/kernel/scall64-o32.S
+@@ -113,15 +113,19 @@ trace_a_syscall:
+ move s0, t2 # Save syscall pointer
+ move a0, sp
+ /*
+- * syscall number is in v0 unless we called syscall(__NR_###)
++ * absolute syscall number is in v0 unless we called syscall(__NR_###)
+ * where the real syscall number is in a0
+ * note: NR_syscall is the first O32 syscall but the macro is
+ * only defined when compiling with -mabi=32 (CONFIG_32BIT)
+ * therefore __NR_O32_Linux is used (4000)
+ */
+- addiu a1, v0, __NR_O32_Linux
+- bnez v0, 1f /* __NR_syscall at offset 0 */
+- lw a1, PT_R4(sp)
++ .set push
++ .set reorder
++ subu t1, v0, __NR_O32_Linux
++ move a1, v0
++ bnez t1, 1f /* __NR_syscall at offset 0 */
++ lw a1, PT_R4(sp) /* Arg1 for __NR_syscall case */
++ .set pop
+
+ 1: jal syscall_trace_enter
+
+diff --git a/arch/mips/kernel/smp-mt.c b/arch/mips/kernel/smp-mt.c
+index 3babf6e4f894..21f23add04f4 100644
+--- a/arch/mips/kernel/smp-mt.c
++++ b/arch/mips/kernel/smp-mt.c
+@@ -288,6 +288,7 @@ struct plat_smp_ops vsmp_smp_ops = {
+ .prepare_cpus = vsmp_prepare_cpus,
+ };
+
++#ifdef CONFIG_PROC_FS
+ static int proc_cpuinfo_chain_call(struct notifier_block *nfb,
+ unsigned long action_unused, void *data)
+ {
+@@ -309,3 +310,4 @@ static int __init proc_cpuinfo_notifier_init(void)
+ }
+
+ subsys_initcall(proc_cpuinfo_notifier_init);
++#endif
+diff --git a/arch/mips/kernel/unaligned.c b/arch/mips/kernel/unaligned.c
+index 2b3517214d6d..e11906dff885 100644
+--- a/arch/mips/kernel/unaligned.c
++++ b/arch/mips/kernel/unaligned.c
+@@ -690,7 +690,6 @@ static void emulate_load_store_insn(struct pt_regs *regs,
+ case sdc1_op:
+ die_if_kernel("Unaligned FP access in kernel code", regs);
+ BUG_ON(!used_math());
+- BUG_ON(!is_fpu_owner());
+
+ lose_fpu(1); /* Save FPU state for the emulator. */
+ res = fpu_emulator_cop1Handler(regs, ¤t->thread.fpu, 1,
+diff --git a/arch/mips/mm/tlbex.c b/arch/mips/mm/tlbex.c
+index e80e10bafc83..343fe0f559b1 100644
+--- a/arch/mips/mm/tlbex.c
++++ b/arch/mips/mm/tlbex.c
+@@ -1299,6 +1299,7 @@ static void build_r4000_tlb_refill_handler(void)
+ }
+ #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
+ uasm_l_tlb_huge_update(&l, p);
++ UASM_i_LW(&p, K0, 0, K1);
+ build_huge_update_entries(&p, htlb_info.huge_pte, K1);
+ build_huge_tlb_write_entry(&p, &l, &r, K0, tlb_random,
+ htlb_info.restore_scratch);
+diff --git a/arch/mips/mti-malta/malta-memory.c b/arch/mips/mti-malta/malta-memory.c
+index 6d9773096750..fdffc806664f 100644
+--- a/arch/mips/mti-malta/malta-memory.c
++++ b/arch/mips/mti-malta/malta-memory.c
+@@ -34,13 +34,19 @@ fw_memblock_t * __init fw_getmdesc(int eva)
+ /* otherwise look in the environment */
+
+ memsize_str = fw_getenv("memsize");
+- if (memsize_str)
+- tmp = kstrtol(memsize_str, 0, &memsize);
++ if (memsize_str) {
++ tmp = kstrtoul(memsize_str, 0, &memsize);
++ if (tmp)
++ pr_warn("Failed to read the 'memsize' env variable.\n");
++ }
+ if (eva) {
+ /* Look for ememsize for EVA */
+ ememsize_str = fw_getenv("ememsize");
+- if (ememsize_str)
+- tmp = kstrtol(ememsize_str, 0, &ememsize);
++ if (ememsize_str) {
++ tmp = kstrtoul(ememsize_str, 0, &ememsize);
++ if (tmp)
++ pr_warn("Failed to read the 'ememsize' env variable.\n");
++ }
+ }
+ if (!memsize && !ememsize) {
+ pr_warn("memsize not set in YAMON, set to default (32Mb)\n");
+diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
+index f92b0b54e921..8dcb721d03d8 100644
+--- a/arch/powerpc/include/asm/machdep.h
++++ b/arch/powerpc/include/asm/machdep.h
+@@ -57,10 +57,10 @@ struct machdep_calls {
+ void (*hpte_removebolted)(unsigned long ea,
+ int psize, int ssize);
+ void (*flush_hash_range)(unsigned long number, int local);
+- void (*hugepage_invalidate)(struct mm_struct *mm,
++ void (*hugepage_invalidate)(unsigned long vsid,
++ unsigned long addr,
+ unsigned char *hpte_slot_array,
+- unsigned long addr, int psize);
+-
++ int psize, int ssize);
+ /* special for kexec, to be called in real mode, linear mapping is
+ * destroyed as well */
+ void (*hpte_clear_all)(void);
+diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
+index eb9261024f51..7b3d54fae46f 100644
+--- a/arch/powerpc/include/asm/pgtable-ppc64.h
++++ b/arch/powerpc/include/asm/pgtable-ppc64.h
+@@ -413,7 +413,7 @@ static inline char *get_hpte_slot_array(pmd_t *pmdp)
+ }
+
+ extern void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+- pmd_t *pmdp);
++ pmd_t *pmdp, unsigned long old_pmd);
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+ extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
+diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
+index d836d945068d..9ecede1e124c 100644
+--- a/arch/powerpc/include/asm/pte-hash64-64k.h
++++ b/arch/powerpc/include/asm/pte-hash64-64k.h
+@@ -46,11 +46,31 @@
+ * in order to deal with 64K made of 4K HW pages. Thus we override the
+ * generic accessors and iterators here
+ */
+-#define __real_pte(e,p) ((real_pte_t) { \
+- (e), (pte_val(e) & _PAGE_COMBO) ? \
+- (pte_val(*((p) + PTRS_PER_PTE))) : 0 })
+-#define __rpte_to_hidx(r,index) ((pte_val((r).pte) & _PAGE_COMBO) ? \
+- (((r).hidx >> ((index)<<2)) & 0xf) : ((pte_val((r).pte) >> 12) & 0xf))
++#define __real_pte __real_pte
++static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
++{
++ real_pte_t rpte;
++
++ rpte.pte = pte;
++ rpte.hidx = 0;
++ if (pte_val(pte) & _PAGE_COMBO) {
++ /*
++ * Make sure we order the hidx load against the _PAGE_COMBO
++ * check. The store side ordering is done in __hash_page_4K
++ */
++ smp_rmb();
++ rpte.hidx = pte_val(*((ptep) + PTRS_PER_PTE));
++ }
++ return rpte;
++}
++
++static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
++{
++ if ((pte_val(rpte.pte) & _PAGE_COMBO))
++ return (rpte.hidx >> (index<<2)) & 0xf;
++ return (pte_val(rpte.pte) >> 12) & 0xf;
++}
++
+ #define __rpte_to_pte(r) ((r).pte)
+ #define __rpte_sub_valid(rpte, index) \
+ (pte_val(rpte.pte) & (_PAGE_HPTE_SUB0 >> (index)))
+diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
+index 88e3ec6e1d96..48fb2c18fa81 100644
+--- a/arch/powerpc/kernel/iommu.c
++++ b/arch/powerpc/kernel/iommu.c
+@@ -1120,37 +1120,41 @@ EXPORT_SYMBOL_GPL(iommu_release_ownership);
+ int iommu_add_device(struct device *dev)
+ {
+ struct iommu_table *tbl;
+- int ret = 0;
+
+- if (WARN_ON(dev->iommu_group)) {
+- pr_warn("iommu_tce: device %s is already in iommu group %d, skipping\n",
+- dev_name(dev),
+- iommu_group_id(dev->iommu_group));
++ /*
++ * The sysfs entries should be populated before
++ * binding IOMMU group. If sysfs entries isn't
++ * ready, we simply bail.
++ */
++ if (!device_is_registered(dev))
++ return -ENOENT;
++
++ if (dev->iommu_group) {
++ pr_debug("%s: Skipping device %s with iommu group %d\n",
++ __func__, dev_name(dev),
++ iommu_group_id(dev->iommu_group));
+ return -EBUSY;
+ }
+
+ tbl = get_iommu_table_base(dev);
+ if (!tbl || !tbl->it_group) {
+- pr_debug("iommu_tce: skipping device %s with no tbl\n",
+- dev_name(dev));
++ pr_debug("%s: Skipping device %s with no tbl\n",
++ __func__, dev_name(dev));
+ return 0;
+ }
+
+- pr_debug("iommu_tce: adding %s to iommu group %d\n",
+- dev_name(dev), iommu_group_id(tbl->it_group));
++ pr_debug("%s: Adding %s to iommu group %d\n",
++ __func__, dev_name(dev),
++ iommu_group_id(tbl->it_group));
+
+ if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) {
+- pr_err("iommu_tce: unsupported iommu page size.");
+- pr_err("%s has not been added\n", dev_name(dev));
++ pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n",
++ __func__, IOMMU_PAGE_SIZE(tbl),
++ PAGE_SIZE, dev_name(dev));
+ return -EINVAL;
+ }
+
+- ret = iommu_group_add_device(tbl->it_group, dev);
+- if (ret < 0)
+- pr_err("iommu_tce: %s has not been added, ret=%d\n",
+- dev_name(dev), ret);
+-
+- return ret;
++ return iommu_group_add_device(tbl->it_group, dev);
+ }
+ EXPORT_SYMBOL_GPL(iommu_add_device);
+
+diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
+index cf1d325eae8b..afc0a8295f84 100644
+--- a/arch/powerpc/mm/hash_native_64.c
++++ b/arch/powerpc/mm/hash_native_64.c
+@@ -412,18 +412,18 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
+ local_irq_restore(flags);
+ }
+
+-static void native_hugepage_invalidate(struct mm_struct *mm,
++static void native_hugepage_invalidate(unsigned long vsid,
++ unsigned long addr,
+ unsigned char *hpte_slot_array,
+- unsigned long addr, int psize)
++ int psize, int ssize)
+ {
+- int ssize = 0, i;
+- int lock_tlbie;
++ int i;
+ struct hash_pte *hptep;
+ int actual_psize = MMU_PAGE_16M;
+ unsigned int max_hpte_count, valid;
+ unsigned long flags, s_addr = addr;
+ unsigned long hpte_v, want_v, shift;
+- unsigned long hidx, vpn = 0, vsid, hash, slot;
++ unsigned long hidx, vpn = 0, hash, slot;
+
+ shift = mmu_psize_defs[psize].shift;
+ max_hpte_count = 1U << (PMD_SHIFT - shift);
+@@ -437,15 +437,6 @@ static void native_hugepage_invalidate(struct mm_struct *mm,
+
+ /* get the vpn */
+ addr = s_addr + (i * (1ul << shift));
+- if (!is_kernel_addr(addr)) {
+- ssize = user_segment_size(addr);
+- vsid = get_vsid(mm->context.id, addr, ssize);
+- WARN_ON(vsid == 0);
+- } else {
+- vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+- ssize = mmu_kernel_ssize;
+- }
+-
+ vpn = hpt_vpn(addr, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ if (hidx & _PTEIDX_SECONDARY)
+@@ -465,22 +456,13 @@ static void native_hugepage_invalidate(struct mm_struct *mm,
+ else
+ /* Invalidate the hpte. NOTE: this also unlocks it */
+ hptep->v = 0;
++ /*
++ * We need to do a TLB invalidate for each address; the
++ * tlbie instruction compares the entry's VA in the TLB with
++ * the VA specified here.
++ */
++ tlbie(vpn, psize, actual_psize, ssize, 0);
+ }
+- /*
+- * Since this is a hugepage, we just need a single tlbie.
+- * use the last vpn.
+- */
+- lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
+- if (lock_tlbie)
+- raw_spin_lock(&native_tlbie_lock);
+-
+- asm volatile("ptesync":::"memory");
+- __tlbie(vpn, psize, actual_psize, ssize);
+- asm volatile("eieio; tlbsync; ptesync":::"memory");
+-
+- if (lock_tlbie)
+- raw_spin_unlock(&native_tlbie_lock);
+-
+ local_irq_restore(flags);
+ }
+
+diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
+index 826893fcb3a7..5f5e6328c21c 100644
+--- a/arch/powerpc/mm/hugepage-hash64.c
++++ b/arch/powerpc/mm/hugepage-hash64.c
+@@ -18,6 +18,57 @@
+ #include <linux/mm.h>
+ #include <asm/machdep.h>
+
++static void invalidate_old_hpte(unsigned long vsid, unsigned long addr,
++ pmd_t *pmdp, unsigned int psize, int ssize)
++{
++ int i, max_hpte_count, valid;
++ unsigned long s_addr;
++ unsigned char *hpte_slot_array;
++ unsigned long hidx, shift, vpn, hash, slot;
++
++ s_addr = addr & HPAGE_PMD_MASK;
++ hpte_slot_array = get_hpte_slot_array(pmdp);
++ /*
++ * If we try to do a huge PTE update after a withdraw is done,
++ * we will find hpte_slot_array NULL below. This happens when
++ * we do split_huge_page_pmd.
++ */
++ if (!hpte_slot_array)
++ return;
++
++ if (ppc_md.hugepage_invalidate)
++ return ppc_md.hugepage_invalidate(vsid, s_addr, hpte_slot_array,
++ psize, ssize);
++ /*
++ * No bulk hpte removal support, invalidate each entry
++ */
++ shift = mmu_psize_defs[psize].shift;
++ max_hpte_count = HPAGE_PMD_SIZE >> shift;
++ for (i = 0; i < max_hpte_count; i++) {
++ /*
++ * 8 bits per each hpte entries
++ * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
++ */
++ valid = hpte_valid(hpte_slot_array, i);
++ if (!valid)
++ continue;
++ hidx = hpte_hash_index(hpte_slot_array, i);
++
++ /* get the vpn */
++ addr = s_addr + (i * (1ul << shift));
++ vpn = hpt_vpn(addr, vsid, ssize);
++ hash = hpt_hash(vpn, shift, ssize);
++ if (hidx & _PTEIDX_SECONDARY)
++ hash = ~hash;
++
++ slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
++ slot += hidx & _PTEIDX_GROUP_IX;
++ ppc_md.hpte_invalidate(slot, vpn, psize,
++ MMU_PAGE_16M, ssize, 0);
++ }
++}
++
++
+ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ pmd_t *pmdp, unsigned long trap, int local, int ssize,
+ unsigned int psize)
+@@ -33,7 +84,9 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ * atomically mark the linux large page PMD busy and dirty
+ */
+ do {
+- old_pmd = pmd_val(*pmdp);
++ pmd_t pmd = ACCESS_ONCE(*pmdp);
++
++ old_pmd = pmd_val(pmd);
+ /* If PMD busy, retry the access */
+ if (unlikely(old_pmd & _PAGE_BUSY))
+ return 0;
+@@ -85,6 +138,15 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ vpn = hpt_vpn(ea, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ hpte_slot_array = get_hpte_slot_array(pmdp);
++ if (psize == MMU_PAGE_4K) {
++ /*
++ * invalidate the old hpte entry if we have that mapped via 64K
++ * base page size. This is because demote_segment won't flush
++ * hash page table entries.
++ */
++ if ((old_pmd & _PAGE_HASHPTE) && !(old_pmd & _PAGE_COMBO))
++ invalidate_old_hpte(vsid, ea, pmdp, MMU_PAGE_64K, ssize);
++ }
+
+ valid = hpte_valid(hpte_slot_array, index);
+ if (valid) {
+@@ -107,11 +169,8 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+ * safely update this here.
+ */
+ valid = 0;
+- new_pmd &= ~_PAGE_HPTEFLAGS;
+ hpte_slot_array[index] = 0;
+- } else
+- /* clear the busy bits and set the hash pte bits */
+- new_pmd = (new_pmd & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
++ }
+ }
+
+ if (!valid) {
+@@ -119,11 +178,7 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+
+ /* insert new entry */
+ pa = pmd_pfn(__pmd(old_pmd)) << PAGE_SHIFT;
+-repeat:
+- hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+-
+- /* clear the busy bits and set the hash pte bits */
+- new_pmd = (new_pmd & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
++ new_pmd |= _PAGE_HASHPTE;
+
+ /* Add in WIMG bits */
+ rflags |= (new_pmd & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+@@ -132,6 +187,8 @@ repeat:
+ * enable the memory coherence always
+ */
+ rflags |= HPTE_R_M;
++repeat:
++ hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+
+ /* Insert into the hash table, primary slot */
+ slot = ppc_md.hpte_insert(hpte_group, vpn, pa, rflags, 0,
+@@ -172,8 +229,17 @@ repeat:
+ mark_hpte_slot_valid(hpte_slot_array, index, slot);
+ }
+ /*
+- * No need to use ldarx/stdcx here
++ * Mark the pte with _PAGE_COMBO, if we are trying to hash it with
++ * base page size 4k.
++ */
++ if (psize == MMU_PAGE_4K)
++ new_pmd |= _PAGE_COMBO;
++ /*
++ * The hpte valid is stored in the pgtable whose address is in the
++ * second half of the PMD. Order this against clearing of the busy bit in
++ * huge pmd.
+ */
++ smp_wmb();
+ *pmdp = __pmd(new_pmd & ~_PAGE_BUSY);
+ return 0;
+ }
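The `__hash_page_thp` hunk above replaces a plain dereference of `*pmdp` with an `ACCESS_ONCE` snapshot inside a retry loop, so the busy check and the later compare-and-swap operate on the same value. A minimal userspace sketch of that read-once/CAS pattern, using C11 atomics and a hypothetical `PAGE_BUSY` bit (names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PAGE_BUSY 0x1u  /* hypothetical busy bit, standing in for _PAGE_BUSY */

/*
 * Snapshot the word once (the ACCESS_ONCE step), bail out if it is
 * already busy, otherwise try to set the busy bit atomically and
 * retry on contention -- mirroring the do/while loop in the patch.
 */
static bool mark_busy(atomic_uint *pmdp, unsigned int *snapshot)
{
    unsigned int old;

    do {
        old = atomic_load(pmdp);   /* single read; reused for check and CAS */
        if (old & PAGE_BUSY)
            return false;          /* someone else owns it: caller retries */
    } while (!atomic_compare_exchange_weak(pmdp, &old, old | PAGE_BUSY));

    *snapshot = old;               /* stable pre-update value */
    return true;
}
```

The point of the fix is that reading `*pmdp` twice (once for the busy test, once for the CAS) could observe two different values; the single snapshot removes that window.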
+diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
+index 3b181b22cd46..d3e9a78eaed3 100644
+--- a/arch/powerpc/mm/numa.c
++++ b/arch/powerpc/mm/numa.c
+@@ -611,8 +611,8 @@ static int cpu_numa_callback(struct notifier_block *nfb, unsigned long action,
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ unmap_cpu_from_node(lcpu);
+- break;
+ ret = NOTIFY_OK;
++ break;
+ #endif
+ }
+ return ret;
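The one-line numa.c swap above fixes a classic dead-code bug: `ret = NOTIFY_OK` sat *after* the `break`, so the case always fell out with the default return value. A tiny sketch of the corrected ordering (constants and the callback are illustrative stand-ins for the kernel notifier):

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel notifier return codes. */
#define NOTIFY_DONE 0x0000
#define NOTIFY_OK   0x0001

/*
 * Any statement placed after `break` inside a switch case is
 * unreachable, so the status must be assigned before breaking --
 * exactly what the numa.c hunk does.
 */
static int cpu_callback(int action)
{
    int ret = NOTIFY_DONE;

    switch (action) {
    case 1: /* e.g. CPU_UP_CANCELED */
        /* unmap_cpu_from_node(lcpu); */
        ret = NOTIFY_OK;  /* must come before the break */
        break;
    }
    return ret;
}
```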
+diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
+index f6ce1f111f5b..71d084b6f766 100644
+--- a/arch/powerpc/mm/pgtable_64.c
++++ b/arch/powerpc/mm/pgtable_64.c
+@@ -538,7 +538,7 @@ unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
+ *pmdp = __pmd((old & ~clr) | set);
+ #endif
+ if (old & _PAGE_HASHPTE)
+- hpte_do_hugepage_flush(mm, addr, pmdp);
++ hpte_do_hugepage_flush(mm, addr, pmdp, old);
+ return old;
+ }
+
+@@ -645,7 +645,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma,
+ if (!(old & _PAGE_SPLITTING)) {
+ /* We need to flush the hpte */
+ if (old & _PAGE_HASHPTE)
+- hpte_do_hugepage_flush(vma->vm_mm, address, pmdp);
++ hpte_do_hugepage_flush(vma->vm_mm, address, pmdp, old);
+ }
+ /*
+ * This ensures that generic code that rely on IRQ disabling
+@@ -723,7 +723,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ * neesd to be flushed.
+ */
+ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+- pmd_t *pmdp)
++ pmd_t *pmdp, unsigned long old_pmd)
+ {
+ int ssize, i;
+ unsigned long s_addr;
+@@ -745,12 +745,29 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+ if (!hpte_slot_array)
+ return;
+
+- /* get the base page size */
++ /* get the base page size, vsid and segment size */
++#ifdef CONFIG_DEBUG_VM
+ psize = get_slice_psize(mm, s_addr);
++ BUG_ON(psize == MMU_PAGE_16M);
++#endif
++ if (old_pmd & _PAGE_COMBO)
++ psize = MMU_PAGE_4K;
++ else
++ psize = MMU_PAGE_64K;
++
++ if (!is_kernel_addr(s_addr)) {
++ ssize = user_segment_size(s_addr);
++ vsid = get_vsid(mm->context.id, s_addr, ssize);
++ WARN_ON(vsid == 0);
++ } else {
++ vsid = get_kernel_vsid(s_addr, mmu_kernel_ssize);
++ ssize = mmu_kernel_ssize;
++ }
+
+ if (ppc_md.hugepage_invalidate)
+- return ppc_md.hugepage_invalidate(mm, hpte_slot_array,
+- s_addr, psize);
++ return ppc_md.hugepage_invalidate(vsid, s_addr,
++ hpte_slot_array,
++ psize, ssize);
+ /*
+ * No bluk hpte removal support, invalidate each entry
+ */
+@@ -768,15 +785,6 @@ void hpte_do_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+
+ /* get the vpn */
+ addr = s_addr + (i * (1ul << shift));
+- if (!is_kernel_addr(addr)) {
+- ssize = user_segment_size(addr);
+- vsid = get_vsid(mm->context.id, addr, ssize);
+- WARN_ON(vsid == 0);
+- } else {
+- vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+- ssize = mmu_kernel_ssize;
+- }
+-
+ vpn = hpt_vpn(addr, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ if (hidx & _PTEIDX_SECONDARY)
+diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
+index c99f6510a0b2..9adda5790463 100644
+--- a/arch/powerpc/mm/tlb_hash64.c
++++ b/arch/powerpc/mm/tlb_hash64.c
+@@ -216,7 +216,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
+ if (!(pte & _PAGE_HASHPTE))
+ continue;
+ if (unlikely(hugepage_shift && pmd_trans_huge(*(pmd_t *)pte)))
+- hpte_do_hugepage_flush(mm, start, (pmd_t *)pte);
++ hpte_do_hugepage_flush(mm, start, (pmd_t *)ptep, pte);
+ else
+ hpte_need_flush(mm, start, ptep, pte, 0);
+ }
+diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
+index 3136ae2f75af..dc30aa5a2ce8 100644
+--- a/arch/powerpc/platforms/powernv/pci-ioda.c
++++ b/arch/powerpc/platforms/powernv/pci-ioda.c
+@@ -462,7 +462,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev
+
+ pe = &phb->ioda.pe_array[pdn->pe_number];
+ WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
+- set_iommu_table_base(&pdev->dev, &pe->tce32_table);
++ set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+ }
+
+ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
+diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
+index 7995135170a3..24abc5c223c7 100644
+--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
++++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
+@@ -146,7 +146,7 @@ static inline int pseries_remove_memblock(unsigned long base,
+ }
+ static inline int pseries_remove_mem_node(struct device_node *np)
+ {
+- return -EOPNOTSUPP;
++ return 0;
+ }
+ #endif /* CONFIG_MEMORY_HOTREMOVE */
+
+diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
+index 33b552ffbe57..4642d6a4d356 100644
+--- a/arch/powerpc/platforms/pseries/iommu.c
++++ b/arch/powerpc/platforms/pseries/iommu.c
+@@ -721,13 +721,13 @@ static int __init disable_ddw_setup(char *str)
+
+ early_param("disable_ddw", disable_ddw_setup);
+
+-static void remove_ddw(struct device_node *np)
++static void remove_ddw(struct device_node *np, bool remove_prop)
+ {
+ struct dynamic_dma_window_prop *dwp;
+ struct property *win64;
+ const u32 *ddw_avail;
+ u64 liobn;
+- int len, ret;
++ int len, ret = 0;
+
+ ddw_avail = of_get_property(np, "ibm,ddw-applicable", &len);
+ win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
+@@ -761,7 +761,8 @@ static void remove_ddw(struct device_node *np)
+ np->full_name, ret, ddw_avail[2], liobn);
+
+ delprop:
+- ret = of_remove_property(np, win64);
++ if (remove_prop)
++ ret = of_remove_property(np, win64);
+ if (ret)
+ pr_warning("%s: failed to remove direct window property: %d\n",
+ np->full_name, ret);
+@@ -805,7 +806,7 @@ static int find_existing_ddw_windows(void)
+ window = kzalloc(sizeof(*window), GFP_KERNEL);
+ if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
+ kfree(window);
+- remove_ddw(pdn);
++ remove_ddw(pdn, true);
+ continue;
+ }
+
+@@ -1045,7 +1046,7 @@ out_free_window:
+ kfree(window);
+
+ out_clear_window:
+- remove_ddw(pdn);
++ remove_ddw(pdn, true);
+
+ out_free_prop:
+ kfree(win64->name);
+@@ -1255,7 +1256,14 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti
+
+ switch (action) {
+ case OF_RECONFIG_DETACH_NODE:
+- remove_ddw(np);
++ /*
++ * Removing the property will invoke the reconfig
++ * notifier again, which causes a deadlock on the
++ * read-write semaphore of the notifier chain. So
++ * we have to remove the property when releasing
++ * the device node.
++ */
++ remove_ddw(np, false);
+ if (pci && pci->iommu_table)
+ iommu_free_table(pci->iommu_table, np->full_name);
+
+diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
+index b02af9ef3ff6..ccf6f162f69c 100644
+--- a/arch/powerpc/platforms/pseries/lpar.c
++++ b/arch/powerpc/platforms/pseries/lpar.c
+@@ -430,16 +430,17 @@ static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
+ spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
+ }
+
+-static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
+- unsigned char *hpte_slot_array,
+- unsigned long addr, int psize)
++static void pSeries_lpar_hugepage_invalidate(unsigned long vsid,
++ unsigned long addr,
++ unsigned char *hpte_slot_array,
++ int psize, int ssize)
+ {
+- int ssize = 0, i, index = 0;
++ int i, index = 0;
+ unsigned long s_addr = addr;
+ unsigned int max_hpte_count, valid;
+ unsigned long vpn_array[PPC64_HUGE_HPTE_BATCH];
+ unsigned long slot_array[PPC64_HUGE_HPTE_BATCH];
+- unsigned long shift, hidx, vpn = 0, vsid, hash, slot;
++ unsigned long shift, hidx, vpn = 0, hash, slot;
+
+ shift = mmu_psize_defs[psize].shift;
+ max_hpte_count = 1U << (PMD_SHIFT - shift);
+@@ -452,15 +453,6 @@ static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
+
+ /* get the vpn */
+ addr = s_addr + (i * (1ul << shift));
+- if (!is_kernel_addr(addr)) {
+- ssize = user_segment_size(addr);
+- vsid = get_vsid(mm->context.id, addr, ssize);
+- WARN_ON(vsid == 0);
+- } else {
+- vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+- ssize = mmu_kernel_ssize;
+- }
+-
+ vpn = hpt_vpn(addr, vsid, ssize);
+ hash = hpt_hash(vpn, shift, ssize);
+ if (hidx & _PTEIDX_SECONDARY)
+diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
+index bb63499fc5d3..9f00f9301613 100644
+--- a/arch/s390/Kconfig
++++ b/arch/s390/Kconfig
+@@ -92,6 +92,7 @@ config S390
+ select ARCH_INLINE_WRITE_UNLOCK_IRQ
+ select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
+ select ARCH_SAVE_PAGE_KEYS if HIBERNATION
++ select ARCH_SUPPORTS_ATOMIC_RMW
+ select ARCH_USE_CMPXCHG_LOCKREF
+ select ARCH_WANT_IPC_PARSE_VERSION
+ select BUILDTIME_EXTABLE_SORT
+diff --git a/arch/sh/include/asm/io_noioport.h b/arch/sh/include/asm/io_noioport.h
+index 4d48f1436a63..c727e6ddf69e 100644
+--- a/arch/sh/include/asm/io_noioport.h
++++ b/arch/sh/include/asm/io_noioport.h
+@@ -34,6 +34,17 @@ static inline void outl(unsigned int x, unsigned long port)
+ BUG();
+ }
+
++static inline void __iomem *ioport_map(unsigned long port, unsigned int size)
++{
++ BUG();
++ return NULL;
++}
++
++static inline void ioport_unmap(void __iomem *addr)
++{
++ BUG();
++}
++
+ #define inb_p(addr) inb(addr)
+ #define inw_p(addr) inw(addr)
+ #define inl_p(addr) inl(addr)
+diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
+index 14695c6221c8..84ab119b6ffa 100644
+--- a/block/scsi_ioctl.c
++++ b/block/scsi_ioctl.c
+@@ -438,6 +438,11 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
+ }
+
+ rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_WAIT);
++ if (!rq) {
++ err = -ENOMEM;
++ goto error;
++ }
++ blk_rq_set_block_pc(rq);
+
+ cmdlen = COMMAND_SIZE(opcode);
+
+@@ -491,7 +496,6 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
+ memset(sense, 0, sizeof(sense));
+ rq->sense = sense;
+ rq->sense_len = 0;
+- blk_rq_set_block_pc(rq);
+
+ blk_execute_rq(q, disk, rq, 0);
+
+@@ -511,7 +515,8 @@ out:
+
+ error:
+ kfree(buffer);
+- blk_put_request(rq);
++ if (rq)
++ blk_put_request(rq);
+ return err;
+ }
+ EXPORT_SYMBOL_GPL(sg_scsi_ioctl);
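The sg_scsi_ioctl hunks add two related safeguards: the result of the request allocation is checked before use, and the shared `error:` cleanup path only releases the request if it was actually obtained. A userspace sketch of that goto-based error-path shape (all names here are hypothetical stand-ins for `blk_get_request()`/`blk_put_request()`):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct fake_request { int id; };

static struct fake_request *get_request(int fail)
{
    return fail ? NULL : calloc(1, sizeof(struct fake_request));
}

/*
 * Allocate, bail to a single cleanup label on failure, and guard the
 * release so the error path never frees a NULL request -- the same
 * structure the patch gives sg_scsi_ioctl().
 */
static int do_ioctl(int fail_alloc)
{
    struct fake_request *rq;
    char *buffer = malloc(16);
    int err = 0;

    rq = get_request(fail_alloc);
    if (!rq) {
        err = -ENOMEM;
        goto error;        /* rq is NULL: must not be released below */
    }

    /* ... fill in and execute the request ... */

error:
    free(buffer);
    if (rq)                /* guard added by the fix */
        free(rq);
    return err;
}
```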
+diff --git a/drivers/acpi/acpica/nsobject.c b/drivers/acpi/acpica/nsobject.c
+index fe54a8c73b8c..f1ea8e56cd87 100644
+--- a/drivers/acpi/acpica/nsobject.c
++++ b/drivers/acpi/acpica/nsobject.c
+@@ -239,6 +239,17 @@ void acpi_ns_detach_object(struct acpi_namespace_node *node)
+ }
+ }
+
++ /*
++ * Detach the object from any data objects (which are still held by
++ * the namespace node)
++ */
++
++ if (obj_desc->common.next_object &&
++ ((obj_desc->common.next_object)->common.type ==
++ ACPI_TYPE_LOCAL_DATA)) {
++ obj_desc->common.next_object = NULL;
++ }
++
+ /* Reset the node type to untyped */
+
+ node->type = ACPI_TYPE_ANY;
+diff --git a/drivers/acpi/acpica/utcopy.c b/drivers/acpi/acpica/utcopy.c
+index 270c16464dd9..ff601c0f7c7a 100644
+--- a/drivers/acpi/acpica/utcopy.c
++++ b/drivers/acpi/acpica/utcopy.c
+@@ -1001,5 +1001,11 @@ acpi_ut_copy_iobject_to_iobject(union acpi_operand_object *source_desc,
+ status = acpi_ut_copy_simple_object(source_desc, *dest_desc);
+ }
+
++ /* Delete the allocated object if copy failed */
++
++ if (ACPI_FAILURE(status)) {
++ acpi_ut_remove_reference(*dest_desc);
++ }
++
+ return_ACPI_STATUS(status);
+ }
+diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
+index a66ab658abbc..9922cc46b15c 100644
+--- a/drivers/acpi/ec.c
++++ b/drivers/acpi/ec.c
+@@ -197,6 +197,8 @@ static bool advance_transaction(struct acpi_ec *ec)
+ t->rdata[t->ri++] = acpi_ec_read_data(ec);
+ if (t->rlen == t->ri) {
+ t->flags |= ACPI_EC_COMMAND_COMPLETE;
++ if (t->command == ACPI_EC_COMMAND_QUERY)
++ pr_debug("hardware QR_EC completion\n");
+ wakeup = true;
+ }
+ } else
+@@ -208,7 +210,20 @@ static bool advance_transaction(struct acpi_ec *ec)
+ }
+ return wakeup;
+ } else {
+- if ((status & ACPI_EC_FLAG_IBF) == 0) {
++ /*
++ * Some firmware refuses to respond to QR_EC when SCI_EVT is
++ * not set; in that case, complete the QR_EC transaction
++ * without issuing it to the firmware.
++ * https://bugzilla.kernel.org/show_bug.cgi?id=86211
++ */
++ if (!(status & ACPI_EC_FLAG_SCI) &&
++ (t->command == ACPI_EC_COMMAND_QUERY)) {
++ t->flags |= ACPI_EC_COMMAND_POLL;
++ t->rdata[t->ri++] = 0x00;
++ t->flags |= ACPI_EC_COMMAND_COMPLETE;
++ pr_debug("software QR_EC completion\n");
++ wakeup = true;
++ } else if ((status & ACPI_EC_FLAG_IBF) == 0) {
+ acpi_ec_write_cmd(ec, t->command);
+ t->flags |= ACPI_EC_COMMAND_POLL;
+ } else
+@@ -288,11 +303,11 @@ static int acpi_ec_transaction_unlocked(struct acpi_ec *ec,
+ /* following two actions should be kept atomic */
+ ec->curr = t;
+ start_transaction(ec);
+- if (ec->curr->command == ACPI_EC_COMMAND_QUERY)
+- clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags);
+ spin_unlock_irqrestore(&ec->lock, tmp);
+ ret = ec_poll(ec);
+ spin_lock_irqsave(&ec->lock, tmp);
++ if (ec->curr->command == ACPI_EC_COMMAND_QUERY)
++ clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags);
+ ec->curr = NULL;
+ spin_unlock_irqrestore(&ec->lock, tmp);
+ return ret;
+diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
+index 3dca36d4ad26..17f9ec501972 100644
+--- a/drivers/acpi/processor_idle.c
++++ b/drivers/acpi/processor_idle.c
+@@ -1071,9 +1071,9 @@ int acpi_processor_cst_has_changed(struct acpi_processor *pr)
+
+ if (pr->id == 0 && cpuidle_get_driver() == &acpi_idle_driver) {
+
+- cpuidle_pause_and_lock();
+ /* Protect against cpu-hotplug */
+ get_online_cpus();
++ cpuidle_pause_and_lock();
+
+ /* Disable all cpuidle devices */
+ for_each_online_cpu(cpu) {
+@@ -1100,8 +1100,8 @@ int acpi_processor_cst_has_changed(struct acpi_processor *pr)
+ cpuidle_enable_device(dev);
+ }
+ }
+- put_online_cpus();
+ cpuidle_resume_and_unlock();
++ put_online_cpus();
+ }
+
+ return 0;
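The processor_idle.c hunks reorder the two locks so the cpu-hotplug lock is taken before the cpuidle lock and released after it: nested acquisition in one fixed order, release in exact reverse order, which is what prevents an ABBA deadlock against other paths. A single-threaded sketch that records the events to make the nesting visible (the four function names are illustrative shims, not the kernel API):

```c
#include <assert.h>
#include <string.h>

static char log_buf[128];

static void ev(const char *s) { strcat(log_buf, s); }

/* Shims standing in for the two lock pairs in the patch. */
static void get_online_cpus(void)           { ev("A("); }
static void cpuidle_pause_and_lock(void)    { ev("B("); }
static void cpuidle_resume_and_unlock(void) { ev(")B"); }
static void put_online_cpus(void)           { ev(")A"); }

/* Acquire outer-then-inner, release inner-then-outer. */
static const char *reconfigure(void)
{
    log_buf[0] = '\0';
    get_online_cpus();           /* outer lock first */
    cpuidle_pause_and_lock();    /* then inner lock */
    /* ... disable and re-enable the cpuidle devices ... */
    cpuidle_resume_and_unlock(); /* release inner first */
    put_online_cpus();           /* outer last */
    return log_buf;
}
```

The event string comes out properly nested, `A(B()B)A`; the pre-patch ordering would have produced the interleaved `B(A(…)B)A` shape that can deadlock against a path taking the locks the other way around.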
+diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
+index f775fa0d850f..551f29127369 100644
+--- a/drivers/acpi/scan.c
++++ b/drivers/acpi/scan.c
+@@ -351,7 +351,8 @@ static int acpi_scan_hot_remove(struct acpi_device *device)
+ unsigned long long sta;
+ acpi_status status;
+
+- if (device->handler->hotplug.demand_offline && !acpi_force_hot_remove) {
++ if (device->handler && device->handler->hotplug.demand_offline
++ && !acpi_force_hot_remove) {
+ if (!acpi_scan_is_offline(device, true))
+ return -EBUSY;
+ } else {
+@@ -664,8 +665,14 @@ static ssize_t
+ acpi_device_sun_show(struct device *dev, struct device_attribute *attr,
+ char *buf) {
+ struct acpi_device *acpi_dev = to_acpi_device(dev);
++ acpi_status status;
++ unsigned long long sun;
++
++ status = acpi_evaluate_integer(acpi_dev->handle, "_SUN", NULL, &sun);
++ if (ACPI_FAILURE(status))
++ return -ENODEV;
+
+- return sprintf(buf, "%lu\n", acpi_dev->pnp.sun);
++ return sprintf(buf, "%llu\n", sun);
+ }
+ static DEVICE_ATTR(sun, 0444, acpi_device_sun_show, NULL);
+
+@@ -687,7 +694,6 @@ static int acpi_device_setup_files(struct acpi_device *dev)
+ {
+ struct acpi_buffer buffer = {ACPI_ALLOCATE_BUFFER, NULL};
+ acpi_status status;
+- unsigned long long sun;
+ int result = 0;
+
+ /*
+@@ -728,14 +734,10 @@ static int acpi_device_setup_files(struct acpi_device *dev)
+ if (dev->pnp.unique_id)
+ result = device_create_file(&dev->dev, &dev_attr_uid);
+
+- status = acpi_evaluate_integer(dev->handle, "_SUN", NULL, &sun);
+- if (ACPI_SUCCESS(status)) {
+- dev->pnp.sun = (unsigned long)sun;
++ if (acpi_has_method(dev->handle, "_SUN")) {
+ result = device_create_file(&dev->dev, &dev_attr_sun);
+ if (result)
+ goto end;
+- } else {
+- dev->pnp.sun = (unsigned long)-1;
+ }
+
+ if (acpi_has_method(dev->handle, "_STA")) {
+@@ -919,12 +921,17 @@ static void acpi_device_notify(acpi_handle handle, u32 event, void *data)
+ device->driver->ops.notify(device, event);
+ }
+
+-static acpi_status acpi_device_notify_fixed(void *data)
++static void acpi_device_notify_fixed(void *data)
+ {
+ struct acpi_device *device = data;
+
+ /* Fixed hardware devices have no handles */
+ acpi_device_notify(NULL, ACPI_FIXED_HARDWARE_EVENT, device);
++}
++
++static acpi_status acpi_device_fixed_event(void *data)
++{
++ acpi_os_execute(OSL_NOTIFY_HANDLER, acpi_device_notify_fixed, data);
+ return AE_OK;
+ }
+
+@@ -935,12 +942,12 @@ static int acpi_device_install_notify_handler(struct acpi_device *device)
+ if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON)
+ status =
+ acpi_install_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+- acpi_device_notify_fixed,
++ acpi_device_fixed_event,
+ device);
+ else if (device->device_type == ACPI_BUS_TYPE_SLEEP_BUTTON)
+ status =
+ acpi_install_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
+- acpi_device_notify_fixed,
++ acpi_device_fixed_event,
+ device);
+ else
+ status = acpi_install_notify_handler(device->handle,
+@@ -957,10 +964,10 @@ static void acpi_device_remove_notify_handler(struct acpi_device *device)
+ {
+ if (device->device_type == ACPI_BUS_TYPE_POWER_BUTTON)
+ acpi_remove_fixed_event_handler(ACPI_EVENT_POWER_BUTTON,
+- acpi_device_notify_fixed);
++ acpi_device_fixed_event);
+ else if (device->device_type == ACPI_BUS_TYPE_SLEEP_BUTTON)
+ acpi_remove_fixed_event_handler(ACPI_EVENT_SLEEP_BUTTON,
+- acpi_device_notify_fixed);
++ acpi_device_fixed_event);
+ else
+ acpi_remove_notify_handler(device->handle, ACPI_DEVICE_NOTIFY,
+ acpi_device_notify);
+@@ -972,7 +979,7 @@ static int acpi_device_probe(struct device *dev)
+ struct acpi_driver *acpi_drv = to_acpi_driver(dev->driver);
+ int ret;
+
+- if (acpi_dev->handler)
++ if (acpi_dev->handler && !acpi_is_pnp_device(acpi_dev))
+ return -EINVAL;
+
+ if (!acpi_drv->ops.add)
+diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
+index 350d52a8f781..4834b4cae540 100644
+--- a/drivers/acpi/video.c
++++ b/drivers/acpi/video.c
+@@ -82,9 +82,9 @@ module_param(allow_duplicates, bool, 0644);
+ * For Windows 8 systems: used to decide if video module
+ * should skip registering backlight interface of its own.
+ */
+-static int use_native_backlight_param = 1;
++static int use_native_backlight_param = -1;
+ module_param_named(use_native_backlight, use_native_backlight_param, int, 0444);
+-static bool use_native_backlight_dmi = false;
++static bool use_native_backlight_dmi = true;
+
+ static int register_count;
+ static struct mutex video_list_lock;
+@@ -415,6 +415,12 @@ static int __init video_set_use_native_backlight(const struct dmi_system_id *d)
+ return 0;
+ }
+
++static int __init video_disable_native_backlight(const struct dmi_system_id *d)
++{
++ use_native_backlight_dmi = false;
++ return 0;
++}
++
+ static struct dmi_system_id video_dmi_table[] __initdata = {
+ /*
+ * Broken _BQC workaround http://bugzilla.kernel.org/show_bug.cgi?id=13121
+@@ -645,6 +651,41 @@ static struct dmi_system_id video_dmi_table[] __initdata = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP EliteBook 8780w"),
+ },
+ },
++
++ /*
++ * These models have a working acpi_video backlight control, and using
++ * native backlight causes a regression where backlight does not work
++ * when userspace is not handling brightness key events. Disable
++ * native_backlight on these to fix this:
++ * https://bugzilla.kernel.org/show_bug.cgi?id=81691
++ */
++ {
++ .callback = video_disable_native_backlight,
++ .ident = "ThinkPad T420",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T420"),
++ },
++ },
++ {
++ .callback = video_disable_native_backlight,
++ .ident = "ThinkPad T520",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T520"),
++ },
++ },
++
++ /* The native backlight controls do not work on some older machines */
++ {
++ /* https://bugs.freedesktop.org/show_bug.cgi?id=81515 */
++ .callback = video_disable_native_backlight,
++ .ident = "HP ENVY 15 Notebook",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
++ DMI_MATCH(DMI_PRODUCT_NAME, "HP ENVY 15 Notebook PC"),
++ },
++ },
+ {}
+ };
+
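The video.c hunk extends a DMI quirk table: a `{}`-terminated array of match entries whose callback flips a policy flag when the machine's vendor/product strings match. A compact userspace sketch of that table-plus-callback pattern (the struct and matcher here are simplified stand-ins for `dmi_system_id`/`dmi_check_system`):

```c
#include <assert.h>
#include <string.h>

static int use_native_backlight = 1;

struct quirk {
    const char *vendor;
    const char *product;
    void (*callback)(void);
};

static void disable_native_backlight(void) { use_native_backlight = 0; }

/* NULL-terminated, like the trailing {} entry in video_dmi_table[]. */
static const struct quirk quirks[] = {
    { "LENOVO", "ThinkPad T420", disable_native_backlight },
    { "LENOVO", "ThinkPad T520", disable_native_backlight },
    { NULL, NULL, NULL }
};

static void apply_quirks(const char *vendor, const char *product)
{
    const struct quirk *q;

    for (q = quirks; q->vendor; q++)
        if (!strcmp(q->vendor, vendor) && !strcmp(q->product, product))
            q->callback();
}
```

The design keeps per-machine policy out of the driver logic: adding a model is one table entry, and the default (`use_native_backlight = 1` here) applies to everything unmatched.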
+diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
+index b2c98c1bc037..9dc02c429771 100644
+--- a/drivers/block/rbd.c
++++ b/drivers/block/rbd.c
+@@ -42,6 +42,7 @@
+ #include <linux/blkdev.h>
+ #include <linux/slab.h>
+ #include <linux/idr.h>
++#include <linux/workqueue.h>
+
+ #include "rbd_types.h"
+
+@@ -332,7 +333,10 @@ struct rbd_device {
+
+ char name[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */
+
++ struct list_head rq_queue; /* incoming rq queue */
+ spinlock_t lock; /* queue, flags, open_count */
++ struct workqueue_struct *rq_wq;
++ struct work_struct rq_work;
+
+ struct rbd_image_header header;
+ unsigned long flags; /* possibly lock protected */
+@@ -3183,102 +3187,129 @@ out:
+ return ret;
+ }
+
+-static void rbd_request_fn(struct request_queue *q)
+- __releases(q->queue_lock) __acquires(q->queue_lock)
++static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq)
+ {
+- struct rbd_device *rbd_dev = q->queuedata;
+- struct request *rq;
++ struct rbd_img_request *img_request;
++ u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
++ u64 length = blk_rq_bytes(rq);
++ bool wr = rq_data_dir(rq) == WRITE;
+ int result;
+
+- while ((rq = blk_fetch_request(q))) {
+- bool write_request = rq_data_dir(rq) == WRITE;
+- struct rbd_img_request *img_request;
+- u64 offset;
+- u64 length;
++ /* Ignore/skip any zero-length requests */
+
+- /* Ignore any non-FS requests that filter through. */
++ if (!length) {
++ dout("%s: zero-length request\n", __func__);
++ result = 0;
++ goto err_rq;
++ }
+
+- if (rq->cmd_type != REQ_TYPE_FS) {
+- dout("%s: non-fs request type %d\n", __func__,
+- (int) rq->cmd_type);
+- __blk_end_request_all(rq, 0);
+- continue;
++ /* Disallow writes to a read-only device */
++
++ if (wr) {
++ if (rbd_dev->mapping.read_only) {
++ result = -EROFS;
++ goto err_rq;
+ }
++ rbd_assert(rbd_dev->spec->snap_id == CEPH_NOSNAP);
++ }
+
+- /* Ignore/skip any zero-length requests */
++ /*
++ * Quit early if the mapped snapshot no longer exists. It's
++ * still possible the snapshot will have disappeared by the
++ * time our request arrives at the osd, but there's no sense in
++ * sending it if we already know.
++ */
++ if (!test_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags)) {
++ dout("request for non-existent snapshot");
++ rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
++ result = -ENXIO;
++ goto err_rq;
++ }
+
+- offset = (u64) blk_rq_pos(rq) << SECTOR_SHIFT;
+- length = (u64) blk_rq_bytes(rq);
++ if (offset && length > U64_MAX - offset + 1) {
++ rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
++ length);
++ result = -EINVAL;
++ goto err_rq; /* Shouldn't happen */
++ }
+
+- if (!length) {
+- dout("%s: zero-length request\n", __func__);
+- __blk_end_request_all(rq, 0);
+- continue;
+- }
++ if (offset + length > rbd_dev->mapping.size) {
++ rbd_warn(rbd_dev, "beyond EOD (%llu~%llu > %llu)", offset,
++ length, rbd_dev->mapping.size);
++ result = -EIO;
++ goto err_rq;
++ }
+
+- spin_unlock_irq(q->queue_lock);
++ img_request = rbd_img_request_create(rbd_dev, offset, length, wr);
++ if (!img_request) {
++ result = -ENOMEM;
++ goto err_rq;
++ }
++ img_request->rq = rq;
+
+- /* Disallow writes to a read-only device */
++ result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO, rq->bio);
++ if (result)
++ goto err_img_request;
+
+- if (write_request) {
+- result = -EROFS;
+- if (rbd_dev->mapping.read_only)
+- goto end_request;
+- rbd_assert(rbd_dev->spec->snap_id == CEPH_NOSNAP);
+- }
++ result = rbd_img_request_submit(img_request);
++ if (result)
++ goto err_img_request;
+
+- /*
+- * Quit early if the mapped snapshot no longer
+- * exists. It's still possible the snapshot will
+- * have disappeared by the time our request arrives
+- * at the osd, but there's no sense in sending it if
+- * we already know.
+- */
+- if (!test_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags)) {
+- dout("request for non-existent snapshot");
+- rbd_assert(rbd_dev->spec->snap_id != CEPH_NOSNAP);
+- result = -ENXIO;
+- goto end_request;
+- }
++ return;
+
+- result = -EINVAL;
+- if (offset && length > U64_MAX - offset + 1) {
+- rbd_warn(rbd_dev, "bad request range (%llu~%llu)\n",
+- offset, length);
+- goto end_request; /* Shouldn't happen */
+- }
++err_img_request:
++ rbd_img_request_put(img_request);
++err_rq:
++ if (result)
++ rbd_warn(rbd_dev, "%s %llx at %llx result %d",
++ wr ? "write" : "read", length, offset, result);
++ blk_end_request_all(rq, result);
++}
+
+- result = -EIO;
+- if (offset + length > rbd_dev->mapping.size) {
+- rbd_warn(rbd_dev, "beyond EOD (%llu~%llu > %llu)\n",
+- offset, length, rbd_dev->mapping.size);
+- goto end_request;
+- }
++static void rbd_request_workfn(struct work_struct *work)
++{
++ struct rbd_device *rbd_dev =
++ container_of(work, struct rbd_device, rq_work);
++ struct request *rq, *next;
++ LIST_HEAD(requests);
+
+- result = -ENOMEM;
+- img_request = rbd_img_request_create(rbd_dev, offset, length,
+- write_request);
+- if (!img_request)
+- goto end_request;
++ spin_lock_irq(&rbd_dev->lock); /* rq->q->queue_lock */
++ list_splice_init(&rbd_dev->rq_queue, &requests);
++ spin_unlock_irq(&rbd_dev->lock);
+
+- img_request->rq = rq;
++ list_for_each_entry_safe(rq, next, &requests, queuelist) {
++ list_del_init(&rq->queuelist);
++ rbd_handle_request(rbd_dev, rq);
++ }
++}
+
+- result = rbd_img_request_fill(img_request, OBJ_REQUEST_BIO,
+- rq->bio);
+- if (!result)
+- result = rbd_img_request_submit(img_request);
+- if (result)
+- rbd_img_request_put(img_request);
+-end_request:
+- spin_lock_irq(q->queue_lock);
+- if (result < 0) {
+- rbd_warn(rbd_dev, "%s %llx at %llx result %d\n",
+- write_request ? "write" : "read",
+- length, offset, result);
+-
+- __blk_end_request_all(rq, result);
++/*
++ * Called with q->queue_lock held and interrupts disabled, possibly on
++ * the way to schedule(). Do not sleep here!
++ */
++static void rbd_request_fn(struct request_queue *q)
++{
++ struct rbd_device *rbd_dev = q->queuedata;
++ struct request *rq;
++ int queued = 0;
++
++ rbd_assert(rbd_dev);
++
++ while ((rq = blk_fetch_request(q))) {
++ /* Ignore any non-FS requests that filter through. */
++ if (rq->cmd_type != REQ_TYPE_FS) {
++ dout("%s: non-fs request type %d\n", __func__,
++ (int) rq->cmd_type);
++ __blk_end_request_all(rq, 0);
++ continue;
+ }
++
++ list_add_tail(&rq->queuelist, &rbd_dev->rq_queue);
++ queued++;
+ }
++
++ if (queued)
++ queue_work(rbd_dev->rq_wq, &rbd_dev->rq_work);
+ }
+
+ /*
+@@ -3848,6 +3879,8 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
+ return NULL;
+
+ spin_lock_init(&rbd_dev->lock);
++ INIT_LIST_HEAD(&rbd_dev->rq_queue);
++ INIT_WORK(&rbd_dev->rq_work, rbd_request_workfn);
+ rbd_dev->flags = 0;
+ atomic_set(&rbd_dev->parent_ref, 0);
+ INIT_LIST_HEAD(&rbd_dev->node);
+@@ -5066,12 +5099,17 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
+ ret = rbd_dev_mapping_set(rbd_dev);
+ if (ret)
+ goto err_out_disk;
++
+ set_capacity(rbd_dev->disk, rbd_dev->mapping.size / SECTOR_SIZE);
+ set_disk_ro(rbd_dev->disk, rbd_dev->mapping.read_only);
+
++ rbd_dev->rq_wq = alloc_workqueue(rbd_dev->disk->disk_name, 0, 0);
++ if (!rbd_dev->rq_wq)
++ goto err_out_mapping;
++
+ ret = rbd_bus_add_dev(rbd_dev);
+ if (ret)
+- goto err_out_mapping;
++ goto err_out_workqueue;
+
+ /* Everything's ready. Announce the disk to the world. */
+
+@@ -5083,6 +5121,9 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev)
+
+ return ret;
+
++err_out_workqueue:
++ destroy_workqueue(rbd_dev->rq_wq);
++ rbd_dev->rq_wq = NULL;
+ err_out_mapping:
+ rbd_dev_mapping_clear(rbd_dev);
+ err_out_disk:
+@@ -5314,6 +5355,7 @@ static void rbd_dev_device_release(struct device *dev)
+ {
+ struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+
++ destroy_workqueue(rbd_dev->rq_wq);
+ rbd_free_disk(rbd_dev);
+ clear_bit(RBD_DEV_FLAG_EXISTS, &rbd_dev->flags);
+ rbd_dev_mapping_clear(rbd_dev);
+diff --git a/drivers/bluetooth/btmrvl_drv.h b/drivers/bluetooth/btmrvl_drv.h
+index dc79f88f8717..54d9f2e73495 100644
+--- a/drivers/bluetooth/btmrvl_drv.h
++++ b/drivers/bluetooth/btmrvl_drv.h
+@@ -68,6 +68,7 @@ struct btmrvl_adapter {
+ u8 hs_state;
+ u8 wakeup_tries;
+ wait_queue_head_t cmd_wait_q;
++ wait_queue_head_t event_hs_wait_q;
+ u8 cmd_complete;
+ bool is_suspended;
+ };
+diff --git a/drivers/bluetooth/btmrvl_main.c b/drivers/bluetooth/btmrvl_main.c
+index e9dbddb0b8f1..3ecba5c979bd 100644
+--- a/drivers/bluetooth/btmrvl_main.c
++++ b/drivers/bluetooth/btmrvl_main.c
+@@ -114,6 +114,7 @@ int btmrvl_process_event(struct btmrvl_private *priv, struct sk_buff *skb)
+ adapter->hs_state = HS_ACTIVATED;
+ if (adapter->psmode)
+ adapter->ps_state = PS_SLEEP;
++ wake_up_interruptible(&adapter->event_hs_wait_q);
+ BT_DBG("HS ACTIVATED!");
+ } else {
+ BT_DBG("HS Enable failed");
+@@ -253,11 +254,31 @@ EXPORT_SYMBOL_GPL(btmrvl_enable_ps);
+
+ int btmrvl_enable_hs(struct btmrvl_private *priv)
+ {
++ struct btmrvl_adapter *adapter = priv->adapter;
+ int ret;
+
+ ret = btmrvl_send_sync_cmd(priv, BT_CMD_HOST_SLEEP_ENABLE, NULL, 0);
+- if (ret)
++ if (ret) {
+ BT_ERR("Host sleep enable command failed\n");
++ return ret;
++ }
++
++ ret = wait_event_interruptible_timeout(adapter->event_hs_wait_q,
++ adapter->hs_state,
++ msecs_to_jiffies(WAIT_UNTIL_HS_STATE_CHANGED));
++ if (ret < 0) {
++ BT_ERR("event_hs_wait_q terminated (%d): %d,%d,%d",
++ ret, adapter->hs_state, adapter->ps_state,
++ adapter->wakeup_tries);
++ } else if (!ret) {
++ BT_ERR("hs_enable timeout: %d,%d,%d", adapter->hs_state,
++ adapter->ps_state, adapter->wakeup_tries);
++ ret = -ETIMEDOUT;
++ } else {
++ BT_DBG("host sleep enabled: %d,%d,%d", adapter->hs_state,
++ adapter->ps_state, adapter->wakeup_tries);
++ ret = 0;
++ }
+
+ return ret;
+ }
+@@ -358,6 +379,7 @@ static void btmrvl_init_adapter(struct btmrvl_private *priv)
+ }
+
+ init_waitqueue_head(&priv->adapter->cmd_wait_q);
++ init_waitqueue_head(&priv->adapter->event_hs_wait_q);
+ }
+
+ static void btmrvl_free_adapter(struct btmrvl_private *priv)
+@@ -666,6 +688,7 @@ int btmrvl_remove_card(struct btmrvl_private *priv)
+ hdev = priv->btmrvl_dev.hcidev;
+
+ wake_up_interruptible(&priv->adapter->cmd_wait_q);
++ wake_up_interruptible(&priv->adapter->event_hs_wait_q);
+
+ kthread_stop(priv->main_thread.task);
+
+diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
+index 62e10fd1e1cb..6af17002a115 100644
+--- a/drivers/char/tpm/tpm-interface.c
++++ b/drivers/char/tpm/tpm-interface.c
+@@ -491,11 +491,10 @@ static int tpm_startup(struct tpm_chip *chip, __be16 startup_type)
+ int tpm_get_timeouts(struct tpm_chip *chip)
+ {
+ struct tpm_cmd_t tpm_cmd;
+- struct timeout_t *timeout_cap;
++ unsigned long new_timeout[4];
++ unsigned long old_timeout[4];
+ struct duration_t *duration_cap;
+ ssize_t rc;
+- u32 timeout;
+- unsigned int scale = 1;
+
+ tpm_cmd.header.in = tpm_getcap_header;
+ tpm_cmd.params.getcap_in.cap = TPM_CAP_PROP;
+@@ -529,25 +528,46 @@ int tpm_get_timeouts(struct tpm_chip *chip)
+ != sizeof(tpm_cmd.header.out) + sizeof(u32) + 4 * sizeof(u32))
+ return -EINVAL;
+
+- timeout_cap = &tpm_cmd.params.getcap_out.cap.timeout;
+- /* Don't overwrite default if value is 0 */
+- timeout = be32_to_cpu(timeout_cap->a);
+- if (timeout && timeout < 1000) {
+- /* timeouts in msec rather usec */
+- scale = 1000;
+- chip->vendor.timeout_adjusted = true;
++ old_timeout[0] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.a);
++ old_timeout[1] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.b);
++ old_timeout[2] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.c);
++ old_timeout[3] = be32_to_cpu(tpm_cmd.params.getcap_out.cap.timeout.d);
++ memcpy(new_timeout, old_timeout, sizeof(new_timeout));
++
++ /*
++ * Provide ability for vendor overrides of timeout values in case
++ * of misreporting.
++ */
++ if (chip->ops->update_timeouts != NULL)
++ chip->vendor.timeout_adjusted =
++ chip->ops->update_timeouts(chip, new_timeout);
++
++ if (!chip->vendor.timeout_adjusted) {
++ /* Don't overwrite default if value is 0 */
++ if (new_timeout[0] != 0 && new_timeout[0] < 1000) {
++ int i;
++
++ /* timeouts in msec rather usec */
++ for (i = 0; i != ARRAY_SIZE(new_timeout); i++)
++ new_timeout[i] *= 1000;
++ chip->vendor.timeout_adjusted = true;
++ }
++ }
++
++ /* Report adjusted timeouts */
++ if (chip->vendor.timeout_adjusted) {
++ dev_info(chip->dev,
++ HW_ERR "Adjusting reported timeouts: A %lu->%luus B %lu->%luus C %lu->%luus D %lu->%luus\n",
++ old_timeout[0], new_timeout[0],
++ old_timeout[1], new_timeout[1],
++ old_timeout[2], new_timeout[2],
++ old_timeout[3], new_timeout[3]);
+ }
+- if (timeout)
+- chip->vendor.timeout_a = usecs_to_jiffies(timeout * scale);
+- timeout = be32_to_cpu(timeout_cap->b);
+- if (timeout)
+- chip->vendor.timeout_b = usecs_to_jiffies(timeout * scale);
+- timeout = be32_to_cpu(timeout_cap->c);
+- if (timeout)
+- chip->vendor.timeout_c = usecs_to_jiffies(timeout * scale);
+- timeout = be32_to_cpu(timeout_cap->d);
+- if (timeout)
+- chip->vendor.timeout_d = usecs_to_jiffies(timeout * scale);
++
++ chip->vendor.timeout_a = usecs_to_jiffies(new_timeout[0]);
++ chip->vendor.timeout_b = usecs_to_jiffies(new_timeout[1]);
++ chip->vendor.timeout_c = usecs_to_jiffies(new_timeout[2]);
++ chip->vendor.timeout_d = usecs_to_jiffies(new_timeout[3]);
+
+ duration:
+ tpm_cmd.header.in = tpm_getcap_header;
+@@ -991,13 +1011,13 @@ int tpm_get_random(u32 chip_num, u8 *out, size_t max)
+ int err, total = 0, retries = 5;
+ u8 *dest = out;
+
++ if (!out || !num_bytes || max > TPM_MAX_RNG_DATA)
++ return -EINVAL;
++
+ chip = tpm_chip_find_get(chip_num);
+ if (chip == NULL)
+ return -ENODEV;
+
+- if (!out || !num_bytes || max > TPM_MAX_RNG_DATA)
+- return -EINVAL;
+-
+ do {
+ tpm_cmd.header.in = tpm_getrandom_header;
+ tpm_cmd.params.getrandom_in.num_bytes = cpu_to_be32(num_bytes);
+@@ -1016,6 +1036,7 @@ int tpm_get_random(u32 chip_num, u8 *out, size_t max)
+ num_bytes -= recd;
+ } while (retries-- && total < max);
+
++ tpm_chip_put(chip);
+ return total ? total : -EIO;
+ }
+ EXPORT_SYMBOL_GPL(tpm_get_random);
+@@ -1095,7 +1116,7 @@ struct tpm_chip *tpm_register_hardware(struct device *dev,
+ goto del_misc;
+
+ if (tpm_add_ppi(&dev->kobj))
+- goto del_misc;
++ goto del_sysfs;
+
+ chip->bios_dir = tpm_bios_log_setup(chip->devname);
+
+@@ -1106,6 +1127,8 @@ struct tpm_chip *tpm_register_hardware(struct device *dev,
+
+ return chip;
+
++del_sysfs:
++ tpm_sysfs_del_device(chip);
+ del_misc:
+ tpm_dev_del_device(chip);
+ put_device:
+diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
+index a9ed2270c25d..2c46734b266d 100644
+--- a/drivers/char/tpm/tpm_tis.c
++++ b/drivers/char/tpm/tpm_tis.c
+@@ -373,6 +373,36 @@ out_err:
+ return rc;
+ }
+
++struct tis_vendor_timeout_override {
++ u32 did_vid;
++ unsigned long timeout_us[4];
++};
++
++static const struct tis_vendor_timeout_override vendor_timeout_overrides[] = {
++ /* Atmel 3204 */
++ { 0x32041114, { (TIS_SHORT_TIMEOUT*1000), (TIS_LONG_TIMEOUT*1000),
++ (TIS_SHORT_TIMEOUT*1000), (TIS_SHORT_TIMEOUT*1000) } },
++};
++
++static bool tpm_tis_update_timeouts(struct tpm_chip *chip,
++ unsigned long *timeout_cap)
++{
++ int i;
++ u32 did_vid;
++
++ did_vid = ioread32(chip->vendor.iobase + TPM_DID_VID(0));
++
++ for (i = 0; i != ARRAY_SIZE(vendor_timeout_overrides); i++) {
++ if (vendor_timeout_overrides[i].did_vid != did_vid)
++ continue;
++ memcpy(timeout_cap, vendor_timeout_overrides[i].timeout_us,
++ sizeof(vendor_timeout_overrides[i].timeout_us));
++ return true;
++ }
++
++ return false;
++}
++
+ /*
+ * Early probing for iTPM with STS_DATA_EXPECT flaw.
+ * Try sending command without itpm flag set and if that
+@@ -437,6 +467,7 @@ static const struct tpm_class_ops tpm_tis = {
+ .recv = tpm_tis_recv,
+ .send = tpm_tis_send,
+ .cancel = tpm_tis_ready,
++ .update_timeouts = tpm_tis_update_timeouts,
+ .req_complete_mask = TPM_STS_DATA_AVAIL | TPM_STS_VALID,
+ .req_complete_val = TPM_STS_DATA_AVAIL | TPM_STS_VALID,
+ .req_canceled = tpm_tis_req_canceled,
+diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
+index bb1d08dc8cc8..379c0837f5a9 100644
+--- a/drivers/cpufreq/powernv-cpufreq.c
++++ b/drivers/cpufreq/powernv-cpufreq.c
+@@ -28,6 +28,7 @@
+ #include <linux/of.h>
+
+ #include <asm/cputhreads.h>
++#include <asm/firmware.h>
+ #include <asm/reg.h>
+ #include <asm/smp.h> /* Required for cpu_sibling_mask() in UP configs */
+
+@@ -98,7 +99,11 @@ static int init_powernv_pstates(void)
+ return -ENODEV;
+ }
+
+- WARN_ON(len_ids != len_freqs);
++ if (len_ids != len_freqs) {
++ pr_warn("Entries in ibm,pstate-ids and "
++ "ibm,pstate-frequencies-mhz does not match\n");
++ }
++
+ nr_pstates = min(len_ids, len_freqs) / sizeof(u32);
+ if (!nr_pstates) {
+ pr_warn("No PStates found\n");
+@@ -131,7 +136,12 @@ static unsigned int pstate_id_to_freq(int pstate_id)
+ int i;
+
+ i = powernv_pstate_info.max - pstate_id;
+- BUG_ON(i >= powernv_pstate_info.nr_pstates || i < 0);
++ if (i >= powernv_pstate_info.nr_pstates || i < 0) {
++ pr_warn("PState id %d outside of PState table, "
++ "reporting nominal id %d instead\n",
++ pstate_id, powernv_pstate_info.nominal);
++ i = powernv_pstate_info.max - powernv_pstate_info.nominal;
++ }
+
+ return powernv_freqs[i].frequency;
+ }
+@@ -321,6 +331,10 @@ static int __init powernv_cpufreq_init(void)
+ {
+ int rc = 0;
+
++ /* Don't probe on pseries (guest) platforms */
++ if (!firmware_has_feature(FW_FEATURE_OPALv3))
++ return -ENODEV;
++
+ /* Discover pstates from device tree and init */
+ rc = init_powernv_pstates();
+ if (rc) {
+diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
+index 74f5788d50b1..a64be578dab2 100644
+--- a/drivers/cpuidle/cpuidle-powernv.c
++++ b/drivers/cpuidle/cpuidle-powernv.c
+@@ -160,10 +160,10 @@ static int powernv_cpuidle_driver_init(void)
+ static int powernv_add_idle_states(void)
+ {
+ struct device_node *power_mgt;
+- struct property *prop;
+ int nr_idle_states = 1; /* Snooze */
+ int dt_idle_states;
+- u32 *flags;
++ const __be32 *idle_state_flags;
++ u32 len_flags, flags;
+ int i;
+
+ /* Currently we have snooze statically defined */
+@@ -174,18 +174,18 @@ static int powernv_add_idle_states(void)
+ return nr_idle_states;
+ }
+
+- prop = of_find_property(power_mgt, "ibm,cpu-idle-state-flags", NULL);
+- if (!prop) {
++ idle_state_flags = of_get_property(power_mgt, "ibm,cpu-idle-state-flags", &len_flags);
++ if (!idle_state_flags) {
+ pr_warn("DT-PowerMgmt: missing ibm,cpu-idle-state-flags\n");
+ return nr_idle_states;
+ }
+
+- dt_idle_states = prop->length / sizeof(u32);
+- flags = (u32 *) prop->value;
++ dt_idle_states = len_flags / sizeof(u32);
+
+ for (i = 0; i < dt_idle_states; i++) {
+
+- if (flags[i] & IDLE_USE_INST_NAP) {
++ flags = be32_to_cpu(idle_state_flags[i]);
++ if (flags & IDLE_USE_INST_NAP) {
+ /* Add NAP state */
+ strcpy(powernv_states[nr_idle_states].name, "Nap");
+ strcpy(powernv_states[nr_idle_states].desc, "Nap");
+@@ -196,7 +196,7 @@ static int powernv_add_idle_states(void)
+ nr_idle_states++;
+ }
+
+- if (flags[i] & IDLE_USE_INST_SLEEP) {
++ if (flags & IDLE_USE_INST_SLEEP) {
+ /* Add FASTSLEEP state */
+ strcpy(powernv_states[nr_idle_states].name, "FastSleep");
+ strcpy(powernv_states[nr_idle_states].desc, "FastSleep");
+diff --git a/drivers/firmware/efi/vars.c b/drivers/firmware/efi/vars.c
+index f0a43646a2f3..5abe943e3404 100644
+--- a/drivers/firmware/efi/vars.c
++++ b/drivers/firmware/efi/vars.c
+@@ -481,7 +481,7 @@ EXPORT_SYMBOL_GPL(efivar_entry_remove);
+ */
+ static void efivar_entry_list_del_unlock(struct efivar_entry *entry)
+ {
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ list_del(&entry->list);
+ spin_unlock_irq(&__efivars->lock);
+@@ -507,7 +507,7 @@ int __efivar_entry_delete(struct efivar_entry *entry)
+ const struct efivar_operations *ops = __efivars->ops;
+ efi_status_t status;
+
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ status = ops->set_variable(entry->var.VariableName,
+ &entry->var.VendorGuid,
+@@ -667,7 +667,7 @@ struct efivar_entry *efivar_entry_find(efi_char16_t *name, efi_guid_t guid,
+ int strsize1, strsize2;
+ bool found = false;
+
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ list_for_each_entry_safe(entry, n, head, list) {
+ strsize1 = ucs2_strsize(name, 1024);
+@@ -739,7 +739,7 @@ int __efivar_entry_get(struct efivar_entry *entry, u32 *attributes,
+ const struct efivar_operations *ops = __efivars->ops;
+ efi_status_t status;
+
+- WARN_ON(!spin_is_locked(&__efivars->lock));
++ lockdep_assert_held(&__efivars->lock);
+
+ status = ops->get_variable(entry->var.VariableName,
+ &entry->var.VendorGuid,
+diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
+index 47ad74255bf1..dd469dbeaae1 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_display.c
++++ b/drivers/gpu/drm/nouveau/nouveau_display.c
+@@ -404,6 +404,11 @@ nouveau_display_fini(struct drm_device *dev)
+ {
+ struct nouveau_display *disp = nouveau_display(dev);
+ struct drm_connector *connector;
++ int head;
++
++ /* Make sure that drm and hw vblank irqs get properly disabled. */
++ for (head = 0; head < dev->mode_config.num_crtc; head++)
++ drm_vblank_off(dev, head);
+
+ /* disable hotplug interrupts */
+ list_for_each_entry(connector, &dev->mode_config.connector_list, head) {
+@@ -620,6 +625,8 @@ void
+ nouveau_display_resume(struct drm_device *dev)
+ {
+ struct drm_crtc *crtc;
++ int head;
++
+ nouveau_display_init(dev);
+
+ /* Force CLUT to get re-loaded during modeset */
+@@ -629,6 +636,10 @@ nouveau_display_resume(struct drm_device *dev)
+ nv_crtc->lut.depth = 0;
+ }
+
++ /* Make sure that drm and hw vblank irqs get resumed if needed. */
++ for (head = 0; head < dev->mode_config.num_crtc; head++)
++ drm_vblank_on(dev, head);
++
+ drm_helper_resume_force_mode(dev);
+
+ list_for_each_entry(crtc, &dev->mode_config.crtc_list, head) {
+diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.h b/drivers/gpu/drm/nouveau/nouveau_drm.h
+index 7efbafaf7c1d..b628addcdf69 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_drm.h
++++ b/drivers/gpu/drm/nouveau/nouveau_drm.h
+@@ -10,7 +10,7 @@
+
+ #define DRIVER_MAJOR 1
+ #define DRIVER_MINOR 1
+-#define DRIVER_PATCHLEVEL 1
++#define DRIVER_PATCHLEVEL 2
+
+ /*
+ * 1.1.1:
+@@ -21,6 +21,8 @@
+ * to control registers on the MPs to enable performance counters,
+ * and to control the warp error enable mask (OpenGL requires out of
+ * bounds access to local memory to be silently ignored / return 0).
++ * 1.1.2:
++ * - fixes multiple bugs in flip completion events and timestamping
+ */
+
+ #include <core/client.h>
+diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
+index 767f2cc44bd8..65a8cca603a4 100644
+--- a/drivers/gpu/drm/radeon/cik.c
++++ b/drivers/gpu/drm/radeon/cik.c
+@@ -7901,6 +7901,7 @@ restart_ih:
+ static int cik_startup(struct radeon_device *rdev)
+ {
+ struct radeon_ring *ring;
++ u32 nop;
+ int r;
+
+ /* enable pcie gen2/3 link */
+@@ -8034,9 +8035,15 @@ static int cik_startup(struct radeon_device *rdev)
+ }
+ cik_irq_set(rdev);
+
++ if (rdev->family == CHIP_HAWAII) {
++ nop = RADEON_CP_PACKET2;
++ } else {
++ nop = PACKET3(PACKET3_NOP, 0x3FFF);
++ }
++
+ ring = &rdev->ring[RADEON_RING_TYPE_GFX_INDEX];
+ r = radeon_ring_init(rdev, ring, ring->ring_size, RADEON_WB_CP_RPTR_OFFSET,
+- PACKET3(PACKET3_NOP, 0x3FFF));
++ nop);
+ if (r)
+ return r;
+
+@@ -8044,7 +8051,7 @@ static int cik_startup(struct radeon_device *rdev)
+ /* type-2 packets are deprecated on MEC, use type-3 instead */
+ ring = &rdev->ring[CAYMAN_RING_TYPE_CP1_INDEX];
+ r = radeon_ring_init(rdev, ring, ring->ring_size, RADEON_WB_CP1_RPTR_OFFSET,
+- PACKET3(PACKET3_NOP, 0x3FFF));
++ nop);
+ if (r)
+ return r;
+ ring->me = 1; /* first MEC */
+@@ -8055,7 +8062,7 @@ static int cik_startup(struct radeon_device *rdev)
+ /* type-2 packets are deprecated on MEC, use type-3 instead */
+ ring = &rdev->ring[CAYMAN_RING_TYPE_CP2_INDEX];
+ r = radeon_ring_init(rdev, ring, ring->ring_size, RADEON_WB_CP2_RPTR_OFFSET,
+- PACKET3(PACKET3_NOP, 0x3FFF));
++ nop);
+ if (r)
+ return r;
+ /* dGPU only have 1 MEC */
+diff --git a/drivers/infiniband/core/iwcm.c b/drivers/infiniband/core/iwcm.c
+index 3d2e489ab732..ff9163dc1596 100644
+--- a/drivers/infiniband/core/iwcm.c
++++ b/drivers/infiniband/core/iwcm.c
+@@ -46,6 +46,7 @@
+ #include <linux/completion.h>
+ #include <linux/slab.h>
+ #include <linux/module.h>
++#include <linux/sysctl.h>
+
+ #include <rdma/iw_cm.h>
+ #include <rdma/ib_addr.h>
+@@ -65,6 +66,20 @@ struct iwcm_work {
+ struct list_head free_list;
+ };
+
++static unsigned int default_backlog = 256;
++
++static struct ctl_table_header *iwcm_ctl_table_hdr;
++static struct ctl_table iwcm_ctl_table[] = {
++ {
++ .procname = "default_backlog",
++ .data = &default_backlog,
++ .maxlen = sizeof(default_backlog),
++ .mode = 0644,
++ .proc_handler = proc_dointvec,
++ },
++ { }
++};
++
+ /*
+ * The following services provide a mechanism for pre-allocating iwcm_work
+ * elements. The design pre-allocates them based on the cm_id type:
+@@ -425,6 +440,9 @@ int iw_cm_listen(struct iw_cm_id *cm_id, int backlog)
+
+ cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+
++ if (!backlog)
++ backlog = default_backlog;
++
+ ret = alloc_work_entries(cm_id_priv, backlog);
+ if (ret)
+ return ret;
+@@ -1030,11 +1048,20 @@ static int __init iw_cm_init(void)
+ if (!iwcm_wq)
+ return -ENOMEM;
+
++ iwcm_ctl_table_hdr = register_net_sysctl(&init_net, "net/iw_cm",
++ iwcm_ctl_table);
++ if (!iwcm_ctl_table_hdr) {
++ pr_err("iw_cm: couldn't register sysctl paths\n");
++ destroy_workqueue(iwcm_wq);
++ return -ENOMEM;
++ }
++
+ return 0;
+ }
+
+ static void __exit iw_cm_cleanup(void)
+ {
++ unregister_net_sysctl_table(iwcm_ctl_table_hdr);
+ destroy_workqueue(iwcm_wq);
+ }
+
+diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
+index e3c2c5b4297f..767000811cf9 100644
+--- a/drivers/infiniband/ulp/srp/ib_srp.c
++++ b/drivers/infiniband/ulp/srp/ib_srp.c
+@@ -130,6 +130,7 @@ static void srp_send_completion(struct ib_cq *cq, void *target_ptr);
+ static int srp_cm_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event);
+
+ static struct scsi_transport_template *ib_srp_transport_template;
++static struct workqueue_struct *srp_remove_wq;
+
+ static struct ib_client srp_client = {
+ .name = "srp",
+@@ -731,7 +732,7 @@ static bool srp_queue_remove_work(struct srp_target_port *target)
+ spin_unlock_irq(&target->lock);
+
+ if (changed)
+- queue_work(system_long_wq, &target->remove_work);
++ queue_work(srp_remove_wq, &target->remove_work);
+
+ return changed;
+ }
+@@ -3261,9 +3262,10 @@ static void srp_remove_one(struct ib_device *device)
+ spin_unlock(&host->target_lock);
+
+ /*
+- * Wait for target port removal tasks.
++ * Wait for tl_err and target port removal tasks.
+ */
+ flush_workqueue(system_long_wq);
++ flush_workqueue(srp_remove_wq);
+
+ kfree(host);
+ }
+@@ -3313,16 +3315,22 @@ static int __init srp_init_module(void)
+ indirect_sg_entries = cmd_sg_entries;
+ }
+
++ srp_remove_wq = create_workqueue("srp_remove");
++ if (IS_ERR(srp_remove_wq)) {
++ ret = PTR_ERR(srp_remove_wq);
++ goto out;
++ }
++
++ ret = -ENOMEM;
+ ib_srp_transport_template =
+ srp_attach_transport(&ib_srp_transport_functions);
+ if (!ib_srp_transport_template)
+- return -ENOMEM;
++ goto destroy_wq;
+
+ ret = class_register(&srp_class);
+ if (ret) {
+ pr_err("couldn't register class infiniband_srp\n");
+- srp_release_transport(ib_srp_transport_template);
+- return ret;
++ goto release_tr;
+ }
+
+ ib_sa_register_client(&srp_sa_client);
+@@ -3330,13 +3338,22 @@ static int __init srp_init_module(void)
+ ret = ib_register_client(&srp_client);
+ if (ret) {
+ pr_err("couldn't register IB client\n");
+- srp_release_transport(ib_srp_transport_template);
+- ib_sa_unregister_client(&srp_sa_client);
+- class_unregister(&srp_class);
+- return ret;
++ goto unreg_sa;
+ }
+
+- return 0;
++out:
++ return ret;
++
++unreg_sa:
++ ib_sa_unregister_client(&srp_sa_client);
++ class_unregister(&srp_class);
++
++release_tr:
++ srp_release_transport(ib_srp_transport_template);
++
++destroy_wq:
++ destroy_workqueue(srp_remove_wq);
++ goto out;
+ }
+
+ static void __exit srp_cleanup_module(void)
+@@ -3345,6 +3362,7 @@ static void __exit srp_cleanup_module(void)
+ ib_sa_unregister_client(&srp_sa_client);
+ class_unregister(&srp_class);
+ srp_release_transport(ib_srp_transport_template);
++ destroy_workqueue(srp_remove_wq);
+ }
+
+ module_init(srp_init_module);
+diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
+index 4aec6a29e316..710ffa1830ae 100644
+--- a/drivers/iommu/amd_iommu.c
++++ b/drivers/iommu/amd_iommu.c
+@@ -3227,14 +3227,16 @@ free_domains:
+
+ static void cleanup_domain(struct protection_domain *domain)
+ {
+- struct iommu_dev_data *dev_data, *next;
++ struct iommu_dev_data *entry;
+ unsigned long flags;
+
+ write_lock_irqsave(&amd_iommu_devtable_lock, flags);
+
+- list_for_each_entry_safe(dev_data, next, &domain->dev_list, list) {
+- __detach_device(dev_data);
+- atomic_set(&dev_data->bind, 0);
++ while (!list_empty(&domain->dev_list)) {
++ entry = list_first_entry(&domain->dev_list,
++ struct iommu_dev_data, list);
++ __detach_device(entry);
++ atomic_set(&entry->bind, 0);
+ }
+
+ write_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
+diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
+index 51b6b77dc3e5..382c1801a8f1 100644
+--- a/drivers/iommu/intel-iommu.c
++++ b/drivers/iommu/intel-iommu.c
+@@ -2523,22 +2523,46 @@ static bool device_has_rmrr(struct device *dev)
+ return false;
+ }
+
++/*
++ * There are a couple cases where we need to restrict the functionality of
++ * devices associated with RMRRs. The first is when evaluating a device for
++ * identity mapping because problems exist when devices are moved in and out
++ * of domains and their respective RMRR information is lost. This means that
++ * a device with associated RMRRs will never be in a "passthrough" domain.
++ * The second is use of the device through the IOMMU API. This interface
++ * expects to have full control of the IOVA space for the device. We cannot
++ * satisfy both the requirement that RMRR access is maintained and have an
++ * unencumbered IOVA space. We also have no ability to quiesce the device's
++ * use of the RMRR space or even inform the IOMMU API user of the restriction.
++ * We therefore prevent devices associated with an RMRR from participating in
++ * the IOMMU API, which eliminates them from device assignment.
++ *
++ * In both cases we assume that PCI USB devices with RMRRs have them largely
++ * for historical reasons and that the RMRR space is not actively used post
++ * boot. This exclusion may change if vendors begin to abuse it.
++ */
++static bool device_is_rmrr_locked(struct device *dev)
++{
++ if (!device_has_rmrr(dev))
++ return false;
++
++ if (dev_is_pci(dev)) {
++ struct pci_dev *pdev = to_pci_dev(dev);
++
++ if ((pdev->class >> 8) == PCI_CLASS_SERIAL_USB)
++ return false;
++ }
++
++ return true;
++}
++
+ static int iommu_should_identity_map(struct device *dev, int startup)
+ {
+
+ if (dev_is_pci(dev)) {
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+- /*
+- * We want to prevent any device associated with an RMRR from
+- * getting placed into the SI Domain. This is done because
+- * problems exist when devices are moved in and out of domains
+- * and their respective RMRR info is lost. We exempt USB devices
+- * from this process due to their usage of RMRRs that are known
+- * to not be needed after BIOS hand-off to OS.
+- */
+- if (device_has_rmrr(dev) &&
+- (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
++ if (device_is_rmrr_locked(dev))
+ return 0;
+
+ if ((iommu_identity_mapping & IDENTMAP_AZALIA) && IS_AZALIA(pdev))
+@@ -3867,6 +3891,14 @@ static int device_notifier(struct notifier_block *nb,
+ action != BUS_NOTIFY_DEL_DEVICE)
+ return 0;
+
++ /*
++ * If the device is still attached to a device driver we can't
++ * tear down the domain yet as DMA mappings may still be in use.
++ * Wait for the BUS_NOTIFY_UNBOUND_DRIVER event to do that.
++ */
++ if (action == BUS_NOTIFY_DEL_DEVICE && dev->driver != NULL)
++ return 0;
++
+ domain = find_domain(dev);
+ if (!domain)
+ return 0;
+@@ -4202,6 +4234,11 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
+ int addr_width;
+ u8 bus, devfn;
+
++ if (device_is_rmrr_locked(dev)) {
++ dev_warn(dev, "Device is ineligible for IOMMU domain attach due to platform RMRR requirement. Contact your platform vendor.\n");
++ return -EPERM;
++ }
++
+ /* normally dev is not mapped */
+ if (unlikely(domain_context_mapped(dev))) {
+ struct dmar_domain *old_domain;
+diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
+index 5f59f1e3e5b1..922791009fc5 100644
+--- a/drivers/md/dm-table.c
++++ b/drivers/md/dm-table.c
+@@ -1386,6 +1386,14 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
+ return q && !blk_queue_add_random(q);
+ }
+
++static int queue_supports_sg_merge(struct dm_target *ti, struct dm_dev *dev,
++ sector_t start, sector_t len, void *data)
++{
++ struct request_queue *q = bdev_get_queue(dev->bdev);
++
++ return q && !test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags);
++}
++
+ static bool dm_table_all_devices_attribute(struct dm_table *t,
+ iterate_devices_callout_fn func)
+ {
+@@ -1464,6 +1472,11 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
+ if (!dm_table_supports_write_same(t))
+ q->limits.max_write_same_sectors = 0;
+
++ if (dm_table_all_devices_attribute(t, queue_supports_sg_merge))
++ queue_flag_clear_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
++ else
++ queue_flag_set_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
++
+ dm_table_set_integrity(t);
+
+ /*
+diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
+index 56e24c072b62..d7690f86fdb9 100644
+--- a/drivers/md/raid1.c
++++ b/drivers/md/raid1.c
+@@ -1501,12 +1501,12 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
+ mddev->degraded++;
+ set_bit(Faulty, &rdev->flags);
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+- /*
+- * if recovery is running, make sure it aborts.
+- */
+- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ } else
+ set_bit(Faulty, &rdev->flags);
++ /*
++ * if recovery is running, make sure it aborts.
++ */
++ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+ printk(KERN_ALERT
+ "md/raid1:%s: Disk failure on %s, disabling device.\n"
+diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
+index cb882aae9e20..a46124ecafc7 100644
+--- a/drivers/md/raid10.c
++++ b/drivers/md/raid10.c
+@@ -1684,13 +1684,12 @@ static void error(struct mddev *mddev, struct md_rdev *rdev)
+ spin_unlock_irqrestore(&conf->device_lock, flags);
+ return;
+ }
+- if (test_and_clear_bit(In_sync, &rdev->flags)) {
++ if (test_and_clear_bit(In_sync, &rdev->flags))
+ mddev->degraded++;
+- /*
+- * if recovery is running, make sure it aborts.
+- */
+- set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+- }
++ /*
++ * If recovery is running, make sure it aborts.
++ */
++ set_bit(MD_RECOVERY_INTR, &mddev->recovery);
+ set_bit(Blocked, &rdev->flags);
+ set_bit(Faulty, &rdev->flags);
+ set_bit(MD_CHANGE_DEVS, &mddev->flags);
+@@ -2954,6 +2953,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
+ */
+ if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
+ end_reshape(conf);
++ close_sync(conf);
+ return 0;
+ }
+
+@@ -4411,7 +4411,7 @@ read_more:
+ read_bio->bi_private = r10_bio;
+ read_bio->bi_end_io = end_sync_read;
+ read_bio->bi_rw = READ;
+- read_bio->bi_flags &= ~(BIO_POOL_MASK - 1);
++ read_bio->bi_flags &= (~0UL << BIO_RESET_BITS);
+ read_bio->bi_flags |= 1 << BIO_UPTODATE;
+ read_bio->bi_vcnt = 0;
+ read_bio->bi_iter.bi_size = 0;
+diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
+index 6234b2e84587..183588b11fc1 100644
+--- a/drivers/md/raid5.c
++++ b/drivers/md/raid5.c
+@@ -2922,7 +2922,7 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
+ (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&
+ !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
+ (sh->raid_conf->level == 6 && s->failed && s->to_write &&
+- s->to_write < sh->raid_conf->raid_disks - 2 &&
++ s->to_write - s->non_overwrite < sh->raid_conf->raid_disks - 2 &&
+ (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))))) {
+ /* we would like to get this block, possibly by computing it,
+ * otherwise read it if the backing disk is insync
+@@ -3817,6 +3817,8 @@ static void handle_stripe(struct stripe_head *sh)
+ set_bit(R5_Wantwrite, &dev->flags);
+ if (prexor)
+ continue;
++ if (s.failed > 1)
++ continue;
+ if (!test_bit(R5_Insync, &dev->flags) ||
+ ((i == sh->pd_idx || i == sh->qd_idx) &&
+ s.failed == 0))
+diff --git a/drivers/media/common/siano/Kconfig b/drivers/media/common/siano/Kconfig
+index f953d33ee151..4bfbd5f463d1 100644
+--- a/drivers/media/common/siano/Kconfig
++++ b/drivers/media/common/siano/Kconfig
+@@ -22,8 +22,7 @@ config SMS_SIANO_DEBUGFS
+ bool "Enable debugfs for smsdvb"
+ depends on SMS_SIANO_MDTV
+ depends on DEBUG_FS
+- depends on SMS_USB_DRV
+- depends on CONFIG_SMS_USB_DRV = CONFIG_SMS_SDIO_DRV
++ depends on SMS_USB_DRV = SMS_SDIO_DRV
+
+ ---help---
+ Choose Y to enable visualizing a dump of the frontend
+diff --git a/drivers/media/i2c/mt9v032.c b/drivers/media/i2c/mt9v032.c
+index 40172b8d8ea2..f04d0bbd9cfd 100644
+--- a/drivers/media/i2c/mt9v032.c
++++ b/drivers/media/i2c/mt9v032.c
+@@ -305,8 +305,8 @@ mt9v032_update_hblank(struct mt9v032 *mt9v032)
+
+ if (mt9v032->version->version == MT9V034_CHIP_ID_REV1)
+ min_hblank += (mt9v032->hratio - 1) * 10;
+- min_hblank = max_t(unsigned int, (int)mt9v032->model->data->min_row_time - crop->width,
+- (int)min_hblank);
++ min_hblank = max_t(int, mt9v032->model->data->min_row_time - crop->width,
++ min_hblank);
+ hblank = max_t(unsigned int, mt9v032->hblank, min_hblank);
+
+ return mt9v032_write(client, MT9V032_HORIZONTAL_BLANKING, hblank);
+diff --git a/drivers/media/media-device.c b/drivers/media/media-device.c
+index 88b97c9e64ac..73a432934bd8 100644
+--- a/drivers/media/media-device.c
++++ b/drivers/media/media-device.c
+@@ -106,8 +106,6 @@ static long media_device_enum_entities(struct media_device *mdev,
+ if (ent->name) {
+ strncpy(u_ent.name, ent->name, sizeof(u_ent.name));
+ u_ent.name[sizeof(u_ent.name) - 1] = '\0';
+- } else {
+- memset(u_ent.name, 0, sizeof(u_ent.name));
+ }
+ u_ent.type = ent->type;
+ u_ent.revision = ent->revision;
+diff --git a/drivers/media/platform/vsp1/vsp1_video.c b/drivers/media/platform/vsp1/vsp1_video.c
+index 8a1253e51f04..677e3aa04eee 100644
+--- a/drivers/media/platform/vsp1/vsp1_video.c
++++ b/drivers/media/platform/vsp1/vsp1_video.c
+@@ -654,8 +654,6 @@ static int vsp1_video_buffer_prepare(struct vb2_buffer *vb)
+ if (vb->num_planes < format->num_planes)
+ return -EINVAL;
+
+- buf->video = video;
+-
+ for (i = 0; i < vb->num_planes; ++i) {
+ buf->addr[i] = vb2_dma_contig_plane_dma_addr(vb, i);
+ buf->length[i] = vb2_plane_size(vb, i);
+diff --git a/drivers/media/platform/vsp1/vsp1_video.h b/drivers/media/platform/vsp1/vsp1_video.h
+index c04d48fa2999..7284320d5433 100644
+--- a/drivers/media/platform/vsp1/vsp1_video.h
++++ b/drivers/media/platform/vsp1/vsp1_video.h
+@@ -90,7 +90,6 @@ static inline struct vsp1_pipeline *to_vsp1_pipeline(struct media_entity *e)
+ }
+
+ struct vsp1_video_buffer {
+- struct vsp1_video *video;
+ struct vb2_buffer buf;
+ struct list_head queue;
+
+diff --git a/drivers/media/tuners/xc4000.c b/drivers/media/tuners/xc4000.c
+index 2018befabb5a..e71decbfd0af 100644
+--- a/drivers/media/tuners/xc4000.c
++++ b/drivers/media/tuners/xc4000.c
+@@ -93,7 +93,7 @@ struct xc4000_priv {
+ struct firmware_description *firm;
+ int firm_size;
+ u32 if_khz;
+- u32 freq_hz;
++ u32 freq_hz, freq_offset;
+ u32 bandwidth;
+ u8 video_standard;
+ u8 rf_mode;
+@@ -1157,14 +1157,14 @@ static int xc4000_set_params(struct dvb_frontend *fe)
+ case SYS_ATSC:
+ dprintk(1, "%s() VSB modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_AIR;
+- priv->freq_hz = c->frequency - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = XC4000_DTV6;
+ type = DTV6;
+ break;
+ case SYS_DVBC_ANNEX_B:
+ dprintk(1, "%s() QAM modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_CABLE;
+- priv->freq_hz = c->frequency - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = XC4000_DTV6;
+ type = DTV6;
+ break;
+@@ -1173,23 +1173,23 @@ static int xc4000_set_params(struct dvb_frontend *fe)
+ dprintk(1, "%s() OFDM\n", __func__);
+ if (bw == 0) {
+ if (c->frequency < 400000000) {
+- priv->freq_hz = c->frequency - 2250000;
++ priv->freq_offset = 2250000;
+ } else {
+- priv->freq_hz = c->frequency - 2750000;
++ priv->freq_offset = 2750000;
+ }
+ priv->video_standard = XC4000_DTV7_8;
+ type = DTV78;
+ } else if (bw <= 6000000) {
+ priv->video_standard = XC4000_DTV6;
+- priv->freq_hz = c->frequency - 1750000;
++ priv->freq_offset = 1750000;
+ type = DTV6;
+ } else if (bw <= 7000000) {
+ priv->video_standard = XC4000_DTV7;
+- priv->freq_hz = c->frequency - 2250000;
++ priv->freq_offset = 2250000;
+ type = DTV7;
+ } else {
+ priv->video_standard = XC4000_DTV8;
+- priv->freq_hz = c->frequency - 2750000;
++ priv->freq_offset = 2750000;
+ type = DTV8;
+ }
+ priv->rf_mode = XC_RF_MODE_AIR;
+@@ -1200,6 +1200,8 @@ static int xc4000_set_params(struct dvb_frontend *fe)
+ goto fail;
+ }
+
++ priv->freq_hz = c->frequency - priv->freq_offset;
++
+ dprintk(1, "%s() frequency=%d (compensated)\n",
+ __func__, priv->freq_hz);
+
+@@ -1520,7 +1522,7 @@ static int xc4000_get_frequency(struct dvb_frontend *fe, u32 *freq)
+ {
+ struct xc4000_priv *priv = fe->tuner_priv;
+
+- *freq = priv->freq_hz;
++ *freq = priv->freq_hz + priv->freq_offset;
+
+ if (debug) {
+ mutex_lock(&priv->lock);
+diff --git a/drivers/media/tuners/xc5000.c b/drivers/media/tuners/xc5000.c
+index 2b3d514be672..3091cf7be7a1 100644
+--- a/drivers/media/tuners/xc5000.c
++++ b/drivers/media/tuners/xc5000.c
+@@ -56,7 +56,7 @@ struct xc5000_priv {
+
+ u32 if_khz;
+ u16 xtal_khz;
+- u32 freq_hz;
++ u32 freq_hz, freq_offset;
+ u32 bandwidth;
+ u8 video_standard;
+ u8 rf_mode;
+@@ -749,13 +749,13 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ case SYS_ATSC:
+ dprintk(1, "%s() VSB modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_AIR;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = DTV6;
+ break;
+ case SYS_DVBC_ANNEX_B:
+ dprintk(1, "%s() QAM modulation\n", __func__);
+ priv->rf_mode = XC_RF_MODE_CABLE;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ priv->video_standard = DTV6;
+ break;
+ case SYS_ISDBT:
+@@ -770,15 +770,15 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ switch (bw) {
+ case 6000000:
+ priv->video_standard = DTV6;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ break;
+ case 7000000:
+ priv->video_standard = DTV7;
+- priv->freq_hz = freq - 2250000;
++ priv->freq_offset = 2250000;
+ break;
+ case 8000000:
+ priv->video_standard = DTV8;
+- priv->freq_hz = freq - 2750000;
++ priv->freq_offset = 2750000;
+ break;
+ default:
+ printk(KERN_ERR "xc5000 bandwidth not set!\n");
+@@ -792,15 +792,15 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ priv->rf_mode = XC_RF_MODE_CABLE;
+ if (bw <= 6000000) {
+ priv->video_standard = DTV6;
+- priv->freq_hz = freq - 1750000;
++ priv->freq_offset = 1750000;
+ b = 6;
+ } else if (bw <= 7000000) {
+ priv->video_standard = DTV7;
+- priv->freq_hz = freq - 2250000;
++ priv->freq_offset = 2250000;
+ b = 7;
+ } else {
+ priv->video_standard = DTV7_8;
+- priv->freq_hz = freq - 2750000;
++ priv->freq_offset = 2750000;
+ b = 8;
+ }
+ dprintk(1, "%s() Bandwidth %dMHz (%d)\n", __func__,
+@@ -811,6 +811,8 @@ static int xc5000_set_params(struct dvb_frontend *fe)
+ return -EINVAL;
+ }
+
++ priv->freq_hz = freq - priv->freq_offset;
++
+ dprintk(1, "%s() frequency=%d (compensated to %d)\n",
+ __func__, freq, priv->freq_hz);
+
+@@ -1061,7 +1063,7 @@ static int xc5000_get_frequency(struct dvb_frontend *fe, u32 *freq)
+ {
+ struct xc5000_priv *priv = fe->tuner_priv;
+ dprintk(1, "%s()\n", __func__);
+- *freq = priv->freq_hz;
++ *freq = priv->freq_hz + priv->freq_offset;
+ return 0;
+ }
+
+diff --git a/drivers/media/usb/au0828/au0828-video.c b/drivers/media/usb/au0828/au0828-video.c
+index 9038194513c5..49124b76e4cf 100644
+--- a/drivers/media/usb/au0828/au0828-video.c
++++ b/drivers/media/usb/au0828/au0828-video.c
+@@ -787,11 +787,27 @@ static int au0828_i2s_init(struct au0828_dev *dev)
+
+ /*
+ * Auvitek au0828 analog stream enable
+- * Please set interface0 to AS5 before enable the stream
+ */
+ static int au0828_analog_stream_enable(struct au0828_dev *d)
+ {
++ struct usb_interface *iface;
++ int ret;
++
+ dprintk(1, "au0828_analog_stream_enable called\n");
++
++ iface = usb_ifnum_to_if(d->usbdev, 0);
++ if (iface && iface->cur_altsetting->desc.bAlternateSetting != 5) {
++ dprintk(1, "Changing intf#0 to alt 5\n");
++ /* set au0828 interface0 to AS5 here again */
++ ret = usb_set_interface(d->usbdev, 0, 5);
++ if (ret < 0) {
++ printk(KERN_INFO "Au0828 can't set alt setting to 5!\n");
++ return -EBUSY;
++ }
++ }
++
++ /* FIXME: size should be calculated using d->width, d->height */
++
+ au0828_writereg(d, AU0828_SENSORCTRL_VBI_103, 0x00);
+ au0828_writereg(d, 0x106, 0x00);
+ /* set x position */
+@@ -1002,15 +1018,6 @@ static int au0828_v4l2_open(struct file *filp)
+ return -ERESTARTSYS;
+ }
+ if (dev->users == 0) {
+- /* set au0828 interface0 to AS5 here again */
+- ret = usb_set_interface(dev->usbdev, 0, 5);
+- if (ret < 0) {
+- mutex_unlock(&dev->lock);
+- printk(KERN_INFO "Au0828 can't set alternate to 5!\n");
+- kfree(fh);
+- return -EBUSY;
+- }
+-
+ au0828_analog_stream_enable(dev);
+ au0828_analog_stream_reset(dev);
+
+@@ -1252,13 +1259,6 @@ static int au0828_set_format(struct au0828_dev *dev, unsigned int cmd,
+ }
+ }
+
+- /* set au0828 interface0 to AS5 here again */
+- ret = usb_set_interface(dev->usbdev, 0, 5);
+- if (ret < 0) {
+- printk(KERN_INFO "Au0828 can't set alt setting to 5!\n");
+- return -EBUSY;
+- }
+-
+ au0828_analog_stream_enable(dev);
+
+ return 0;
+diff --git a/drivers/media/v4l2-core/videobuf2-core.c b/drivers/media/v4l2-core/videobuf2-core.c
+index 7c4489c42365..1d67e95311d6 100644
+--- a/drivers/media/v4l2-core/videobuf2-core.c
++++ b/drivers/media/v4l2-core/videobuf2-core.c
+@@ -1750,12 +1750,14 @@ static int vb2_start_streaming(struct vb2_queue *q)
+ __enqueue_in_driver(vb);
+
+ /* Tell the driver to start streaming */
++ q->start_streaming_called = 1;
+ ret = call_qop(q, start_streaming, q,
+ atomic_read(&q->owned_by_drv_count));
+- q->start_streaming_called = ret == 0;
+ if (!ret)
+ return 0;
+
++ q->start_streaming_called = 0;
++
+ dprintk(1, "driver refused to start streaming\n");
+ if (WARN_ON(atomic_read(&q->owned_by_drv_count))) {
+ unsigned i;
+diff --git a/drivers/mfd/omap-usb-host.c b/drivers/mfd/omap-usb-host.c
+index b48d80c367f9..33a9234b701c 100644
+--- a/drivers/mfd/omap-usb-host.c
++++ b/drivers/mfd/omap-usb-host.c
+@@ -445,7 +445,7 @@ static unsigned omap_usbhs_rev1_hostconfig(struct usbhs_hcd_omap *omap,
+
+ for (i = 0; i < omap->nports; i++) {
+ if (is_ehci_phy_mode(pdata->port_mode[i])) {
+- reg &= OMAP_UHH_HOSTCONFIG_ULPI_BYPASS;
++ reg &= ~OMAP_UHH_HOSTCONFIG_ULPI_BYPASS;
+ break;
+ }
+ }
+diff --git a/drivers/mfd/rtsx_usb.c b/drivers/mfd/rtsx_usb.c
+index 6352bec8419a..71f387ce8cbd 100644
+--- a/drivers/mfd/rtsx_usb.c
++++ b/drivers/mfd/rtsx_usb.c
+@@ -744,6 +744,7 @@ static struct usb_device_id rtsx_usb_usb_ids[] = {
+ { USB_DEVICE(0x0BDA, 0x0140) },
+ { }
+ };
++MODULE_DEVICE_TABLE(usb, rtsx_usb_usb_ids);
+
+ static struct usb_driver rtsx_usb_driver = {
+ .name = "rtsx_usb",
+diff --git a/drivers/mfd/twl4030-power.c b/drivers/mfd/twl4030-power.c
+index 3bc969a5916b..4d3ff3771491 100644
+--- a/drivers/mfd/twl4030-power.c
++++ b/drivers/mfd/twl4030-power.c
+@@ -724,24 +724,24 @@ static struct twl4030_script *omap3_idle_scripts[] = {
+ * above.
+ */
+ static struct twl4030_resconfig omap3_idle_rconfig[] = {
+- TWL_REMAP_SLEEP(RES_VAUX1, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VAUX2, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VAUX3, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VAUX4, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VMMC1, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VMMC2, DEV_GRP_NULL, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX1, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX2, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX3, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VAUX4, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VMMC1, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VMMC2, TWL4030_RESCONFIG_UNDEF, 0, 0),
+ TWL_REMAP_OFF(RES_VPLL1, DEV_GRP_P1, 3, 1),
+ TWL_REMAP_SLEEP(RES_VPLL2, DEV_GRP_P1, 0, 0),
+- TWL_REMAP_SLEEP(RES_VSIM, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VDAC, DEV_GRP_NULL, 0, 0),
++ TWL_REMAP_SLEEP(RES_VSIM, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VDAC, TWL4030_RESCONFIG_UNDEF, 0, 0),
+ TWL_REMAP_SLEEP(RES_VINTANA1, TWL_DEV_GRP_P123, 1, 2),
+ TWL_REMAP_SLEEP(RES_VINTANA2, TWL_DEV_GRP_P123, 0, 2),
+ TWL_REMAP_SLEEP(RES_VINTDIG, TWL_DEV_GRP_P123, 1, 2),
+ TWL_REMAP_SLEEP(RES_VIO, TWL_DEV_GRP_P123, 2, 2),
+ TWL_REMAP_OFF(RES_VDD1, DEV_GRP_P1, 4, 1),
+ TWL_REMAP_OFF(RES_VDD2, DEV_GRP_P1, 3, 1),
+- TWL_REMAP_SLEEP(RES_VUSB_1V5, DEV_GRP_NULL, 0, 0),
+- TWL_REMAP_SLEEP(RES_VUSB_1V8, DEV_GRP_NULL, 0, 0),
++ TWL_REMAP_SLEEP(RES_VUSB_1V5, TWL4030_RESCONFIG_UNDEF, 0, 0),
++ TWL_REMAP_SLEEP(RES_VUSB_1V8, TWL4030_RESCONFIG_UNDEF, 0, 0),
+ TWL_REMAP_SLEEP(RES_VUSB_3V1, TWL_DEV_GRP_P123, 0, 0),
+ /* Resource #20 USB charge pump skipped */
+ TWL_REMAP_SLEEP(RES_REGEN, TWL_DEV_GRP_P123, 2, 1),
+diff --git a/drivers/mtd/ftl.c b/drivers/mtd/ftl.c
+index 19d637266fcd..71e4f6ccae2f 100644
+--- a/drivers/mtd/ftl.c
++++ b/drivers/mtd/ftl.c
+@@ -1075,7 +1075,6 @@ static void ftl_add_mtd(struct mtd_blktrans_ops *tr, struct mtd_info *mtd)
+ return;
+ }
+
+- ftl_freepart(partition);
+ kfree(partition);
+ }
+
+diff --git a/drivers/mtd/nand/omap2.c b/drivers/mtd/nand/omap2.c
+index f0ed92e210a1..e2b9b345177a 100644
+--- a/drivers/mtd/nand/omap2.c
++++ b/drivers/mtd/nand/omap2.c
+@@ -931,7 +931,7 @@ static int omap_calculate_ecc(struct mtd_info *mtd, const u_char *dat,
+ u32 val;
+
+ val = readl(info->reg.gpmc_ecc_config);
+- if (((val >> ECC_CONFIG_CS_SHIFT) & ~CS_MASK) != info->gpmc_cs)
++ if (((val >> ECC_CONFIG_CS_SHIFT) & CS_MASK) != info->gpmc_cs)
+ return -EINVAL;
+
+ /* read ecc result */
+diff --git a/drivers/power/bq2415x_charger.c b/drivers/power/bq2415x_charger.c
+index 79a37f6d3307..e384844a1ae1 100644
+--- a/drivers/power/bq2415x_charger.c
++++ b/drivers/power/bq2415x_charger.c
+@@ -840,8 +840,7 @@ static int bq2415x_notifier_call(struct notifier_block *nb,
+ if (bq->automode < 1)
+ return NOTIFY_OK;
+
+- sysfs_notify(&bq->charger.dev->kobj, NULL, "reported_mode");
+- bq2415x_set_mode(bq, bq->reported_mode);
++ schedule_delayed_work(&bq->work, 0);
+
+ return NOTIFY_OK;
+ }
+@@ -892,6 +891,11 @@ static void bq2415x_timer_work(struct work_struct *work)
+ int error;
+ int boost;
+
++ if (bq->automode > 0 && (bq->reported_mode != bq->mode)) {
++ sysfs_notify(&bq->charger.dev->kobj, NULL, "reported_mode");
++ bq2415x_set_mode(bq, bq->reported_mode);
++ }
++
+ if (!bq->autotimer)
+ return;
+
+diff --git a/drivers/regulator/arizona-ldo1.c b/drivers/regulator/arizona-ldo1.c
+index 04f262a836b2..4c9db589f6c1 100644
+--- a/drivers/regulator/arizona-ldo1.c
++++ b/drivers/regulator/arizona-ldo1.c
+@@ -143,8 +143,6 @@ static struct regulator_ops arizona_ldo1_ops = {
+ .map_voltage = regulator_map_voltage_linear,
+ .get_voltage_sel = regulator_get_voltage_sel_regmap,
+ .set_voltage_sel = regulator_set_voltage_sel_regmap,
+- .get_bypass = regulator_get_bypass_regmap,
+- .set_bypass = regulator_set_bypass_regmap,
+ };
+
+ static const struct regulator_desc arizona_ldo1 = {
+diff --git a/drivers/regulator/tps65218-regulator.c b/drivers/regulator/tps65218-regulator.c
+index 9effe48c605e..8b7a0a9ebdfe 100644
+--- a/drivers/regulator/tps65218-regulator.c
++++ b/drivers/regulator/tps65218-regulator.c
+@@ -68,7 +68,7 @@ static const struct regulator_linear_range ldo1_dcdc3_ranges[] = {
+
+ static const struct regulator_linear_range dcdc4_ranges[] = {
+ REGULATOR_LINEAR_RANGE(1175000, 0x0, 0xf, 25000),
+- REGULATOR_LINEAR_RANGE(1550000, 0x10, 0x34, 50000),
++ REGULATOR_LINEAR_RANGE(1600000, 0x10, 0x34, 50000),
+ };
+
+ static struct tps_info tps65218_pmic_regs[] = {
+diff --git a/drivers/scsi/bfa/bfa_ioc.h b/drivers/scsi/bfa/bfa_ioc.h
+index 2e28392c2fb6..a38aafa030b3 100644
+--- a/drivers/scsi/bfa/bfa_ioc.h
++++ b/drivers/scsi/bfa/bfa_ioc.h
+@@ -72,7 +72,7 @@ struct bfa_sge_s {
+ } while (0)
+
+ #define bfa_swap_words(_x) ( \
+- ((_x) << 32) | ((_x) >> 32))
++ ((u64)(_x) << 32) | ((u64)(_x) >> 32))
+
+ #ifdef __BIG_ENDIAN
+ #define bfa_sge_to_be(_x)
+diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
+index 88d46fe6bf98..769be4d50037 100644
+--- a/drivers/scsi/scsi.c
++++ b/drivers/scsi/scsi.c
+@@ -368,8 +368,8 @@ scsi_alloc_host_cmd_pool(struct Scsi_Host *shost)
+ if (!pool)
+ return NULL;
+
+- pool->cmd_name = kasprintf(GFP_KERNEL, "%s_cmd", hostt->name);
+- pool->sense_name = kasprintf(GFP_KERNEL, "%s_sense", hostt->name);
++ pool->cmd_name = kasprintf(GFP_KERNEL, "%s_cmd", hostt->proc_name);
++ pool->sense_name = kasprintf(GFP_KERNEL, "%s_sense", hostt->proc_name);
+ if (!pool->cmd_name || !pool->sense_name) {
+ scsi_free_host_cmd_pool(pool);
+ return NULL;
+@@ -380,6 +380,10 @@ scsi_alloc_host_cmd_pool(struct Scsi_Host *shost)
+ pool->slab_flags |= SLAB_CACHE_DMA;
+ pool->gfp_mask = __GFP_DMA;
+ }
++
++ if (hostt->cmd_size)
++ hostt->cmd_pool = pool;
++
+ return pool;
+ }
+
+@@ -424,8 +428,10 @@ out:
+ out_free_slab:
+ kmem_cache_destroy(pool->cmd_slab);
+ out_free_pool:
+- if (hostt->cmd_size)
++ if (hostt->cmd_size) {
+ scsi_free_host_cmd_pool(pool);
++ hostt->cmd_pool = NULL;
++ }
+ goto out;
+ }
+
+@@ -447,8 +453,10 @@ static void scsi_put_host_cmd_pool(struct Scsi_Host *shost)
+ if (!--pool->users) {
+ kmem_cache_destroy(pool->cmd_slab);
+ kmem_cache_destroy(pool->sense_slab);
+- if (hostt->cmd_size)
++ if (hostt->cmd_size) {
+ scsi_free_host_cmd_pool(pool);
++ hostt->cmd_pool = NULL;
++ }
+ }
+ mutex_unlock(&host_cmd_pool_mutex);
+ }
+diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
+index f969aca0b54e..49014a143c6a 100644
+--- a/drivers/scsi/scsi_devinfo.c
++++ b/drivers/scsi/scsi_devinfo.c
+@@ -222,6 +222,7 @@ static struct {
+ {"PIONEER", "CD-ROM DRM-602X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
+ {"PIONEER", "CD-ROM DRM-604X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
+ {"PIONEER", "CD-ROM DRM-624X", NULL, BLIST_FORCELUN | BLIST_SINGLELUN},
++ {"Promise", "VTrak E610f", NULL, BLIST_SPARSELUN | BLIST_NO_RSOC},
+ {"Promise", "", NULL, BLIST_SPARSELUN},
+ {"QUANTUM", "XP34301", "1071", BLIST_NOTQ},
+ {"REGAL", "CDC-4X", NULL, BLIST_MAX5LUN | BLIST_SINGLELUN},
+diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
+index e02b3aab56ce..a299b82e6b09 100644
+--- a/drivers/scsi/scsi_scan.c
++++ b/drivers/scsi/scsi_scan.c
+@@ -922,6 +922,12 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
+ if (*bflags & BLIST_USE_10_BYTE_MS)
+ sdev->use_10_for_ms = 1;
+
++ /* some devices don't like REPORT SUPPORTED OPERATION CODES
++ * and will simply timeout causing sd_mod init to take a very
++ * very long time */
++ if (*bflags & BLIST_NO_RSOC)
++ sdev->no_report_opcodes = 1;
++
+ /* set the device running here so that slave configure
+ * may do I/O */
+ ret = scsi_device_set_state(sdev, SDEV_RUNNING);
+@@ -950,7 +956,9 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
+
+ sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;
+
+- if (*bflags & BLIST_SKIP_VPD_PAGES)
++ if (*bflags & BLIST_TRY_VPD_PAGES)
++ sdev->try_vpd_pages = 1;
++ else if (*bflags & BLIST_SKIP_VPD_PAGES)
+ sdev->skip_vpd_pages = 1;
+
+ transport_configure_device(&sdev->sdev_gendev);
+@@ -1239,6 +1247,12 @@ static void scsi_sequential_lun_scan(struct scsi_target *starget,
+ max_dev_lun = min(8U, max_dev_lun);
+
+ /*
++ * Stop scanning at 255 unless BLIST_SCSI3LUN
++ */
++ if (!(bflags & BLIST_SCSI3LUN))
++ max_dev_lun = min(256U, max_dev_lun);
++
++ /*
+ * We have already scanned LUN 0, so start at LUN 1. Keep scanning
+ * until we reach the max, or no LUN is found and we are not
+ * sparse_lun.
+diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c
+index 13e898332e45..a0c5bfdc5366 100644
+--- a/drivers/scsi/scsi_transport_srp.c
++++ b/drivers/scsi/scsi_transport_srp.c
+@@ -473,7 +473,8 @@ static void __srp_start_tl_fail_timers(struct srp_rport *rport)
+ if (delay > 0)
+ queue_delayed_work(system_long_wq, &rport->reconnect_work,
+ 1UL * delay * HZ);
+- if (srp_rport_set_state(rport, SRP_RPORT_BLOCKED) == 0) {
++ if ((fast_io_fail_tmo >= 0 || dev_loss_tmo >= 0) &&
++ srp_rport_set_state(rport, SRP_RPORT_BLOCKED) == 0) {
+ pr_debug("%s new state: %d\n", dev_name(&shost->shost_gendev),
+ rport->state);
+ scsi_target_block(&shost->shost_gendev);
+diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
+index 6825eda1114a..ed2e99eca336 100644
+--- a/drivers/scsi/sd.c
++++ b/drivers/scsi/sd.c
+@@ -2681,6 +2681,11 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
+
+ static int sd_try_extended_inquiry(struct scsi_device *sdp)
+ {
++ /* Attempt VPD inquiry if the device blacklist explicitly calls
++ * for it.
++ */
++ if (sdp->try_vpd_pages)
++ return 1;
+ /*
+ * Although VPD inquiries can go to SCSI-2 type devices,
+ * some USB ones crash on receiving them, and the pages
+diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
+index 9969fa1ef7c4..ed0f899e8aa5 100644
+--- a/drivers/scsi/storvsc_drv.c
++++ b/drivers/scsi/storvsc_drv.c
+@@ -33,6 +33,7 @@
+ #include <linux/device.h>
+ #include <linux/hyperv.h>
+ #include <linux/mempool.h>
++#include <linux/blkdev.h>
+ #include <scsi/scsi.h>
+ #include <scsi/scsi_cmnd.h>
+ #include <scsi/scsi_host.h>
+@@ -330,17 +331,17 @@ static int storvsc_timeout = 180;
+
+ static void storvsc_on_channel_callback(void *context);
+
+-/*
+- * In Hyper-V, each port/path/target maps to 1 scsi host adapter. In
+- * reality, the path/target is not used (ie always set to 0) so our
+- * scsi host adapter essentially has 1 bus with 1 target that contains
+- * up to 256 luns.
+- */
+-#define STORVSC_MAX_LUNS_PER_TARGET 64
+-#define STORVSC_MAX_TARGETS 1
+-#define STORVSC_MAX_CHANNELS 1
++#define STORVSC_MAX_LUNS_PER_TARGET 255
++#define STORVSC_MAX_TARGETS 2
++#define STORVSC_MAX_CHANNELS 8
+
++#define STORVSC_FC_MAX_LUNS_PER_TARGET 255
++#define STORVSC_FC_MAX_TARGETS 128
++#define STORVSC_FC_MAX_CHANNELS 8
+
++#define STORVSC_IDE_MAX_LUNS_PER_TARGET 64
++#define STORVSC_IDE_MAX_TARGETS 1
++#define STORVSC_IDE_MAX_CHANNELS 1
+
+ struct storvsc_cmd_request {
+ struct list_head entry;
+@@ -1017,6 +1018,13 @@ static void storvsc_handle_error(struct vmscsi_request *vm_srb,
+ case ATA_12:
+ set_host_byte(scmnd, DID_PASSTHROUGH);
+ break;
++ /*
++ * On Some Windows hosts TEST_UNIT_READY command can return
++ * SRB_STATUS_ERROR, let the upper level code deal with it
++ * based on the sense information.
++ */
++ case TEST_UNIT_READY:
++ break;
+ default:
+ set_host_byte(scmnd, DID_TARGET_FAILURE);
+ }
+@@ -1518,6 +1526,16 @@ static int storvsc_host_reset_handler(struct scsi_cmnd *scmnd)
+ return SUCCESS;
+ }
+
++/*
++ * The host guarantees to respond to each command, although I/O latencies might
++ * be unbounded on Azure. Reset the timer unconditionally to give the host a
++ * chance to perform EH.
++ */
++static enum blk_eh_timer_return storvsc_eh_timed_out(struct scsi_cmnd *scmnd)
++{
++ return BLK_EH_RESET_TIMER;
++}
++
+ static bool storvsc_scsi_cmd_ok(struct scsi_cmnd *scmnd)
+ {
+ bool allowed = true;
+@@ -1553,9 +1571,19 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
+ struct vmscsi_request *vm_srb;
+ struct stor_mem_pools *memp = scmnd->device->hostdata;
+
+- if (!storvsc_scsi_cmd_ok(scmnd)) {
+- scmnd->scsi_done(scmnd);
+- return 0;
++ if (vmstor_current_major <= VMSTOR_WIN8_MAJOR) {
++ /*
++ * On legacy hosts filter unimplemented commands.
++ * Future hosts are expected to correctly handle
++ * unsupported commands. Furthermore, it is
++ * possible that some of the currently
++ * unsupported commands maybe supported in
++ * future versions of the host.
++ */
++ if (!storvsc_scsi_cmd_ok(scmnd)) {
++ scmnd->scsi_done(scmnd);
++ return 0;
++ }
+ }
+
+ request_size = sizeof(struct storvsc_cmd_request);
+@@ -1580,26 +1608,24 @@ static int storvsc_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *scmnd)
+ vm_srb = &cmd_request->vstor_packet.vm_srb;
+ vm_srb->win8_extension.time_out_value = 60;
+
++ vm_srb->win8_extension.srb_flags |=
++ (SRB_FLAGS_QUEUE_ACTION_ENABLE |
++ SRB_FLAGS_DISABLE_SYNCH_TRANSFER);
+
+ /* Build the SRB */
+ switch (scmnd->sc_data_direction) {
+ case DMA_TO_DEVICE:
+ vm_srb->data_in = WRITE_TYPE;
+ vm_srb->win8_extension.srb_flags |= SRB_FLAGS_DATA_OUT;
+- vm_srb->win8_extension.srb_flags |=
+- (SRB_FLAGS_QUEUE_ACTION_ENABLE |
+- SRB_FLAGS_DISABLE_SYNCH_TRANSFER);
+ break;
+ case DMA_FROM_DEVICE:
+ vm_srb->data_in = READ_TYPE;
+ vm_srb->win8_extension.srb_flags |= SRB_FLAGS_DATA_IN;
+- vm_srb->win8_extension.srb_flags |=
+- (SRB_FLAGS_QUEUE_ACTION_ENABLE |
+- SRB_FLAGS_DISABLE_SYNCH_TRANSFER);
+ break;
+ default:
+ vm_srb->data_in = UNKNOWN_TYPE;
+- vm_srb->win8_extension.srb_flags = 0;
++ vm_srb->win8_extension.srb_flags |= (SRB_FLAGS_DATA_IN |
++ SRB_FLAGS_DATA_OUT);
+ break;
+ }
+
+@@ -1687,11 +1713,11 @@ static struct scsi_host_template scsi_driver = {
+ .bios_param = storvsc_get_chs,
+ .queuecommand = storvsc_queuecommand,
+ .eh_host_reset_handler = storvsc_host_reset_handler,
++ .eh_timed_out = storvsc_eh_timed_out,
+ .slave_alloc = storvsc_device_alloc,
+ .slave_destroy = storvsc_device_destroy,
+ .slave_configure = storvsc_device_configure,
+- .cmd_per_lun = 1,
+- /* 64 max_queue * 1 target */
++ .cmd_per_lun = 255,
+ .can_queue = STORVSC_MAX_IO_REQUESTS*STORVSC_MAX_TARGETS,
+ .this_id = -1,
+ /* no use setting to 0 since ll_blk_rw reset it to 1 */
+@@ -1743,19 +1769,25 @@ static int storvsc_probe(struct hv_device *device,
+ * set state to properly communicate with the host.
+ */
+
+- if (vmbus_proto_version == VERSION_WIN8) {
+- sense_buffer_size = POST_WIN7_STORVSC_SENSE_BUFFER_SIZE;
+- vmscsi_size_delta = 0;
+- vmstor_current_major = VMSTOR_WIN8_MAJOR;
+- vmstor_current_minor = VMSTOR_WIN8_MINOR;
+- } else {
++ switch (vmbus_proto_version) {
++ case VERSION_WS2008:
++ case VERSION_WIN7:
+ sense_buffer_size = PRE_WIN8_STORVSC_SENSE_BUFFER_SIZE;
+ vmscsi_size_delta = sizeof(struct vmscsi_win8_extension);
+ vmstor_current_major = VMSTOR_WIN7_MAJOR;
+ vmstor_current_minor = VMSTOR_WIN7_MINOR;
++ break;
++ default:
++ sense_buffer_size = POST_WIN7_STORVSC_SENSE_BUFFER_SIZE;
++ vmscsi_size_delta = 0;
++ vmstor_current_major = VMSTOR_WIN8_MAJOR;
++ vmstor_current_minor = VMSTOR_WIN8_MINOR;
++ break;
+ }
+
+-
++ if (dev_id->driver_data == SFC_GUID)
++ scsi_driver.can_queue = (STORVSC_MAX_IO_REQUESTS *
++ STORVSC_FC_MAX_TARGETS);
+ host = scsi_host_alloc(&scsi_driver,
+ sizeof(struct hv_host_device));
+ if (!host)
+@@ -1789,12 +1821,25 @@ static int storvsc_probe(struct hv_device *device,
+ host_dev->path = stor_device->path_id;
+ host_dev->target = stor_device->target_id;
+
+- /* max # of devices per target */
+- host->max_lun = STORVSC_MAX_LUNS_PER_TARGET;
+- /* max # of targets per channel */
+- host->max_id = STORVSC_MAX_TARGETS;
+- /* max # of channels */
+- host->max_channel = STORVSC_MAX_CHANNELS - 1;
++ switch (dev_id->driver_data) {
++ case SFC_GUID:
++ host->max_lun = STORVSC_FC_MAX_LUNS_PER_TARGET;
++ host->max_id = STORVSC_FC_MAX_TARGETS;
++ host->max_channel = STORVSC_FC_MAX_CHANNELS - 1;
++ break;
++
++ case SCSI_GUID:
++ host->max_lun = STORVSC_MAX_LUNS_PER_TARGET;
++ host->max_id = STORVSC_MAX_TARGETS;
++ host->max_channel = STORVSC_MAX_CHANNELS - 1;
++ break;
++
++ default:
++ host->max_lun = STORVSC_IDE_MAX_LUNS_PER_TARGET;
++ host->max_id = STORVSC_IDE_MAX_TARGETS;
++ host->max_channel = STORVSC_IDE_MAX_CHANNELS - 1;
++ break;
++ }
+ /* max cmd length */
+ host->max_cmd_len = STORVSC_MAX_CMD_LEN;
+
+diff --git a/drivers/spi/spi-omap2-mcspi.c b/drivers/spi/spi-omap2-mcspi.c
+index 4dc77df38864..68441fa448de 100644
+--- a/drivers/spi/spi-omap2-mcspi.c
++++ b/drivers/spi/spi-omap2-mcspi.c
+@@ -149,6 +149,7 @@ struct omap2_mcspi_cs {
+ void __iomem *base;
+ unsigned long phys;
+ int word_len;
++ u16 mode;
+ struct list_head node;
+ /* Context save and restore shadow register */
+ u32 chconf0, chctrl0;
+@@ -926,6 +927,8 @@ static int omap2_mcspi_setup_transfer(struct spi_device *spi,
+
+ mcspi_write_chconf0(spi, l);
+
++ cs->mode = spi->mode;
++
+ dev_dbg(&spi->dev, "setup: speed %d, sample %s edge, clk %s\n",
+ speed_hz,
+ (spi->mode & SPI_CPHA) ? "trailing" : "leading",
+@@ -998,6 +1001,7 @@ static int omap2_mcspi_setup(struct spi_device *spi)
+ return -ENOMEM;
+ cs->base = mcspi->base + spi->chip_select * 0x14;
+ cs->phys = mcspi->phys + spi->chip_select * 0x14;
++ cs->mode = 0;
+ cs->chconf0 = 0;
+ cs->chctrl0 = 0;
+ spi->controller_state = cs;
+@@ -1079,6 +1083,16 @@ static void omap2_mcspi_work(struct omap2_mcspi *mcspi, struct spi_message *m)
+ cs = spi->controller_state;
+ cd = spi->controller_data;
+
++ /*
++ * The slave driver could have changed spi->mode in which case
++ * it will be different from cs->mode (the current hardware setup).
++ * If so, set par_override (even though its not a parity issue) so
++ * omap2_mcspi_setup_transfer will be called to configure the hardware
++ * with the correct mode on the first iteration of the loop below.
++ */
++ if (spi->mode != cs->mode)
++ par_override = 1;
++
+ omap2_mcspi_set_enable(spi, 0);
+ list_for_each_entry(t, &m->transfers, transfer_list) {
+ if (t->tx_buf == NULL && t->rx_buf == NULL && t->len) {
+diff --git a/drivers/spi/spi-orion.c b/drivers/spi/spi-orion.c
+index d018a4aac3a1..c206a4ad83cd 100644
+--- a/drivers/spi/spi-orion.c
++++ b/drivers/spi/spi-orion.c
+@@ -346,8 +346,6 @@ static int orion_spi_probe(struct platform_device *pdev)
+ struct resource *r;
+ unsigned long tclk_hz;
+ int status = 0;
+- const u32 *iprop;
+- int size;
+
+ master = spi_alloc_master(&pdev->dev, sizeof(*spi));
+ if (master == NULL) {
+@@ -358,10 +356,10 @@ static int orion_spi_probe(struct platform_device *pdev)
+ if (pdev->id != -1)
+ master->bus_num = pdev->id;
+ if (pdev->dev.of_node) {
+- iprop = of_get_property(pdev->dev.of_node, "cell-index",
+- &size);
+- if (iprop && size == sizeof(*iprop))
+- master->bus_num = *iprop;
++ u32 cell_index;
++ if (!of_property_read_u32(pdev->dev.of_node, "cell-index",
++ &cell_index))
++ master->bus_num = cell_index;
+ }
+
+ /* we support only mode 0, and no options */
+diff --git a/drivers/spi/spi-pxa2xx.c b/drivers/spi/spi-pxa2xx.c
+index fe792106bdc5..46f45ca2c694 100644
+--- a/drivers/spi/spi-pxa2xx.c
++++ b/drivers/spi/spi-pxa2xx.c
+@@ -1074,6 +1074,7 @@ static struct acpi_device_id pxa2xx_spi_acpi_match[] = {
+ { "INT3430", 0 },
+ { "INT3431", 0 },
+ { "80860F0E", 0 },
++ { "8086228E", 0 },
+ { },
+ };
+ MODULE_DEVICE_TABLE(acpi, pxa2xx_spi_acpi_match);
+diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
+index 500713882ad5..48dcb2e97b90 100644
+--- a/drivers/xen/events/events_fifo.c
++++ b/drivers/xen/events/events_fifo.c
+@@ -99,6 +99,25 @@ static unsigned evtchn_fifo_nr_channels(void)
+ return event_array_pages * EVENT_WORDS_PER_PAGE;
+ }
+
++static int init_control_block(int cpu,
++ struct evtchn_fifo_control_block *control_block)
++{
++ struct evtchn_fifo_queue *q = &per_cpu(cpu_queue, cpu);
++ struct evtchn_init_control init_control;
++ unsigned int i;
++
++ /* Reset the control block and the local HEADs. */
++ clear_page(control_block);
++ for (i = 0; i < EVTCHN_FIFO_MAX_QUEUES; i++)
++ q->head[i] = 0;
++
++ init_control.control_gfn = virt_to_mfn(control_block);
++ init_control.offset = 0;
++ init_control.vcpu = cpu;
++
++ return HYPERVISOR_event_channel_op(EVTCHNOP_init_control, &init_control);
++}
++
+ static void free_unused_array_pages(void)
+ {
+ unsigned i;
+@@ -323,7 +342,6 @@ static void evtchn_fifo_resume(void)
+
+ for_each_possible_cpu(cpu) {
+ void *control_block = per_cpu(cpu_control_block, cpu);
+- struct evtchn_init_control init_control;
+ int ret;
+
+ if (!control_block)
+@@ -340,12 +358,7 @@ static void evtchn_fifo_resume(void)
+ continue;
+ }
+
+- init_control.control_gfn = virt_to_mfn(control_block);
+- init_control.offset = 0;
+- init_control.vcpu = cpu;
+-
+- ret = HYPERVISOR_event_channel_op(EVTCHNOP_init_control,
+- &init_control);
++ ret = init_control_block(cpu, control_block);
+ if (ret < 0)
+ BUG();
+ }
+@@ -373,30 +386,25 @@ static const struct evtchn_ops evtchn_ops_fifo = {
+ .resume = evtchn_fifo_resume,
+ };
+
+-static int evtchn_fifo_init_control_block(unsigned cpu)
++static int evtchn_fifo_alloc_control_block(unsigned cpu)
+ {
+- struct page *control_block = NULL;
+- struct evtchn_init_control init_control;
++ void *control_block = NULL;
+ int ret = -ENOMEM;
+
+- control_block = alloc_page(GFP_KERNEL|__GFP_ZERO);
++ control_block = (void *)__get_free_page(GFP_KERNEL);
+ if (control_block == NULL)
+ goto error;
+
+- init_control.control_gfn = virt_to_mfn(page_address(control_block));
+- init_control.offset = 0;
+- init_control.vcpu = cpu;
+-
+- ret = HYPERVISOR_event_channel_op(EVTCHNOP_init_control, &init_control);
++ ret = init_control_block(cpu, control_block);
+ if (ret < 0)
+ goto error;
+
+- per_cpu(cpu_control_block, cpu) = page_address(control_block);
++ per_cpu(cpu_control_block, cpu) = control_block;
+
+ return 0;
+
+ error:
+- __free_page(control_block);
++ free_page((unsigned long)control_block);
+ return ret;
+ }
+
+@@ -410,7 +418,7 @@ static int evtchn_fifo_cpu_notification(struct notifier_block *self,
+ switch (action) {
+ case CPU_UP_PREPARE:
+ if (!per_cpu(cpu_control_block, cpu))
+- ret = evtchn_fifo_init_control_block(cpu);
++ ret = evtchn_fifo_alloc_control_block(cpu);
+ break;
+ default:
+ break;
+@@ -427,7 +435,7 @@ int __init xen_evtchn_fifo_init(void)
+ int cpu = get_cpu();
+ int ret;
+
+- ret = evtchn_fifo_init_control_block(cpu);
++ ret = evtchn_fifo_alloc_control_block(cpu);
+ if (ret < 0)
+ goto out;
+
+diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
+index de6aed8c78e5..c97fd86cfb1b 100644
+--- a/fs/cifs/cifsglob.h
++++ b/fs/cifs/cifsglob.h
+@@ -70,11 +70,6 @@
+ #define SERVER_NAME_LENGTH 40
+ #define SERVER_NAME_LEN_WITH_NULL (SERVER_NAME_LENGTH + 1)
+
+-/* used to define string lengths for reversing unicode strings */
+-/* (256+1)*2 = 514 */
+-/* (max path length + 1 for null) * 2 for unicode */
+-#define MAX_NAME 514
+-
+ /* SMB echo "timeout" -- FIXME: tunable? */
+ #define SMB_ECHO_INTERVAL (60 * HZ)
+
+@@ -404,6 +399,8 @@ struct smb_version_operations {
+ const struct cifs_fid *, u32 *);
+ int (*set_acl)(struct cifs_ntsd *, __u32, struct inode *, const char *,
+ int);
++ /* check if we need to issue closedir */
++ bool (*dir_needs_close)(struct cifsFileInfo *);
+ };
+
+ struct smb_version_values {
+diff --git a/fs/cifs/file.c b/fs/cifs/file.c
+index e90a1e9aa627..9de08c9dd106 100644
+--- a/fs/cifs/file.c
++++ b/fs/cifs/file.c
+@@ -762,7 +762,7 @@ int cifs_closedir(struct inode *inode, struct file *file)
+
+ cifs_dbg(FYI, "Freeing private data in close dir\n");
+ spin_lock(&cifs_file_list_lock);
+- if (!cfile->srch_inf.endOfSearch && !cfile->invalidHandle) {
++ if (server->ops->dir_needs_close(cfile)) {
+ cfile->invalidHandle = true;
+ spin_unlock(&cifs_file_list_lock);
+ if (server->ops->close_dir)
+@@ -2823,7 +2823,7 @@ cifs_uncached_read_into_pages(struct TCP_Server_Info *server,
+ total_read += result;
+ }
+
+- return total_read > 0 ? total_read : result;
++ return total_read > 0 && result != -EAGAIN ? total_read : result;
+ }
+
+ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
+@@ -3231,7 +3231,7 @@ cifs_readpages_read_into_pages(struct TCP_Server_Info *server,
+ total_read += result;
+ }
+
+- return total_read > 0 ? total_read : result;
++ return total_read > 0 && result != -EAGAIN ? total_read : result;
+ }
+
+ static int cifs_readpages(struct file *file, struct address_space *mapping,
+diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
+index a174605f6afa..d322e7d4e123 100644
+--- a/fs/cifs/inode.c
++++ b/fs/cifs/inode.c
+@@ -1710,13 +1710,22 @@ cifs_rename(struct inode *source_dir, struct dentry *source_dentry,
+ unlink_target:
+ /* Try unlinking the target dentry if it's not negative */
+ if (target_dentry->d_inode && (rc == -EACCES || rc == -EEXIST)) {
+- tmprc = cifs_unlink(target_dir, target_dentry);
++ if (d_is_dir(target_dentry))
++ tmprc = cifs_rmdir(target_dir, target_dentry);
++ else
++ tmprc = cifs_unlink(target_dir, target_dentry);
+ if (tmprc)
+ goto cifs_rename_exit;
+ rc = cifs_do_rename(xid, source_dentry, from_name,
+ target_dentry, to_name);
+ }
+
++ /* force revalidate to go get info when needed */
++ CIFS_I(source_dir)->time = CIFS_I(target_dir)->time = 0;
++
++ source_dir->i_ctime = source_dir->i_mtime = target_dir->i_ctime =
++ target_dir->i_mtime = current_fs_time(source_dir->i_sb);
++
+ cifs_rename_exit:
+ kfree(info_buf_source);
+ kfree(from_name);
+diff --git a/fs/cifs/readdir.c b/fs/cifs/readdir.c
+index b15862e0f68c..b334a89d6a66 100644
+--- a/fs/cifs/readdir.c
++++ b/fs/cifs/readdir.c
+@@ -593,11 +593,11 @@ find_cifs_entry(const unsigned int xid, struct cifs_tcon *tcon, loff_t pos,
+ /* close and restart search */
+ cifs_dbg(FYI, "search backing up - close and restart search\n");
+ spin_lock(&cifs_file_list_lock);
+- if (!cfile->srch_inf.endOfSearch && !cfile->invalidHandle) {
++ if (server->ops->dir_needs_close(cfile)) {
+ cfile->invalidHandle = true;
+ spin_unlock(&cifs_file_list_lock);
+- if (server->ops->close)
+- server->ops->close(xid, tcon, &cfile->fid);
++ if (server->ops->close_dir)
++ server->ops->close_dir(xid, tcon, &cfile->fid);
+ } else
+ spin_unlock(&cifs_file_list_lock);
+ if (cfile->srch_inf.ntwrk_buf_start) {
+diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c
+index d1fdfa848703..84ca0a4caaeb 100644
+--- a/fs/cifs/smb1ops.c
++++ b/fs/cifs/smb1ops.c
+@@ -1009,6 +1009,12 @@ cifs_is_read_op(__u32 oplock)
+ return oplock == OPLOCK_READ;
+ }
+
++static bool
++cifs_dir_needs_close(struct cifsFileInfo *cfile)
++{
++ return !cfile->srch_inf.endOfSearch && !cfile->invalidHandle;
++}
++
+ struct smb_version_operations smb1_operations = {
+ .send_cancel = send_nt_cancel,
+ .compare_fids = cifs_compare_fids,
+@@ -1078,6 +1084,7 @@ struct smb_version_operations smb1_operations = {
+ .query_mf_symlink = cifs_query_mf_symlink,
+ .create_mf_symlink = cifs_create_mf_symlink,
+ .is_read_op = cifs_is_read_op,
++ .dir_needs_close = cifs_dir_needs_close,
+ #ifdef CONFIG_CIFS_XATTR
+ .query_all_EAs = CIFSSMBQAllEAs,
+ .set_EA = CIFSSMBSetEA,
+diff --git a/fs/cifs/smb2file.c b/fs/cifs/smb2file.c
+index 3f17b4550831..45992944e238 100644
+--- a/fs/cifs/smb2file.c
++++ b/fs/cifs/smb2file.c
+@@ -50,7 +50,7 @@ smb2_open_file(const unsigned int xid, struct cifs_open_parms *oparms,
+ goto out;
+ }
+
+- smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ GFP_KERNEL);
+ if (smb2_data == NULL) {
+ rc = -ENOMEM;
+diff --git a/fs/cifs/smb2inode.c b/fs/cifs/smb2inode.c
+index 84c012a6aba0..215f8d3e3e53 100644
+--- a/fs/cifs/smb2inode.c
++++ b/fs/cifs/smb2inode.c
+@@ -131,7 +131,7 @@ smb2_query_path_info(const unsigned int xid, struct cifs_tcon *tcon,
+ *adjust_tz = false;
+ *symlink = false;
+
+- smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ GFP_KERNEL);
+ if (smb2_data == NULL)
+ return -ENOMEM;
+diff --git a/fs/cifs/smb2maperror.c b/fs/cifs/smb2maperror.c
+index 94bd4fbb13d3..a689514e260f 100644
+--- a/fs/cifs/smb2maperror.c
++++ b/fs/cifs/smb2maperror.c
+@@ -214,7 +214,7 @@ static const struct status_to_posix_error smb2_error_map_table[] = {
+ {STATUS_BREAKPOINT, -EIO, "STATUS_BREAKPOINT"},
+ {STATUS_SINGLE_STEP, -EIO, "STATUS_SINGLE_STEP"},
+ {STATUS_BUFFER_OVERFLOW, -EIO, "STATUS_BUFFER_OVERFLOW"},
+- {STATUS_NO_MORE_FILES, -EIO, "STATUS_NO_MORE_FILES"},
++ {STATUS_NO_MORE_FILES, -ENODATA, "STATUS_NO_MORE_FILES"},
+ {STATUS_WAKE_SYSTEM_DEBUGGER, -EIO, "STATUS_WAKE_SYSTEM_DEBUGGER"},
+ {STATUS_HANDLES_CLOSED, -EIO, "STATUS_HANDLES_CLOSED"},
+ {STATUS_NO_INHERITANCE, -EIO, "STATUS_NO_INHERITANCE"},
+@@ -605,7 +605,7 @@ static const struct status_to_posix_error smb2_error_map_table[] = {
+ {STATUS_MAPPED_FILE_SIZE_ZERO, -EIO, "STATUS_MAPPED_FILE_SIZE_ZERO"},
+ {STATUS_TOO_MANY_OPENED_FILES, -EMFILE, "STATUS_TOO_MANY_OPENED_FILES"},
+ {STATUS_CANCELLED, -EIO, "STATUS_CANCELLED"},
+- {STATUS_CANNOT_DELETE, -EIO, "STATUS_CANNOT_DELETE"},
++ {STATUS_CANNOT_DELETE, -EACCES, "STATUS_CANNOT_DELETE"},
+ {STATUS_INVALID_COMPUTER_NAME, -EIO, "STATUS_INVALID_COMPUTER_NAME"},
+ {STATUS_FILE_DELETED, -EIO, "STATUS_FILE_DELETED"},
+ {STATUS_SPECIAL_ACCOUNT, -EIO, "STATUS_SPECIAL_ACCOUNT"},
+diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
+index 787844bde384..f325c59e12e6 100644
+--- a/fs/cifs/smb2ops.c
++++ b/fs/cifs/smb2ops.c
+@@ -339,7 +339,7 @@ smb2_query_file_info(const unsigned int xid, struct cifs_tcon *tcon,
+ int rc;
+ struct smb2_file_all_info *smb2_data;
+
+- smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ smb2_data = kzalloc(sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ GFP_KERNEL);
+ if (smb2_data == NULL)
+ return -ENOMEM;
+@@ -1104,6 +1104,12 @@ smb3_parse_lease_buf(void *buf, unsigned int *epoch)
+ return le32_to_cpu(lc->lcontext.LeaseState);
+ }
+
++static bool
++smb2_dir_needs_close(struct cifsFileInfo *cfile)
++{
++ return !cfile->invalidHandle;
++}
++
+ struct smb_version_operations smb20_operations = {
+ .compare_fids = smb2_compare_fids,
+ .setup_request = smb2_setup_request,
+@@ -1177,6 +1183,7 @@ struct smb_version_operations smb20_operations = {
+ .create_lease_buf = smb2_create_lease_buf,
+ .parse_lease_buf = smb2_parse_lease_buf,
+ .clone_range = smb2_clone_range,
++ .dir_needs_close = smb2_dir_needs_close,
+ };
+
+ struct smb_version_operations smb21_operations = {
+@@ -1252,6 +1259,7 @@ struct smb_version_operations smb21_operations = {
+ .create_lease_buf = smb2_create_lease_buf,
+ .parse_lease_buf = smb2_parse_lease_buf,
+ .clone_range = smb2_clone_range,
++ .dir_needs_close = smb2_dir_needs_close,
+ };
+
+ struct smb_version_operations smb30_operations = {
+@@ -1330,6 +1338,7 @@ struct smb_version_operations smb30_operations = {
+ .parse_lease_buf = smb3_parse_lease_buf,
+ .clone_range = smb2_clone_range,
+ .validate_negotiate = smb3_validate_negotiate,
++ .dir_needs_close = smb2_dir_needs_close,
+ };
+
+ struct smb_version_values smb20_values = {
+diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
+index b0b260dbb19d..87077559a0ab 100644
+--- a/fs/cifs/smb2pdu.c
++++ b/fs/cifs/smb2pdu.c
+@@ -922,7 +922,8 @@ tcon_exit:
+ tcon_error_exit:
+ if (rsp->hdr.Status == STATUS_BAD_NETWORK_NAME) {
+ cifs_dbg(VFS, "BAD_NETWORK_NAME: %s\n", tree);
+- tcon->bad_network_name = true;
++ if (tcon)
++ tcon->bad_network_name = true;
+ }
+ goto tcon_exit;
+ }
+@@ -1545,7 +1546,7 @@ SMB2_query_info(const unsigned int xid, struct cifs_tcon *tcon,
+ {
+ return query_info(xid, tcon, persistent_fid, volatile_fid,
+ FILE_ALL_INFORMATION,
+- sizeof(struct smb2_file_all_info) + MAX_NAME * 2,
++ sizeof(struct smb2_file_all_info) + PATH_MAX * 2,
+ sizeof(struct smb2_file_all_info), data);
+ }
+
+@@ -2141,6 +2142,10 @@ SMB2_query_directory(const unsigned int xid, struct cifs_tcon *tcon,
+ rsp = (struct smb2_query_directory_rsp *)iov[0].iov_base;
+
+ if (rc) {
++ if (rc == -ENODATA && rsp->hdr.Status == STATUS_NO_MORE_FILES) {
++ srch_inf->endOfSearch = true;
++ rc = 0;
++ }
+ cifs_stats_fail_inc(tcon, SMB2_QUERY_DIRECTORY_HE);
+ goto qdir_exit;
+ }
+@@ -2178,11 +2183,6 @@ SMB2_query_directory(const unsigned int xid, struct cifs_tcon *tcon,
+ else
+ cifs_dbg(VFS, "illegal search buffer type\n");
+
+- if (rsp->hdr.Status == STATUS_NO_MORE_FILES)
+- srch_inf->endOfSearch = 1;
+- else
+- srch_inf->endOfSearch = 0;
+-
+ return rc;
+
+ qdir_exit:
+diff --git a/fs/dcache.c b/fs/dcache.c
+index 06f65857a855..e1308c5423ed 100644
+--- a/fs/dcache.c
++++ b/fs/dcache.c
+@@ -106,8 +106,7 @@ static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
+ unsigned int hash)
+ {
+ hash += (unsigned long) parent / L1_CACHE_BYTES;
+- hash = hash + (hash >> d_hash_shift);
+- return dentry_hashtable + (hash & d_hash_mask);
++ return dentry_hashtable + hash_32(hash, d_hash_shift);
+ }
+
+ /* Statistics gathering. */
+diff --git a/fs/namei.c b/fs/namei.c
+index 9eb787e5c167..17ca8b85c308 100644
+--- a/fs/namei.c
++++ b/fs/namei.c
+@@ -34,6 +34,7 @@
+ #include <linux/device_cgroup.h>
+ #include <linux/fs_struct.h>
+ #include <linux/posix_acl.h>
++#include <linux/hash.h>
+ #include <asm/uaccess.h>
+
+ #include "internal.h"
+@@ -1629,8 +1630,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
+
+ static inline unsigned int fold_hash(unsigned long hash)
+ {
+- hash += hash >> (8*sizeof(int));
+- return hash;
++ return hash_64(hash, 32);
+ }
+
+ #else /* 32-bit case */
+diff --git a/fs/namespace.c b/fs/namespace.c
+index 182bc41cd887..140d17705683 100644
+--- a/fs/namespace.c
++++ b/fs/namespace.c
+@@ -779,6 +779,20 @@ static void attach_mnt(struct mount *mnt,
+ list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
+ }
+
++static void attach_shadowed(struct mount *mnt,
++ struct mount *parent,
++ struct mount *shadows)
++{
++ if (shadows) {
++ hlist_add_after_rcu(&shadows->mnt_hash, &mnt->mnt_hash);
++ list_add(&mnt->mnt_child, &shadows->mnt_child);
++ } else {
++ hlist_add_head_rcu(&mnt->mnt_hash,
++ m_hash(&parent->mnt, mnt->mnt_mountpoint));
++ list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
++ }
++}
++
+ /*
+ * vfsmount lock must be held for write
+ */
+@@ -797,12 +811,7 @@ static void commit_tree(struct mount *mnt, struct mount *shadows)
+
+ list_splice(&head, n->list.prev);
+
+- if (shadows)
+- hlist_add_after_rcu(&shadows->mnt_hash, &mnt->mnt_hash);
+- else
+- hlist_add_head_rcu(&mnt->mnt_hash,
+- m_hash(&parent->mnt, mnt->mnt_mountpoint));
+- list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
++ attach_shadowed(mnt, parent, shadows);
+ touch_mnt_namespace(n);
+ }
+
+@@ -890,8 +899,21 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
+
+ mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
+ /* Don't allow unprivileged users to change mount flags */
+- if ((flag & CL_UNPRIVILEGED) && (mnt->mnt.mnt_flags & MNT_READONLY))
+- mnt->mnt.mnt_flags |= MNT_LOCK_READONLY;
++ if (flag & CL_UNPRIVILEGED) {
++ mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;
++
++ if (mnt->mnt.mnt_flags & MNT_READONLY)
++ mnt->mnt.mnt_flags |= MNT_LOCK_READONLY;
++
++ if (mnt->mnt.mnt_flags & MNT_NODEV)
++ mnt->mnt.mnt_flags |= MNT_LOCK_NODEV;
++
++ if (mnt->mnt.mnt_flags & MNT_NOSUID)
++ mnt->mnt.mnt_flags |= MNT_LOCK_NOSUID;
++
++ if (mnt->mnt.mnt_flags & MNT_NOEXEC)
++ mnt->mnt.mnt_flags |= MNT_LOCK_NOEXEC;
++ }
+
+ /* Don't allow unprivileged users to reveal what is under a mount */
+ if ((flag & CL_UNPRIVILEGED) && list_empty(&old->mnt_expire))
+@@ -1213,6 +1235,11 @@ static void namespace_unlock(void)
+ head.first->pprev = &head.first;
+ INIT_HLIST_HEAD(&unmounted);
+
++ /* undo decrements we'd done in umount_tree() */
++ hlist_for_each_entry(mnt, &head, mnt_hash)
++ if (mnt->mnt_ex_mountpoint.mnt)
++ mntget(mnt->mnt_ex_mountpoint.mnt);
++
+ up_write(&namespace_sem);
+
+ synchronize_rcu();
+@@ -1249,6 +1276,9 @@ void umount_tree(struct mount *mnt, int how)
+ hlist_add_head(&p->mnt_hash, &tmp_list);
+ }
+
++ hlist_for_each_entry(p, &tmp_list, mnt_hash)
++ list_del_init(&p->mnt_child);
++
+ if (how)
+ propagate_umount(&tmp_list);
+
+@@ -1259,9 +1289,9 @@ void umount_tree(struct mount *mnt, int how)
+ p->mnt_ns = NULL;
+ if (how < 2)
+ p->mnt.mnt_flags |= MNT_SYNC_UMOUNT;
+- list_del_init(&p->mnt_child);
+ if (mnt_has_parent(p)) {
+ put_mountpoint(p->mnt_mp);
++ mnt_add_count(p->mnt_parent, -1);
+ /* move the reference to mountpoint into ->mnt_ex_mountpoint */
+ p->mnt_ex_mountpoint.dentry = p->mnt_mountpoint;
+ p->mnt_ex_mountpoint.mnt = &p->mnt_parent->mnt;
+@@ -1492,6 +1522,7 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
+ continue;
+
+ for (s = r; s; s = next_mnt(s, r)) {
++ struct mount *t = NULL;
+ if (!(flag & CL_COPY_UNBINDABLE) &&
+ IS_MNT_UNBINDABLE(s)) {
+ s = skip_mnt_tree(s);
+@@ -1513,7 +1544,14 @@ struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
+ goto out;
+ lock_mount_hash();
+ list_add_tail(&q->mnt_list, &res->mnt_list);
+- attach_mnt(q, parent, p->mnt_mp);
++ mnt_set_mountpoint(parent, p->mnt_mp, q);
++ if (!list_empty(&parent->mnt_mounts)) {
++ t = list_last_entry(&parent->mnt_mounts,
++ struct mount, mnt_child);
++ if (t->mnt_mp != p->mnt_mp)
++ t = NULL;
++ }
++ attach_shadowed(q, parent, t);
+ unlock_mount_hash();
+ }
+ }
+@@ -1896,9 +1934,6 @@ static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+ if (readonly_request == __mnt_is_readonly(mnt))
+ return 0;
+
+- if (mnt->mnt_flags & MNT_LOCK_READONLY)
+- return -EPERM;
+-
+ if (readonly_request)
+ error = mnt_make_readonly(real_mount(mnt));
+ else
+@@ -1924,6 +1959,33 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
+ if (path->dentry != path->mnt->mnt_root)
+ return -EINVAL;
+
++ /* Don't allow changing of locked mnt flags.
++ *
++ * No locks need to be held here while testing the various
++ * MNT_LOCK flags because those flags can never be cleared
++ * once they are set.
++ */
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_READONLY) &&
++ !(mnt_flags & MNT_READONLY)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_NODEV) &&
++ !(mnt_flags & MNT_NODEV)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_NOSUID) &&
++ !(mnt_flags & MNT_NOSUID)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_NOEXEC) &&
++ !(mnt_flags & MNT_NOEXEC)) {
++ return -EPERM;
++ }
++ if ((mnt->mnt.mnt_flags & MNT_LOCK_ATIME) &&
++ ((mnt->mnt.mnt_flags & MNT_ATIME_MASK) != (mnt_flags & MNT_ATIME_MASK))) {
++ return -EPERM;
++ }
++
+ err = security_sb_remount(sb, data);
+ if (err)
+ return err;
+@@ -1937,7 +1999,7 @@ static int do_remount(struct path *path, int flags, int mnt_flags,
+ err = do_remount_sb(sb, flags, data, 0);
+ if (!err) {
+ lock_mount_hash();
+- mnt_flags |= mnt->mnt.mnt_flags & MNT_PROPAGATION_MASK;
++ mnt_flags |= mnt->mnt.mnt_flags & ~MNT_USER_SETTABLE_MASK;
+ mnt->mnt.mnt_flags = mnt_flags;
+ touch_mnt_namespace(mnt->mnt_ns);
+ unlock_mount_hash();
+@@ -2122,7 +2184,7 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
+ */
+ if (!(type->fs_flags & FS_USERNS_DEV_MOUNT)) {
+ flags |= MS_NODEV;
+- mnt_flags |= MNT_NODEV;
++ mnt_flags |= MNT_NODEV | MNT_LOCK_NODEV;
+ }
+ }
+
+@@ -2436,6 +2498,14 @@ long do_mount(const char *dev_name, const char *dir_name,
+ if (flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
+
++ /* The default atime for remount is preservation */
++ if ((flags & MS_REMOUNT) &&
++ ((flags & (MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
++ MS_STRICTATIME)) == 0)) {
++ mnt_flags &= ~MNT_ATIME_MASK;
++ mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
++ }
++
+ flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
+ MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
+ MS_STRICTATIME);
+diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
+index ee9cb3795c2b..7e948ffba461 100644
+--- a/fs/notify/fanotify/fanotify.c
++++ b/fs/notify/fanotify/fanotify.c
+@@ -70,8 +70,15 @@ static int fanotify_get_response(struct fsnotify_group *group,
+ wait_event(group->fanotify_data.access_waitq, event->response ||
+ atomic_read(&group->fanotify_data.bypass_perm));
+
+- if (!event->response) /* bypass_perm set */
++ if (!event->response) { /* bypass_perm set */
++ /*
++ * Event was canceled because group is being destroyed. Remove
++ * it from group's event list because we are responsible for
++ * freeing the permission event.
++ */
++ fsnotify_remove_event(group, &event->fae.fse);
+ return 0;
++ }
+
+ /* userspace responded, convert to something usable */
+ switch (event->response) {
+diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
+index 3fdc8a3e1134..2685bc9ea2c9 100644
+--- a/fs/notify/fanotify/fanotify_user.c
++++ b/fs/notify/fanotify/fanotify_user.c
+@@ -359,6 +359,11 @@ static int fanotify_release(struct inode *ignored, struct file *file)
+ #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+ struct fanotify_perm_event_info *event, *next;
+
++ /*
++ * There may be still new events arriving in the notification queue
++ * but since userspace cannot use fanotify fd anymore, no event can
++ * enter or leave access_list by now.
++ */
+ spin_lock(&group->fanotify_data.access_lock);
+
+ atomic_inc(&group->fanotify_data.bypass_perm);
+@@ -373,6 +378,13 @@ static int fanotify_release(struct inode *ignored, struct file *file)
+ }
+ spin_unlock(&group->fanotify_data.access_lock);
+
++ /*
++ * Since bypass_perm is set, newly queued events will not wait for
++ * access response. Wake up the already sleeping ones now.
++ * synchronize_srcu() in fsnotify_destroy_group() will wait for all
++ * processes sleeping in fanotify_handle_event() waiting for access
++ * response and thus also for all permission events to be freed.
++ */
+ wake_up(&group->fanotify_data.access_waitq);
+ #endif
+
+diff --git a/fs/notify/notification.c b/fs/notify/notification.c
+index 1e58402171a5..25a07c70f1c9 100644
+--- a/fs/notify/notification.c
++++ b/fs/notify/notification.c
+@@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
+ /* Overflow events are per-group and we don't want to free them */
+ if (!event || event->mask == FS_Q_OVERFLOW)
+ return;
+-
++ /* If the event is still queued, we have a problem... */
++ WARN_ON(!list_empty(&event->list));
+ group->ops->free_event(event);
+ }
+
+@@ -125,6 +126,21 @@ queue:
+ }
+
+ /*
++ * Remove @event from group's notification queue. It is the responsibility of
++ * the caller to destroy the event.
++ */
++void fsnotify_remove_event(struct fsnotify_group *group,
++ struct fsnotify_event *event)
++{
++ mutex_lock(&group->notification_mutex);
++ if (!list_empty(&event->list)) {
++ list_del_init(&event->list);
++ group->q_len--;
++ }
++ mutex_unlock(&group->notification_mutex);
++}
++
++/*
+ * Remove and return the first event from the notification list. It is the
+ * responsibility of the caller to destroy the obtained event
+ */
+diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
+index 6f66b3751ace..53e6c40ed4c6 100644
+--- a/fs/ocfs2/ioctl.c
++++ b/fs/ocfs2/ioctl.c
+@@ -35,9 +35,8 @@
+ copy_to_user((typeof(a) __user *)b, &(a), sizeof(a))
+
+ /*
+- * This call is void because we are already reporting an error that may
+- * be -EFAULT. The error will be returned from the ioctl(2) call. It's
+- * just a best-effort to tell userspace that this request caused the error.
++ * This is just a best-effort to tell userspace that this request
++ * caused the error.
+ */
+ static inline void o2info_set_request_error(struct ocfs2_info_request *kreq,
+ struct ocfs2_info_request __user *req)
+@@ -146,136 +145,105 @@ bail:
+ static int ocfs2_info_handle_blocksize(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_blocksize oib;
+
+ if (o2info_from_user(oib, req))
+- goto bail;
++ return -EFAULT;
+
+ oib.ib_blocksize = inode->i_sb->s_blocksize;
+
+ o2info_set_request_filled(&oib.ib_req);
+
+ if (o2info_to_user(oib, req))
+- goto bail;
+-
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oib.ib_req, req);
++ return -EFAULT;
+
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_clustersize(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_clustersize oic;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oic, req))
+- goto bail;
++ return -EFAULT;
+
+ oic.ic_clustersize = osb->s_clustersize;
+
+ o2info_set_request_filled(&oic.ic_req);
+
+ if (o2info_to_user(oic, req))
+- goto bail;
+-
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oic.ic_req, req);
++ return -EFAULT;
+
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_maxslots(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_maxslots oim;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oim, req))
+- goto bail;
++ return -EFAULT;
+
+ oim.im_max_slots = osb->max_slots;
+
+ o2info_set_request_filled(&oim.im_req);
+
+ if (o2info_to_user(oim, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oim.im_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_label(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_label oil;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oil, req))
+- goto bail;
++ return -EFAULT;
+
+ memcpy(oil.il_label, osb->vol_label, OCFS2_MAX_VOL_LABEL_LEN);
+
+ o2info_set_request_filled(&oil.il_req);
+
+ if (o2info_to_user(oil, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oil.il_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_uuid(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_uuid oiu;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oiu, req))
+- goto bail;
++ return -EFAULT;
+
+ memcpy(oiu.iu_uuid_str, osb->uuid_str, OCFS2_TEXT_UUID_LEN + 1);
+
+ o2info_set_request_filled(&oiu.iu_req);
+
+ if (o2info_to_user(oiu, req))
+- goto bail;
+-
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oiu.iu_req, req);
++ return -EFAULT;
+
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_fs_features(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_fs_features oif;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oif, req))
+- goto bail;
++ return -EFAULT;
+
+ oif.if_compat_features = osb->s_feature_compat;
+ oif.if_incompat_features = osb->s_feature_incompat;
+@@ -284,39 +252,28 @@ static int ocfs2_info_handle_fs_features(struct inode *inode,
+ o2info_set_request_filled(&oif.if_req);
+
+ if (o2info_to_user(oif, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oif.if_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_handle_journal_size(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_journal_size oij;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+ if (o2info_from_user(oij, req))
+- goto bail;
++ return -EFAULT;
+
+ oij.ij_journal_size = i_size_read(osb->journal->j_inode);
+
+ o2info_set_request_filled(&oij.ij_req);
+
+ if (o2info_to_user(oij, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oij.ij_req, req);
+-
+- return status;
++ return 0;
+ }
+
+ static int ocfs2_info_scan_inode_alloc(struct ocfs2_super *osb,
+@@ -373,7 +330,7 @@ static int ocfs2_info_handle_freeinode(struct inode *inode,
+ u32 i;
+ u64 blkno = -1;
+ char namebuf[40];
+- int status = -EFAULT, type = INODE_ALLOC_SYSTEM_INODE;
++ int status, type = INODE_ALLOC_SYSTEM_INODE;
+ struct ocfs2_info_freeinode *oifi = NULL;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+ struct inode *inode_alloc = NULL;
+@@ -385,8 +342,10 @@ static int ocfs2_info_handle_freeinode(struct inode *inode,
+ goto out_err;
+ }
+
+- if (o2info_from_user(*oifi, req))
+- goto bail;
++ if (o2info_from_user(*oifi, req)) {
++ status = -EFAULT;
++ goto out_free;
++ }
+
+ oifi->ifi_slotnum = osb->max_slots;
+
+@@ -424,14 +383,16 @@ static int ocfs2_info_handle_freeinode(struct inode *inode,
+
+ o2info_set_request_filled(&oifi->ifi_req);
+
+- if (o2info_to_user(*oifi, req))
+- goto bail;
++ if (o2info_to_user(*oifi, req)) {
++ status = -EFAULT;
++ goto out_free;
++ }
+
+ status = 0;
+ bail:
+ if (status)
+ o2info_set_request_error(&oifi->ifi_req, req);
+-
++out_free:
+ kfree(oifi);
+ out_err:
+ return status;
+@@ -658,7 +619,7 @@ static int ocfs2_info_handle_freefrag(struct inode *inode,
+ {
+ u64 blkno = -1;
+ char namebuf[40];
+- int status = -EFAULT, type = GLOBAL_BITMAP_SYSTEM_INODE;
++ int status, type = GLOBAL_BITMAP_SYSTEM_INODE;
+
+ struct ocfs2_info_freefrag *oiff;
+ struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+@@ -671,8 +632,10 @@ static int ocfs2_info_handle_freefrag(struct inode *inode,
+ goto out_err;
+ }
+
+- if (o2info_from_user(*oiff, req))
+- goto bail;
++ if (o2info_from_user(*oiff, req)) {
++ status = -EFAULT;
++ goto out_free;
++ }
+ /*
+ * chunksize from userspace should be power of 2.
+ */
+@@ -711,14 +674,14 @@ static int ocfs2_info_handle_freefrag(struct inode *inode,
+
+ if (o2info_to_user(*oiff, req)) {
+ status = -EFAULT;
+- goto bail;
++ goto out_free;
+ }
+
+ status = 0;
+ bail:
+ if (status)
+ o2info_set_request_error(&oiff->iff_req, req);
+-
++out_free:
+ kfree(oiff);
+ out_err:
+ return status;
+@@ -727,23 +690,17 @@ out_err:
+ static int ocfs2_info_handle_unknown(struct inode *inode,
+ struct ocfs2_info_request __user *req)
+ {
+- int status = -EFAULT;
+ struct ocfs2_info_request oir;
+
+ if (o2info_from_user(oir, req))
+- goto bail;
++ return -EFAULT;
+
+ o2info_clear_request_filled(&oir);
+
+ if (o2info_to_user(oir, req))
+- goto bail;
++ return -EFAULT;
+
+- status = 0;
+-bail:
+- if (status)
+- o2info_set_request_error(&oir, req);
+-
+- return status;
++ return 0;
+ }
+
+ /*
+diff --git a/fs/pnode.c b/fs/pnode.c
+index 302bf22c4a30..aae331a5d03b 100644
+--- a/fs/pnode.c
++++ b/fs/pnode.c
+@@ -381,6 +381,7 @@ static void __propagate_umount(struct mount *mnt)
+ * other children
+ */
+ if (child && list_empty(&child->mnt_mounts)) {
++ list_del_init(&child->mnt_child);
+ hlist_del_init_rcu(&child->mnt_hash);
+ hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
+ }
+diff --git a/fs/proc/array.c b/fs/proc/array.c
+index 64db2bceac59..3e1290b0492e 100644
+--- a/fs/proc/array.c
++++ b/fs/proc/array.c
+@@ -297,15 +297,11 @@ static void render_cap_t(struct seq_file *m, const char *header,
+ seq_puts(m, header);
+ CAP_FOR_EACH_U32(__capi) {
+ seq_printf(m, "%08x",
+- a->cap[(_KERNEL_CAPABILITY_U32S-1) - __capi]);
++ a->cap[CAP_LAST_U32 - __capi]);
+ }
+ seq_putc(m, '\n');
+ }
+
+-/* Remove non-existent capabilities */
+-#define NORM_CAPS(v) (v.cap[CAP_TO_INDEX(CAP_LAST_CAP)] &= \
+- CAP_TO_MASK(CAP_LAST_CAP + 1) - 1)
+-
+ static inline void task_cap(struct seq_file *m, struct task_struct *p)
+ {
+ const struct cred *cred;
+@@ -319,11 +315,6 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
+ cap_bset = cred->cap_bset;
+ rcu_read_unlock();
+
+- NORM_CAPS(cap_inheritable);
+- NORM_CAPS(cap_permitted);
+- NORM_CAPS(cap_effective);
+- NORM_CAPS(cap_bset);
+-
+ render_cap_t(m, "CapInh:\t", &cap_inheritable);
+ render_cap_t(m, "CapPrm:\t", &cap_permitted);
+ render_cap_t(m, "CapEff:\t", &cap_effective);
+diff --git a/fs/reiserfs/do_balan.c b/fs/reiserfs/do_balan.c
+index 54fdf196bfb2..4d5e5297793f 100644
+--- a/fs/reiserfs/do_balan.c
++++ b/fs/reiserfs/do_balan.c
+@@ -286,12 +286,14 @@ static int balance_leaf_when_delete(struct tree_balance *tb, int flag)
+ return 0;
+ }
+
+-static void balance_leaf_insert_left(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++static unsigned int balance_leaf_insert_left(struct tree_balance *tb,
++ struct item_head *const ih,
++ const char * const body)
+ {
+ int ret;
+ struct buffer_info bi;
+ int n = B_NR_ITEMS(tb->L[0]);
++ unsigned body_shift_bytes = 0;
+
+ if (tb->item_pos == tb->lnum[0] - 1 && tb->lbytes != -1) {
+ /* part of new item falls into L[0] */
+@@ -329,7 +331,7 @@ static void balance_leaf_insert_left(struct tree_balance *tb,
+
+ put_ih_item_len(ih, new_item_len);
+ if (tb->lbytes > tb->zeroes_num) {
+- body += (tb->lbytes - tb->zeroes_num);
++ body_shift_bytes = tb->lbytes - tb->zeroes_num;
+ tb->zeroes_num = 0;
+ } else
+ tb->zeroes_num -= tb->lbytes;
+@@ -349,11 +351,12 @@ static void balance_leaf_insert_left(struct tree_balance *tb,
+ tb->insert_size[0] = 0;
+ tb->zeroes_num = 0;
+ }
++ return body_shift_bytes;
+ }
+
+ static void balance_leaf_paste_left_shift_dirent(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ int n = B_NR_ITEMS(tb->L[0]);
+ struct buffer_info bi;
+@@ -413,17 +416,18 @@ static void balance_leaf_paste_left_shift_dirent(struct tree_balance *tb,
+ tb->pos_in_item -= tb->lbytes;
+ }
+
+-static void balance_leaf_paste_left_shift(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++static unsigned int balance_leaf_paste_left_shift(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tb->L[0]);
+ struct buffer_info bi;
++ int body_shift_bytes = 0;
+
+ if (is_direntry_le_ih(item_head(tbS0, tb->item_pos))) {
+ balance_leaf_paste_left_shift_dirent(tb, ih, body);
+- return;
++ return 0;
+ }
+
+ RFALSE(tb->lbytes <= 0,
+@@ -497,7 +501,7 @@ static void balance_leaf_paste_left_shift(struct tree_balance *tb,
+ * insert_size[0]
+ */
+ if (l_n > tb->zeroes_num) {
+- body += (l_n - tb->zeroes_num);
++ body_shift_bytes = l_n - tb->zeroes_num;
+ tb->zeroes_num = 0;
+ } else
+ tb->zeroes_num -= l_n;
+@@ -526,13 +530,14 @@ static void balance_leaf_paste_left_shift(struct tree_balance *tb,
+ */
+ leaf_shift_left(tb, tb->lnum[0], tb->lbytes);
+ }
++ return body_shift_bytes;
+ }
+
+
+ /* appended item will be in L[0] in whole */
+ static void balance_leaf_paste_left_whole(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tb->L[0]);
+@@ -584,39 +589,44 @@ static void balance_leaf_paste_left_whole(struct tree_balance *tb,
+ tb->zeroes_num = 0;
+ }
+
+-static void balance_leaf_paste_left(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++static unsigned int balance_leaf_paste_left(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body)
+ {
+ /* we must shift the part of the appended item */
+ if (tb->item_pos == tb->lnum[0] - 1 && tb->lbytes != -1)
+- balance_leaf_paste_left_shift(tb, ih, body);
++ return balance_leaf_paste_left_shift(tb, ih, body);
+ else
+ balance_leaf_paste_left_whole(tb, ih, body);
++ return 0;
+ }
+
+ /* Shift lnum[0] items from S[0] to the left neighbor L[0] */
+-static void balance_leaf_left(struct tree_balance *tb, struct item_head *ih,
+- const char *body, int flag)
++static unsigned int balance_leaf_left(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body, int flag)
+ {
+ if (tb->lnum[0] <= 0)
+- return;
++ return 0;
+
+ /* new item or it part falls to L[0], shift it too */
+ if (tb->item_pos < tb->lnum[0]) {
+ BUG_ON(flag != M_INSERT && flag != M_PASTE);
+
+ if (flag == M_INSERT)
+- balance_leaf_insert_left(tb, ih, body);
++ return balance_leaf_insert_left(tb, ih, body);
+ else /* M_PASTE */
+- balance_leaf_paste_left(tb, ih, body);
++ return balance_leaf_paste_left(tb, ih, body);
+ } else
+ /* new item doesn't fall into L[0] */
+ leaf_shift_left(tb, tb->lnum[0], tb->lbytes);
++ return 0;
+ }
+
+
+ static void balance_leaf_insert_right(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+@@ -704,7 +714,8 @@ static void balance_leaf_insert_right(struct tree_balance *tb,
+
+
+ static void balance_leaf_paste_right_shift_dirent(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct buffer_info bi;
+@@ -754,7 +765,8 @@ static void balance_leaf_paste_right_shift_dirent(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_paste_right_shift(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n_shift, n_rem, r_zeroes_number, version;
+@@ -831,7 +843,8 @@ static void balance_leaf_paste_right_shift(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_paste_right_whole(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tbS0);
+@@ -874,7 +887,8 @@ static void balance_leaf_paste_right_whole(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_paste_right(struct tree_balance *tb,
+- struct item_head *ih, const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ int n = B_NR_ITEMS(tbS0);
+@@ -896,8 +910,9 @@ static void balance_leaf_paste_right(struct tree_balance *tb,
+ }
+
+ /* shift rnum[0] items from S[0] to the right neighbor R[0] */
+-static void balance_leaf_right(struct tree_balance *tb, struct item_head *ih,
+- const char *body, int flag)
++static void balance_leaf_right(struct tree_balance *tb,
++ struct item_head * const ih,
++ const char * const body, int flag)
+ {
+ if (tb->rnum[0] <= 0)
+ return;
+@@ -911,8 +926,8 @@ static void balance_leaf_right(struct tree_balance *tb, struct item_head *ih,
+ }
+
+ static void balance_leaf_new_nodes_insert(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1003,8 +1018,8 @@ static void balance_leaf_new_nodes_insert(struct tree_balance *tb,
+
+ /* we append to directory item */
+ static void balance_leaf_new_nodes_paste_dirent(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1058,8 +1073,8 @@ static void balance_leaf_new_nodes_paste_dirent(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_new_nodes_paste_shift(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1131,8 +1146,8 @@ static void balance_leaf_new_nodes_paste_shift(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_new_nodes_paste_whole(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1184,8 +1199,8 @@ static void balance_leaf_new_nodes_paste_whole(struct tree_balance *tb,
+
+ }
+ static void balance_leaf_new_nodes_paste(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int i)
+@@ -1214,8 +1229,8 @@ static void balance_leaf_new_nodes_paste(struct tree_balance *tb,
+
+ /* Fill new nodes that appear in place of S[0] */
+ static void balance_leaf_new_nodes(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body,
++ struct item_head * const ih,
++ const char * const body,
+ struct item_head *insert_key,
+ struct buffer_head **insert_ptr,
+ int flag)
+@@ -1254,8 +1269,8 @@ static void balance_leaf_new_nodes(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_finish_node_insert(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct buffer_info bi;
+@@ -1271,8 +1286,8 @@ static void balance_leaf_finish_node_insert(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_finish_node_paste_dirent(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct item_head *pasted = item_head(tbS0, tb->item_pos);
+@@ -1305,8 +1320,8 @@ static void balance_leaf_finish_node_paste_dirent(struct tree_balance *tb,
+ }
+
+ static void balance_leaf_finish_node_paste(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body)
++ struct item_head * const ih,
++ const char * const body)
+ {
+ struct buffer_head *tbS0 = PATH_PLAST_BUFFER(tb->tb_path);
+ struct buffer_info bi;
+@@ -1349,8 +1364,8 @@ static void balance_leaf_finish_node_paste(struct tree_balance *tb,
+ * of the affected item which remains in S
+ */
+ static void balance_leaf_finish_node(struct tree_balance *tb,
+- struct item_head *ih,
+- const char *body, int flag)
++ struct item_head * const ih,
++ const char * const body, int flag)
+ {
+ /* if we must insert or append into buffer S[0] */
+ if (0 <= tb->item_pos && tb->item_pos < tb->s0num) {
+@@ -1402,7 +1417,7 @@ static int balance_leaf(struct tree_balance *tb, struct item_head *ih,
+ && is_indirect_le_ih(item_head(tbS0, tb->item_pos)))
+ tb->pos_in_item *= UNFM_P_SIZE;
+
+- balance_leaf_left(tb, ih, body, flag);
++ body += balance_leaf_left(tb, ih, body, flag);
+
+ /* tb->lnum[0] > 0 */
+ /* Calculate new item position */
+diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
+index e8870de4627e..a88b1b3e7db3 100644
+--- a/fs/reiserfs/journal.c
++++ b/fs/reiserfs/journal.c
+@@ -1947,8 +1947,6 @@ static int do_journal_release(struct reiserfs_transaction_handle *th,
+ }
+ }
+
+- /* wait for all commits to finish */
+- cancel_delayed_work(&SB_JOURNAL(sb)->j_work);
+
+ /*
+ * We must release the write lock here because
+@@ -1956,8 +1954,14 @@ static int do_journal_release(struct reiserfs_transaction_handle *th,
+ */
+ reiserfs_write_unlock(sb);
+
++ /*
++ * Cancel flushing of old commits. Note that neither of these works
++ * will be requeued because the superblock is being shut down and
++ * doesn't have MS_ACTIVE set.
++ */
+ cancel_delayed_work_sync(&REISERFS_SB(sb)->old_work);
+- flush_workqueue(REISERFS_SB(sb)->commit_wq);
++ /* wait for all commits to finish */
++ cancel_delayed_work_sync(&SB_JOURNAL(sb)->j_work);
+
+ free_journal_ram(sb);
+
+@@ -4292,9 +4296,15 @@ static int do_journal_end(struct reiserfs_transaction_handle *th, int flags)
+ if (flush) {
+ flush_commit_list(sb, jl, 1);
+ flush_journal_list(sb, jl, 1);
+- } else if (!(jl->j_state & LIST_COMMIT_PENDING))
+- queue_delayed_work(REISERFS_SB(sb)->commit_wq,
+- &journal->j_work, HZ / 10);
++ } else if (!(jl->j_state & LIST_COMMIT_PENDING)) {
++ /*
++ * Avoid queueing work when sb is being shut down. Transaction
++ * will be flushed on journal shutdown.
++ */
++ if (sb->s_flags & MS_ACTIVE)
++ queue_delayed_work(REISERFS_SB(sb)->commit_wq,
++ &journal->j_work, HZ / 10);
++ }
+
+ /*
+ * if the next transaction has any chance of wrapping, flush
+diff --git a/fs/reiserfs/lbalance.c b/fs/reiserfs/lbalance.c
+index d6744c8b24e1..3a74d15eb814 100644
+--- a/fs/reiserfs/lbalance.c
++++ b/fs/reiserfs/lbalance.c
+@@ -899,8 +899,9 @@ void leaf_delete_items(struct buffer_info *cur_bi, int last_first,
+
+ /* insert item into the leaf node in position before */
+ void leaf_insert_into_buf(struct buffer_info *bi, int before,
+- struct item_head *inserted_item_ih,
+- const char *inserted_item_body, int zeros_number)
++ struct item_head * const inserted_item_ih,
++ const char * const inserted_item_body,
++ int zeros_number)
+ {
+ struct buffer_head *bh = bi->bi_bh;
+ int nr, free_space;
+diff --git a/fs/reiserfs/reiserfs.h b/fs/reiserfs/reiserfs.h
+index bf53888c7f59..735c2c2b4536 100644
+--- a/fs/reiserfs/reiserfs.h
++++ b/fs/reiserfs/reiserfs.h
+@@ -3216,11 +3216,12 @@ int leaf_shift_right(struct tree_balance *tb, int shift_num, int shift_bytes);
+ void leaf_delete_items(struct buffer_info *cur_bi, int last_first, int first,
+ int del_num, int del_bytes);
+ void leaf_insert_into_buf(struct buffer_info *bi, int before,
+- struct item_head *inserted_item_ih,
+- const char *inserted_item_body, int zeros_number);
+-void leaf_paste_in_buffer(struct buffer_info *bi, int pasted_item_num,
+- int pos_in_item, int paste_size, const char *body,
++ struct item_head * const inserted_item_ih,
++ const char * const inserted_item_body,
+ int zeros_number);
++void leaf_paste_in_buffer(struct buffer_info *bi, int pasted_item_num,
++ int pos_in_item, int paste_size,
++ const char * const body, int zeros_number);
+ void leaf_cut_from_buffer(struct buffer_info *bi, int cut_item_num,
+ int pos_in_item, int cut_size);
+ void leaf_paste_entries(struct buffer_info *bi, int item_num, int before,
+diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
+index a392cef6acc6..5fd8f57e07fc 100644
+--- a/fs/reiserfs/super.c
++++ b/fs/reiserfs/super.c
+@@ -100,7 +100,11 @@ void reiserfs_schedule_old_flush(struct super_block *s)
+ struct reiserfs_sb_info *sbi = REISERFS_SB(s);
+ unsigned long delay;
+
+- if (s->s_flags & MS_RDONLY)
++ /*
++ * Avoid scheduling flush when sb is being shut down. It can race
++ * with journal shutdown, which frees still-queued delayed work.
++ */
++ if (s->s_flags & MS_RDONLY || !(s->s_flags & MS_ACTIVE))
+ return;
+
+ spin_lock(&sbi->old_work_lock);
+diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
+index faaf716e2080..02614349690d 100644
+--- a/fs/xfs/xfs_aops.c
++++ b/fs/xfs/xfs_aops.c
+@@ -1753,11 +1753,72 @@ xfs_vm_readpages(
+ return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
+ }
+
++/*
++ * This is basically a copy of __set_page_dirty_buffers() with one
++ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
++ * dirty, we'll never be able to clean them because we don't write buffers
++ * beyond EOF, and that means we can't invalidate pages that span EOF
++ * that have been marked dirty. Further, the dirty state can leak into
++ * the file interior if the file is extended, resulting in all sorts of
++ * bad things happening as the state does not match the underlying data.
++ *
++ * XXX: this really indicates that bufferheads in XFS need to die. Warts like
++ * this only exist because of bufferheads and how the generic code manages them.
++ */
++STATIC int
++xfs_vm_set_page_dirty(
++ struct page *page)
++{
++ struct address_space *mapping = page->mapping;
++ struct inode *inode = mapping->host;
++ loff_t end_offset;
++ loff_t offset;
++ int newly_dirty;
++
++ if (unlikely(!mapping))
++ return !TestSetPageDirty(page);
++
++ end_offset = i_size_read(inode);
++ offset = page_offset(page);
++
++ spin_lock(&mapping->private_lock);
++ if (page_has_buffers(page)) {
++ struct buffer_head *head = page_buffers(page);
++ struct buffer_head *bh = head;
++
++ do {
++ if (offset < end_offset)
++ set_buffer_dirty(bh);
++ bh = bh->b_this_page;
++ offset += 1 << inode->i_blkbits;
++ } while (bh != head);
++ }
++ newly_dirty = !TestSetPageDirty(page);
++ spin_unlock(&mapping->private_lock);
++
++ if (newly_dirty) {
++ /* sigh - __set_page_dirty() is static, so copy it here, too */
++ unsigned long flags;
++
++ spin_lock_irqsave(&mapping->tree_lock, flags);
++ if (page->mapping) { /* Race with truncate? */
++ WARN_ON_ONCE(!PageUptodate(page));
++ account_page_dirtied(page, mapping);
++ radix_tree_tag_set(&mapping->page_tree,
++ page_index(page), PAGECACHE_TAG_DIRTY);
++ }
++ spin_unlock_irqrestore(&mapping->tree_lock, flags);
++ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
++ }
++ return newly_dirty;
++}
++
+ const struct address_space_operations xfs_address_space_operations = {
+ .readpage = xfs_vm_readpage,
+ .readpages = xfs_vm_readpages,
+ .writepage = xfs_vm_writepage,
+ .writepages = xfs_vm_writepages,
++ .set_page_dirty = xfs_vm_set_page_dirty,
+ .releasepage = xfs_vm_releasepage,
+ .invalidatepage = xfs_vm_invalidatepage,
+ .write_begin = xfs_vm_write_begin,
+diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
+index 3ee0cd43edc0..c9656491d823 100644
+--- a/fs/xfs/xfs_dquot.c
++++ b/fs/xfs/xfs_dquot.c
+@@ -974,7 +974,8 @@ xfs_qm_dqflush(
+ * Get the buffer containing the on-disk dquot
+ */
+ error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
+- mp->m_quotainfo->qi_dqchunklen, 0, &bp, NULL);
++ mp->m_quotainfo->qi_dqchunklen, 0, &bp,
++ &xfs_dquot_buf_ops);
+ if (error)
+ goto out_unlock;
+
+diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
+index 1f66779d7a46..055459999660 100644
+--- a/fs/xfs/xfs_file.c
++++ b/fs/xfs/xfs_file.c
+@@ -295,7 +295,16 @@ xfs_file_read_iter(
+ xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
+ return ret;
+ }
+- truncate_pagecache_range(VFS_I(ip), pos, -1);
++
++ /*
++ * Invalidate whole pages. This can return an error if
++ * we fail to invalidate a page, but this should never
++ * happen on XFS. Warn if it does fail.
++ */
++ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
++ pos >> PAGE_CACHE_SHIFT, -1);
++ WARN_ON_ONCE(ret);
++ ret = 0;
+ }
+ xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
+ }
+@@ -634,7 +643,15 @@ xfs_file_dio_aio_write(
+ pos, -1);
+ if (ret)
+ goto out;
+- truncate_pagecache_range(VFS_I(ip), pos, -1);
++ /*
++ * Invalidate whole pages. This can return an error if
++ * we fail to invalidate a page, but this should never
++ * happen on XFS. Warn if it does fail.
++ */
++ ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
++ pos >> PAGE_CACHE_SHIFT, -1);
++ WARN_ON_ONCE(ret);
++ ret = 0;
+ }
+
+ /*
+diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
+index 981af0f6504b..8c962890fe17 100644
+--- a/fs/xfs/xfs_log_recover.c
++++ b/fs/xfs/xfs_log_recover.c
+@@ -2125,6 +2125,17 @@ xlog_recover_validate_buf_type(
+ __uint16_t magic16;
+ __uint16_t magicda;
+
++ /*
++ * We can only do post recovery validation on items on CRC enabled
++ * filesystems as we need to know when the buffer was written to be able
++ * to determine if we should have replayed the item. If we replay old
++ * metadata over a newer buffer, then it will enter a temporarily
++ * inconsistent state resulting in verification failures. Hence for now
++ * just avoid the verification stage for non-crc filesystems
++ */
++ if (!xfs_sb_version_hascrc(&mp->m_sb))
++ return;
++
+ magic32 = be32_to_cpu(*(__be32 *)bp->b_addr);
+ magic16 = be16_to_cpu(*(__be16*)bp->b_addr);
+ magicda = be16_to_cpu(info->magic);
+@@ -2162,8 +2173,6 @@ xlog_recover_validate_buf_type(
+ bp->b_ops = &xfs_agf_buf_ops;
+ break;
+ case XFS_BLFT_AGFL_BUF:
+- if (!xfs_sb_version_hascrc(&mp->m_sb))
+- break;
+ if (magic32 != XFS_AGFL_MAGIC) {
+ xfs_warn(mp, "Bad AGFL block magic!");
+ ASSERT(0);
+@@ -2196,10 +2205,6 @@ xlog_recover_validate_buf_type(
+ #endif
+ break;
+ case XFS_BLFT_DINO_BUF:
+- /*
+- * we get here with inode allocation buffers, not buffers that
+- * track unlinked list changes.
+- */
+ if (magic16 != XFS_DINODE_MAGIC) {
+ xfs_warn(mp, "Bad INODE block magic!");
+ ASSERT(0);
+@@ -2279,8 +2284,6 @@ xlog_recover_validate_buf_type(
+ bp->b_ops = &xfs_attr3_leaf_buf_ops;
+ break;
+ case XFS_BLFT_ATTR_RMT_BUF:
+- if (!xfs_sb_version_hascrc(&mp->m_sb))
+- break;
+ if (magic32 != XFS_ATTR3_RMT_MAGIC) {
+ xfs_warn(mp, "Bad attr remote magic!");
+ ASSERT(0);
+@@ -2387,16 +2390,7 @@ xlog_recover_do_reg_buffer(
+ /* Shouldn't be any more regions */
+ ASSERT(i == item->ri_total);
+
+- /*
+- * We can only do post recovery validation on items on CRC enabled
+- * fielsystems as we need to know when the buffer was written to be able
+- * to determine if we should have replayed the item. If we replay old
+- * metadata over a newer buffer, then it will enter a temporarily
+- * inconsistent state resulting in verification failures. Hence for now
+- * just avoid the verification stage for non-crc filesystems
+- */
+- if (xfs_sb_version_hascrc(&mp->m_sb))
+- xlog_recover_validate_buf_type(mp, bp, buf_f);
++ xlog_recover_validate_buf_type(mp, bp, buf_f);
+ }
+
+ /*
+@@ -2504,12 +2498,29 @@ xlog_recover_buffer_pass2(
+ }
+
+ /*
+- * recover the buffer only if we get an LSN from it and it's less than
++ * Recover the buffer only if we get an LSN from it and it's less than
+ * the lsn of the transaction we are replaying.
++ *
++ * Note that we have to be extremely careful of readahead here.
++ * Readahead does not attach verifiers to the buffers, so if we don't
++ * actually do any replay after readahead because the LSN we found
++ * in the buffer is more recent than the current transaction, then we
++ * need to attach the verifier directly. Failure to do so can lead to
++ * future recovery actions (e.g. EFI and unlinked list recovery)
++ * operating on the buffers without the verifier attached. This
++ * can lead to blocks on disk having the correct content but a stale
++ * CRC.
++ *
++ * It is safe to assume these clean buffers are currently up to date.
++ * If the buffer is dirtied by a later transaction being replayed, then
++ * the verifier will be reset to match whatever recover turns that
++ * buffer into.
+ */
+ lsn = xlog_recover_get_buf_lsn(mp, bp);
+- if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0)
++ if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
++ xlog_recover_validate_buf_type(mp, bp, buf_f);
+ goto out_release;
++ }
+
+ if (buf_f->blf_flags & XFS_BLF_INODE_BUF) {
+ error = xlog_recover_do_inode_buffer(mp, item, bp, buf_f);
+diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
+index 6d26759c779a..6c51e2f97c0a 100644
+--- a/fs/xfs/xfs_qm.c
++++ b/fs/xfs/xfs_qm.c
+@@ -1005,6 +1005,12 @@ xfs_qm_dqiter_bufs(
+ if (error)
+ break;
+
++ /*
++ * A corrupt buffer might not have a verifier attached, so
++ * make sure we have the correct one attached before writeback
++ * occurs.
++ */
++ bp->b_ops = &xfs_dquot_buf_ops;
+ xfs_qm_reset_dqcounts(mp, bp, firstid, type);
+ xfs_buf_delwri_queue(bp, buffer_list);
+ xfs_buf_relse(bp);
+@@ -1090,7 +1096,7 @@ xfs_qm_dqiterate(
+ xfs_buf_readahead(mp->m_ddev_targp,
+ XFS_FSB_TO_DADDR(mp, rablkno),
+ mp->m_quotainfo->qi_dqchunklen,
+- NULL);
++ &xfs_dquot_buf_ops);
+ rablkno++;
+ }
+ }
+diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
+index b5714580801a..0826a4407e8e 100644
+--- a/include/acpi/acpi_bus.h
++++ b/include/acpi/acpi_bus.h
+@@ -246,7 +246,6 @@ struct acpi_device_pnp {
+ acpi_device_name device_name; /* Driver-determined */
+ acpi_device_class device_class; /* " */
+ union acpi_object *str_obj; /* unicode string for _STR method */
+- unsigned long sun; /* _SUN */
+ };
+
+ #define acpi_device_bid(d) ((d)->pnp.bus_id)
+diff --git a/include/linux/capability.h b/include/linux/capability.h
+index 84b13ad67c1c..aa93e5ef594c 100644
+--- a/include/linux/capability.h
++++ b/include/linux/capability.h
+@@ -78,8 +78,11 @@ extern const kernel_cap_t __cap_init_eff_set;
+ # error Fix up hand-coded capability macro initializers
+ #else /* HAND-CODED capability initializers */
+
++#define CAP_LAST_U32 ((_KERNEL_CAPABILITY_U32S) - 1)
++#define CAP_LAST_U32_VALID_MASK (CAP_TO_MASK(CAP_LAST_CAP + 1) -1)
++
+ # define CAP_EMPTY_SET ((kernel_cap_t){{ 0, 0 }})
+-# define CAP_FULL_SET ((kernel_cap_t){{ ~0, ~0 }})
++# define CAP_FULL_SET ((kernel_cap_t){{ ~0, CAP_LAST_U32_VALID_MASK }})
+ # define CAP_FS_SET ((kernel_cap_t){{ CAP_FS_MASK_B0 \
+ | CAP_TO_MASK(CAP_LINUX_IMMUTABLE), \
+ CAP_FS_MASK_B1 } })
+diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
+index fc7718c6bd3e..d2be2526ec48 100644
+--- a/include/linux/fsnotify_backend.h
++++ b/include/linux/fsnotify_backend.h
+@@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group,
+ struct fsnotify_event *event,
+ int (*merge)(struct list_head *,
+ struct fsnotify_event *));
++/* Remove passed event from groups notification queue */
++extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event);
+ /* true if the group notification queue is empty */
+ extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
+ /* return, but do not dequeue the first event on the notification queue */
+diff --git a/include/linux/mount.h b/include/linux/mount.h
+index 839bac270904..b0c1e6574e7f 100644
+--- a/include/linux/mount.h
++++ b/include/linux/mount.h
+@@ -42,13 +42,20 @@ struct mnt_namespace;
+ * flag, consider how it interacts with shared mounts.
+ */
+ #define MNT_SHARED_MASK (MNT_UNBINDABLE)
+-#define MNT_PROPAGATION_MASK (MNT_SHARED | MNT_UNBINDABLE)
++#define MNT_USER_SETTABLE_MASK (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
++ | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
++ | MNT_READONLY)
++#define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
+
+ #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
+ MNT_DOOMED | MNT_SYNC_UMOUNT | MNT_MARKED)
+
+ #define MNT_INTERNAL 0x4000
+
++#define MNT_LOCK_ATIME 0x040000
++#define MNT_LOCK_NOEXEC 0x080000
++#define MNT_LOCK_NOSUID 0x100000
++#define MNT_LOCK_NODEV 0x200000
+ #define MNT_LOCK_READONLY 0x400000
+ #define MNT_LOCKED 0x800000
+ #define MNT_DOOMED 0x1000000
+diff --git a/include/linux/tpm.h b/include/linux/tpm.h
+index fff1d0976f80..8350c538b486 100644
+--- a/include/linux/tpm.h
++++ b/include/linux/tpm.h
+@@ -39,6 +39,9 @@ struct tpm_class_ops {
+ int (*send) (struct tpm_chip *chip, u8 *buf, size_t len);
+ void (*cancel) (struct tpm_chip *chip);
+ u8 (*status) (struct tpm_chip *chip);
++ bool (*update_timeouts)(struct tpm_chip *chip,
++ unsigned long *timeout_cap);
++
+ };
+
+ #if defined(CONFIG_TCG_TPM) || defined(CONFIG_TCG_TPM_MODULE)
+diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
+index 27ab31017f09..758bc9f0f399 100644
+--- a/include/scsi/scsi_device.h
++++ b/include/scsi/scsi_device.h
+@@ -155,6 +155,7 @@ struct scsi_device {
+ unsigned skip_ms_page_8:1; /* do not use MODE SENSE page 0x08 */
+ unsigned skip_ms_page_3f:1; /* do not use MODE SENSE page 0x3f */
+ unsigned skip_vpd_pages:1; /* do not read VPD pages */
++ unsigned try_vpd_pages:1; /* attempt to read VPD pages */
+ unsigned use_192_bytes_for_3f:1; /* ask for 192 bytes from page 0x3f */
+ unsigned no_start_on_add:1; /* do not issue start on add */
+ unsigned allow_restart:1; /* issue START_UNIT in error handler */
+diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h
+index 447d2d7466fc..183eaab7c380 100644
+--- a/include/scsi/scsi_devinfo.h
++++ b/include/scsi/scsi_devinfo.h
+@@ -32,4 +32,9 @@
+ #define BLIST_ATTACH_PQ3 0x1000000 /* Scan: Attach to PQ3 devices */
+ #define BLIST_NO_DIF 0x2000000 /* Disable T10 PI (DIF) */
+ #define BLIST_SKIP_VPD_PAGES 0x4000000 /* Ignore SBC-3 VPD pages */
++#define BLIST_SCSI3LUN 0x8000000 /* Scan more than 256 LUNs
++ for sequential scan */
++#define BLIST_TRY_VPD_PAGES 0x10000000 /* Attempt to read VPD pages */
++#define BLIST_NO_RSOC 0x20000000 /* don't try to issue RSOC */
++
+ #endif
+diff --git a/include/uapi/rdma/rdma_user_cm.h b/include/uapi/rdma/rdma_user_cm.h
+index 99b80abf360a..3066718eb120 100644
+--- a/include/uapi/rdma/rdma_user_cm.h
++++ b/include/uapi/rdma/rdma_user_cm.h
+@@ -34,6 +34,7 @@
+ #define RDMA_USER_CM_H
+
+ #include <linux/types.h>
++#include <linux/socket.h>
+ #include <linux/in6.h>
+ #include <rdma/ib_user_verbs.h>
+ #include <rdma/ib_user_sa.h>
+diff --git a/kernel/audit.c b/kernel/audit.c
+index 3ef2e0e797e8..ba2ff5a5c600 100644
+--- a/kernel/audit.c
++++ b/kernel/audit.c
+@@ -1677,7 +1677,7 @@ void audit_log_cap(struct audit_buffer *ab, char *prefix, kernel_cap_t *cap)
+ audit_log_format(ab, " %s=", prefix);
+ CAP_FOR_EACH_U32(i) {
+ audit_log_format(ab, "%08x",
+- cap->cap[(_KERNEL_CAPABILITY_U32S-1) - i]);
++ cap->cap[CAP_LAST_U32 - i]);
+ }
+ }
+
+diff --git a/kernel/capability.c b/kernel/capability.c
+index a5cf13c018ce..989f5bfc57dc 100644
+--- a/kernel/capability.c
++++ b/kernel/capability.c
+@@ -258,6 +258,10 @@ SYSCALL_DEFINE2(capset, cap_user_header_t, header, const cap_user_data_t, data)
+ i++;
+ }
+
++ effective.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++ permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++ inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++
+ new = prepare_creds();
+ if (!new)
+ return -ENOMEM;
+diff --git a/kernel/smp.c b/kernel/smp.c
+index 80c33f8de14f..86e59ee8dd76 100644
+--- a/kernel/smp.c
++++ b/kernel/smp.c
+@@ -661,7 +661,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
+ if (cond_func(cpu, info)) {
+ ret = smp_call_function_single(cpu, func,
+ info, wait);
+- WARN_ON_ONCE(!ret);
++ WARN_ON_ONCE(ret);
+ }
+ preempt_enable();
+ }
+diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
+index ff7027199a9a..b95381ebdd5e 100644
+--- a/kernel/trace/ring_buffer.c
++++ b/kernel/trace/ring_buffer.c
+@@ -1984,7 +1984,7 @@ rb_add_time_stamp(struct ring_buffer_event *event, u64 delta)
+
+ /**
+ * rb_update_event - update event type and data
+- * @event: the even to update
++ * @event: the event to update
+ * @type: the type of event
+ * @length: the size of the event field in the ring buffer
+ *
+@@ -3357,21 +3357,16 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
+ struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer;
+
+ /* Iterator usage is expected to have record disabled */
+- if (list_empty(&cpu_buffer->reader_page->list)) {
+- iter->head_page = rb_set_head_page(cpu_buffer);
+- if (unlikely(!iter->head_page))
+- return;
+- iter->head = iter->head_page->read;
+- } else {
+- iter->head_page = cpu_buffer->reader_page;
+- iter->head = cpu_buffer->reader_page->read;
+- }
++ iter->head_page = cpu_buffer->reader_page;
++ iter->head = cpu_buffer->reader_page->read;
++
++ iter->cache_reader_page = iter->head_page;
++ iter->cache_read = iter->head;
++
+ if (iter->head)
+ iter->read_stamp = cpu_buffer->read_stamp;
+ else
+ iter->read_stamp = iter->head_page->page->time_stamp;
+- iter->cache_reader_page = cpu_buffer->reader_page;
+- iter->cache_read = cpu_buffer->read;
+ }
+
+ /**
+@@ -3764,12 +3759,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts)
+ return NULL;
+
+ /*
+- * We repeat when a time extend is encountered.
+- * Since the time extend is always attached to a data event,
+- * we should never loop more than once.
+- * (We never hit the following condition more than twice).
++ * We repeat when a time extend is encountered or we hit
++ * the end of the page. Since the time extend is always attached
++ * to a data event, we should never loop more than three times.
++ * Once for going to next page, once on time extend, and
++ * finally once to get the event.
++ * (We never hit the following condition more than thrice).
+ */
+- if (RB_WARN_ON(cpu_buffer, ++nr_loops > 2))
++ if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3))
+ return NULL;
+
+ if (rb_per_cpu_empty(cpu_buffer))
+diff --git a/lib/assoc_array.c b/lib/assoc_array.c
+index c0b1007011e1..2404d03e251a 100644
+--- a/lib/assoc_array.c
++++ b/lib/assoc_array.c
+@@ -1723,11 +1723,13 @@ ascend_old_tree:
+ shortcut = assoc_array_ptr_to_shortcut(ptr);
+ slot = shortcut->parent_slot;
+ cursor = shortcut->back_pointer;
++ if (!cursor)
++ goto gc_complete;
+ } else {
+ slot = node->parent_slot;
+ cursor = ptr;
+ }
+- BUG_ON(!ptr);
++ BUG_ON(!cursor);
+ node = assoc_array_ptr_to_node(cursor);
+ slot++;
+ goto continue_node;
+@@ -1735,7 +1737,7 @@ ascend_old_tree:
+ gc_complete:
+ edit->set[0].to = new_root;
+ assoc_array_apply_edit(edit);
+- edit->array->nr_leaves_on_tree = nr_leaves_on_tree;
++ array->nr_leaves_on_tree = nr_leaves_on_tree;
+ return 0;
+
+ enomem:
+diff --git a/mm/filemap.c b/mm/filemap.c
+index 900edfaf6df5..8163e0439493 100644
+--- a/mm/filemap.c
++++ b/mm/filemap.c
+@@ -2584,7 +2584,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ * that this differs from normal direct-io semantics, which
+ * will return -EFOO even if some bytes were written.
+ */
+- if (unlikely(status < 0) && !written) {
++ if (unlikely(status < 0)) {
+ err = status;
+ goto out;
+ }
+diff --git a/mm/hugetlb.c b/mm/hugetlb.c
+index 7a0a73d2fcff..7ae54449f252 100644
+--- a/mm/hugetlb.c
++++ b/mm/hugetlb.c
+@@ -1089,6 +1089,9 @@ void dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
+ unsigned long pfn;
+ struct hstate *h;
+
++ if (!hugepages_supported())
++ return;
++
+ /* Set scan step to minimum hugepage size */
+ for_each_hstate(h)
+ if (order > huge_page_order(h))
+diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
+index 640c54ec1bd2..3787be160c2b 100644
+--- a/net/bluetooth/hci_event.c
++++ b/net/bluetooth/hci_event.c
+@@ -3538,18 +3538,14 @@ static void hci_io_capa_request_evt(struct hci_dev *hdev, struct sk_buff *skb)
+
+ /* If we are initiators, there is no remote information yet */
+ if (conn->remote_auth == 0xff) {
+- cp.authentication = conn->auth_type;
+-
+ /* Request MITM protection if our IO caps allow it
+ * except for the no-bonding case.
+- * conn->auth_type is not updated here since
+- * that might cause the user confirmation to be
+- * rejected in case the remote doesn't have the
+- * IO capabilities for MITM.
+ */
+ if (conn->io_capability != HCI_IO_NO_INPUT_OUTPUT &&
+- cp.authentication != HCI_AT_NO_BONDING)
+- cp.authentication |= 0x01;
++ conn->auth_type != HCI_AT_NO_BONDING)
++ conn->auth_type |= 0x01;
++
++ cp.authentication = conn->auth_type;
+ } else {
+ conn->auth_type = hci_get_auth_req(conn);
+ cp.authentication = conn->auth_type;
+@@ -3621,9 +3617,12 @@ static void hci_user_confirm_request_evt(struct hci_dev *hdev,
+ rem_mitm = (conn->remote_auth & 0x01);
+
+ /* If we require MITM but the remote device can't provide that
+- * (it has NoInputNoOutput) then reject the confirmation request
++ * (it has NoInputNoOutput) then reject the confirmation
++ * request. We check the security level here since it doesn't
++ * necessarily match conn->auth_type.
+ */
+- if (loc_mitm && conn->remote_cap == HCI_IO_NO_INPUT_OUTPUT) {
++ if (conn->pending_sec_level > BT_SECURITY_MEDIUM &&
++ conn->remote_cap == HCI_IO_NO_INPUT_OUTPUT) {
+ BT_DBG("Rejecting request: remote device can't provide MITM");
+ hci_send_cmd(hdev, HCI_OP_USER_CONFIRM_NEG_REPLY,
+ sizeof(ev->bdaddr), &ev->bdaddr);
+@@ -4177,8 +4176,8 @@ static void process_adv_report(struct hci_dev *hdev, u8 type, bdaddr_t *bdaddr,
+ * sending a merged device found event.
+ */
+ mgmt_device_found(hdev, &d->last_adv_addr, LE_LINK,
+- d->last_adv_addr_type, NULL, rssi, 0, 1, data, len,
+- d->last_adv_data, d->last_adv_data_len);
++ d->last_adv_addr_type, NULL, rssi, 0, 1,
++ d->last_adv_data, d->last_adv_data_len, data, len);
+ clear_pending_adv_report(hdev);
+ }
+
+diff --git a/net/bluetooth/l2cap_sock.c b/net/bluetooth/l2cap_sock.c
+index e1378693cc90..d0fd8b04f2e6 100644
+--- a/net/bluetooth/l2cap_sock.c
++++ b/net/bluetooth/l2cap_sock.c
+@@ -1111,7 +1111,8 @@ static int l2cap_sock_shutdown(struct socket *sock, int how)
+ l2cap_chan_close(chan, 0);
+ lock_sock(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime)
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING))
+ err = bt_sock_wait_state(sk, BT_CLOSED,
+ sk->sk_lingertime);
+ }
+diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c
+index 754b6fe4f742..881f7de412cc 100644
+--- a/net/bluetooth/rfcomm/core.c
++++ b/net/bluetooth/rfcomm/core.c
+@@ -1909,10 +1909,13 @@ static struct rfcomm_session *rfcomm_process_rx(struct rfcomm_session *s)
+ /* Get data directly from socket receive queue without copying it. */
+ while ((skb = skb_dequeue(&sk->sk_receive_queue))) {
+ skb_orphan(skb);
+- if (!skb_linearize(skb))
++ if (!skb_linearize(skb)) {
+ s = rfcomm_recv_frame(s, skb);
+- else
++ if (!s)
++ break;
++ } else {
+ kfree_skb(skb);
++ }
+ }
+
+ if (s && (sk->sk_state == BT_CLOSED))
+diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
+index c603a5eb4720..8bbbb5ec468c 100644
+--- a/net/bluetooth/rfcomm/sock.c
++++ b/net/bluetooth/rfcomm/sock.c
+@@ -918,7 +918,8 @@ static int rfcomm_sock_shutdown(struct socket *sock, int how)
+ sk->sk_shutdown = SHUTDOWN_MASK;
+ __rfcomm_sock_close(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime)
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING))
+ err = bt_sock_wait_state(sk, BT_CLOSED, sk->sk_lingertime);
+ }
+ release_sock(sk);
+diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
+index c06dbd3938e8..dbbbc0292bd0 100644
+--- a/net/bluetooth/sco.c
++++ b/net/bluetooth/sco.c
+@@ -909,7 +909,8 @@ static int sco_sock_shutdown(struct socket *sock, int how)
+ sco_sock_clear_timer(sk);
+ __sco_sock_close(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime)
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING))
+ err = bt_sock_wait_state(sk, BT_CLOSED,
+ sk->sk_lingertime);
+ }
+@@ -929,7 +930,8 @@ static int sco_sock_release(struct socket *sock)
+
+ sco_sock_close(sk);
+
+- if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime) {
++ if (sock_flag(sk, SOCK_LINGER) && sk->sk_lingertime &&
++ !(current->flags & PF_EXITING)) {
+ lock_sock(sk);
+ err = bt_sock_wait_state(sk, BT_CLOSED, sk->sk_lingertime);
+ release_sock(sk);
+diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
+index 96238ba95f2b..de6662b14e1f 100644
+--- a/net/ceph/auth_x.c
++++ b/net/ceph/auth_x.c
+@@ -13,8 +13,6 @@
+ #include "auth_x.h"
+ #include "auth_x_protocol.h"
+
+-#define TEMP_TICKET_BUF_LEN 256
+-
+ static void ceph_x_validate_tickets(struct ceph_auth_client *ac, int *pneed);
+
+ static int ceph_x_is_authenticated(struct ceph_auth_client *ac)
+@@ -64,7 +62,7 @@ static int ceph_x_encrypt(struct ceph_crypto_key *secret,
+ }
+
+ static int ceph_x_decrypt(struct ceph_crypto_key *secret,
+- void **p, void *end, void *obuf, size_t olen)
++ void **p, void *end, void **obuf, size_t olen)
+ {
+ struct ceph_x_encrypt_header head;
+ size_t head_len = sizeof(head);
+@@ -75,8 +73,14 @@ static int ceph_x_decrypt(struct ceph_crypto_key *secret,
+ return -EINVAL;
+
+ dout("ceph_x_decrypt len %d\n", len);
+- ret = ceph_decrypt2(secret, &head, &head_len, obuf, &olen,
+- *p, len);
++ if (*obuf == NULL) {
++ *obuf = kmalloc(len, GFP_NOFS);
++ if (!*obuf)
++ return -ENOMEM;
++ olen = len;
++ }
++
++ ret = ceph_decrypt2(secret, &head, &head_len, *obuf, &olen, *p, len);
+ if (ret)
+ return ret;
+ if (head.struct_v != 1 || le64_to_cpu(head.magic) != CEPHX_ENC_MAGIC)
+@@ -129,139 +133,120 @@ static void remove_ticket_handler(struct ceph_auth_client *ac,
+ kfree(th);
+ }
+
+-static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
+- struct ceph_crypto_key *secret,
+- void *buf, void *end)
++static int process_one_ticket(struct ceph_auth_client *ac,
++ struct ceph_crypto_key *secret,
++ void **p, void *end)
+ {
+ struct ceph_x_info *xi = ac->private;
+- int num;
+- void *p = buf;
++ int type;
++ u8 tkt_struct_v, blob_struct_v;
++ struct ceph_x_ticket_handler *th;
++ void *dbuf = NULL;
++ void *dp, *dend;
++ int dlen;
++ char is_enc;
++ struct timespec validity;
++ struct ceph_crypto_key old_key;
++ void *ticket_buf = NULL;
++ void *tp, *tpend;
++ struct ceph_timespec new_validity;
++ struct ceph_crypto_key new_session_key;
++ struct ceph_buffer *new_ticket_blob;
++ unsigned long new_expires, new_renew_after;
++ u64 new_secret_id;
+ int ret;
+- char *dbuf;
+- char *ticket_buf;
+- u8 reply_struct_v;
+
+- dbuf = kmalloc(TEMP_TICKET_BUF_LEN, GFP_NOFS);
+- if (!dbuf)
+- return -ENOMEM;
++ ceph_decode_need(p, end, sizeof(u32) + 1, bad);
+
+- ret = -ENOMEM;
+- ticket_buf = kmalloc(TEMP_TICKET_BUF_LEN, GFP_NOFS);
+- if (!ticket_buf)
+- goto out_dbuf;
++ type = ceph_decode_32(p);
++ dout(" ticket type %d %s\n", type, ceph_entity_type_name(type));
+
+- ceph_decode_need(&p, end, 1 + sizeof(u32), bad);
+- reply_struct_v = ceph_decode_8(&p);
+- if (reply_struct_v != 1)
++ tkt_struct_v = ceph_decode_8(p);
++ if (tkt_struct_v != 1)
+ goto bad;
+- num = ceph_decode_32(&p);
+- dout("%d tickets\n", num);
+- while (num--) {
+- int type;
+- u8 tkt_struct_v, blob_struct_v;
+- struct ceph_x_ticket_handler *th;
+- void *dp, *dend;
+- int dlen;
+- char is_enc;
+- struct timespec validity;
+- struct ceph_crypto_key old_key;
+- void *tp, *tpend;
+- struct ceph_timespec new_validity;
+- struct ceph_crypto_key new_session_key;
+- struct ceph_buffer *new_ticket_blob;
+- unsigned long new_expires, new_renew_after;
+- u64 new_secret_id;
+-
+- ceph_decode_need(&p, end, sizeof(u32) + 1, bad);
+-
+- type = ceph_decode_32(&p);
+- dout(" ticket type %d %s\n", type, ceph_entity_type_name(type));
+-
+- tkt_struct_v = ceph_decode_8(&p);
+- if (tkt_struct_v != 1)
+- goto bad;
+-
+- th = get_ticket_handler(ac, type);
+- if (IS_ERR(th)) {
+- ret = PTR_ERR(th);
+- goto out;
+- }
+
+- /* blob for me */
+- dlen = ceph_x_decrypt(secret, &p, end, dbuf,
+- TEMP_TICKET_BUF_LEN);
+- if (dlen <= 0) {
+- ret = dlen;
+- goto out;
+- }
+- dout(" decrypted %d bytes\n", dlen);
+- dend = dbuf + dlen;
+- dp = dbuf;
++ th = get_ticket_handler(ac, type);
++ if (IS_ERR(th)) {
++ ret = PTR_ERR(th);
++ goto out;
++ }
+
+- tkt_struct_v = ceph_decode_8(&dp);
+- if (tkt_struct_v != 1)
+- goto bad;
++ /* blob for me */
++ dlen = ceph_x_decrypt(secret, p, end, &dbuf, 0);
++ if (dlen <= 0) {
++ ret = dlen;
++ goto out;
++ }
++ dout(" decrypted %d bytes\n", dlen);
++ dp = dbuf;
++ dend = dp + dlen;
+
+- memcpy(&old_key, &th->session_key, sizeof(old_key));
+- ret = ceph_crypto_key_decode(&new_session_key, &dp, dend);
+- if (ret)
+- goto out;
++ tkt_struct_v = ceph_decode_8(&dp);
++ if (tkt_struct_v != 1)
++ goto bad;
+
+- ceph_decode_copy(&dp, &new_validity, sizeof(new_validity));
+- ceph_decode_timespec(&validity, &new_validity);
+- new_expires = get_seconds() + validity.tv_sec;
+- new_renew_after = new_expires - (validity.tv_sec / 4);
+- dout(" expires=%lu renew_after=%lu\n", new_expires,
+- new_renew_after);
++ memcpy(&old_key, &th->session_key, sizeof(old_key));
++ ret = ceph_crypto_key_decode(&new_session_key, &dp, dend);
++ if (ret)
++ goto out;
+
+- /* ticket blob for service */
+- ceph_decode_8_safe(&p, end, is_enc, bad);
+- tp = ticket_buf;
+- if (is_enc) {
+- /* encrypted */
+- dout(" encrypted ticket\n");
+- dlen = ceph_x_decrypt(&old_key, &p, end, ticket_buf,
+- TEMP_TICKET_BUF_LEN);
+- if (dlen < 0) {
+- ret = dlen;
+- goto out;
+- }
+- dlen = ceph_decode_32(&tp);
+- } else {
+- /* unencrypted */
+- ceph_decode_32_safe(&p, end, dlen, bad);
+- ceph_decode_need(&p, end, dlen, bad);
+- ceph_decode_copy(&p, ticket_buf, dlen);
++ ceph_decode_copy(&dp, &new_validity, sizeof(new_validity));
++ ceph_decode_timespec(&validity, &new_validity);
++ new_expires = get_seconds() + validity.tv_sec;
++ new_renew_after = new_expires - (validity.tv_sec / 4);
++ dout(" expires=%lu renew_after=%lu\n", new_expires,
++ new_renew_after);
++
++ /* ticket blob for service */
++ ceph_decode_8_safe(p, end, is_enc, bad);
++ if (is_enc) {
++ /* encrypted */
++ dout(" encrypted ticket\n");
++ dlen = ceph_x_decrypt(&old_key, p, end, &ticket_buf, 0);
++ if (dlen < 0) {
++ ret = dlen;
++ goto out;
+ }
+- tpend = tp + dlen;
+- dout(" ticket blob is %d bytes\n", dlen);
+- ceph_decode_need(&tp, tpend, 1 + sizeof(u64), bad);
+- blob_struct_v = ceph_decode_8(&tp);
+- new_secret_id = ceph_decode_64(&tp);
+- ret = ceph_decode_buffer(&new_ticket_blob, &tp, tpend);
+- if (ret)
++ tp = ticket_buf;
++ dlen = ceph_decode_32(&tp);
++ } else {
++ /* unencrypted */
++ ceph_decode_32_safe(p, end, dlen, bad);
++ ticket_buf = kmalloc(dlen, GFP_NOFS);
++ if (!ticket_buf) {
++ ret = -ENOMEM;
+ goto out;
+-
+- /* all is well, update our ticket */
+- ceph_crypto_key_destroy(&th->session_key);
+- if (th->ticket_blob)
+- ceph_buffer_put(th->ticket_blob);
+- th->session_key = new_session_key;
+- th->ticket_blob = new_ticket_blob;
+- th->validity = new_validity;
+- th->secret_id = new_secret_id;
+- th->expires = new_expires;
+- th->renew_after = new_renew_after;
+- dout(" got ticket service %d (%s) secret_id %lld len %d\n",
+- type, ceph_entity_type_name(type), th->secret_id,
+- (int)th->ticket_blob->vec.iov_len);
+- xi->have_keys |= th->service;
++ }
++ tp = ticket_buf;
++ ceph_decode_need(p, end, dlen, bad);
++ ceph_decode_copy(p, ticket_buf, dlen);
+ }
++ tpend = tp + dlen;
++ dout(" ticket blob is %d bytes\n", dlen);
++ ceph_decode_need(&tp, tpend, 1 + sizeof(u64), bad);
++ blob_struct_v = ceph_decode_8(&tp);
++ new_secret_id = ceph_decode_64(&tp);
++ ret = ceph_decode_buffer(&new_ticket_blob, &tp, tpend);
++ if (ret)
++ goto out;
++
++ /* all is well, update our ticket */
++ ceph_crypto_key_destroy(&th->session_key);
++ if (th->ticket_blob)
++ ceph_buffer_put(th->ticket_blob);
++ th->session_key = new_session_key;
++ th->ticket_blob = new_ticket_blob;
++ th->validity = new_validity;
++ th->secret_id = new_secret_id;
++ th->expires = new_expires;
++ th->renew_after = new_renew_after;
++ dout(" got ticket service %d (%s) secret_id %lld len %d\n",
++ type, ceph_entity_type_name(type), th->secret_id,
++ (int)th->ticket_blob->vec.iov_len);
++ xi->have_keys |= th->service;
+
+- ret = 0;
+ out:
+ kfree(ticket_buf);
+-out_dbuf:
+ kfree(dbuf);
+ return ret;
+
+@@ -270,6 +255,34 @@ bad:
+ goto out;
+ }
+
++static int ceph_x_proc_ticket_reply(struct ceph_auth_client *ac,
++ struct ceph_crypto_key *secret,
++ void *buf, void *end)
++{
++ void *p = buf;
++ u8 reply_struct_v;
++ u32 num;
++ int ret;
++
++ ceph_decode_8_safe(&p, end, reply_struct_v, bad);
++ if (reply_struct_v != 1)
++ return -EINVAL;
++
++ ceph_decode_32_safe(&p, end, num, bad);
++ dout("%d tickets\n", num);
++
++ while (num--) {
++ ret = process_one_ticket(ac, secret, &p, end);
++ if (ret)
++ return ret;
++ }
++
++ return 0;
++
++bad:
++ return -EINVAL;
++}
++
+ static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
+ struct ceph_x_ticket_handler *th,
+ struct ceph_x_authorizer *au)
+@@ -583,13 +596,14 @@ static int ceph_x_verify_authorizer_reply(struct ceph_auth_client *ac,
+ struct ceph_x_ticket_handler *th;
+ int ret = 0;
+ struct ceph_x_authorize_reply reply;
++ void *preply = &reply;
+ void *p = au->reply_buf;
+ void *end = p + sizeof(au->reply_buf);
+
+ th = get_ticket_handler(ac, au->service);
+ if (IS_ERR(th))
+ return PTR_ERR(th);
+- ret = ceph_x_decrypt(&th->session_key, &p, end, &reply, sizeof(reply));
++ ret = ceph_x_decrypt(&th->session_key, &p, end, &preply, sizeof(reply));
+ if (ret < 0)
+ return ret;
+ if (ret != sizeof(reply))
+diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
+index 1948d592aa54..3d9ddc2842e1 100644
+--- a/net/ceph/messenger.c
++++ b/net/ceph/messenger.c
+@@ -900,7 +900,7 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor,
+ BUG_ON(page_count > (int)USHRT_MAX);
+ cursor->page_count = (unsigned short)page_count;
+ BUG_ON(length > SIZE_MAX - cursor->page_offset);
+- cursor->last_piece = (size_t)cursor->page_offset + length <= PAGE_SIZE;
++ cursor->last_piece = cursor->page_offset + cursor->resid <= PAGE_SIZE;
+ }
+
+ static struct page *
+diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
+index 067d3af2eaf6..61fcfc304f68 100644
+--- a/net/ceph/mon_client.c
++++ b/net/ceph/mon_client.c
+@@ -1181,7 +1181,15 @@ static struct ceph_msg *mon_alloc_msg(struct ceph_connection *con,
+ if (!m) {
+ pr_info("alloc_msg unknown type %d\n", type);
+ *skip = 1;
++ } else if (front_len > m->front_alloc_len) {
++ pr_warning("mon_alloc_msg front %d > prealloc %d (%u#%llu)\n",
++ front_len, m->front_alloc_len,
++ (unsigned int)con->peer_name.type,
++ le64_to_cpu(con->peer_name.num));
++ ceph_msg_put(m);
++ m = ceph_msg_new(type, front_len, GFP_NOFS, false);
+ }
++
+ return m;
+ }
+
+diff --git a/security/commoncap.c b/security/commoncap.c
+index b9d613e0ef14..963dc5981661 100644
+--- a/security/commoncap.c
++++ b/security/commoncap.c
+@@ -421,6 +421,9 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
+ cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
+ }
+
++ cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++ cpu_caps->inheritable.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
++
+ return 0;
+ }
+
+diff --git a/sound/soc/blackfin/bf5xx-i2s-pcm.c b/sound/soc/blackfin/bf5xx-i2s-pcm.c
+index a3881c4381c9..bcf591373a7a 100644
+--- a/sound/soc/blackfin/bf5xx-i2s-pcm.c
++++ b/sound/soc/blackfin/bf5xx-i2s-pcm.c
+@@ -290,19 +290,19 @@ static int bf5xx_pcm_silence(struct snd_pcm_substream *substream,
+ unsigned int sample_size = runtime->sample_bits / 8;
+ void *buf = runtime->dma_area;
+ struct bf5xx_i2s_pcm_data *dma_data;
+- unsigned int offset, size;
++ unsigned int offset, samples;
+
+ dma_data = snd_soc_dai_get_dma_data(rtd->cpu_dai, substream);
+
+ if (dma_data->tdm_mode) {
+ offset = pos * 8 * sample_size;
+- size = count * 8 * sample_size;
++ samples = count * 8;
+ } else {
+ offset = frames_to_bytes(runtime, pos);
+- size = frames_to_bytes(runtime, count);
++ samples = count * runtime->channels;
+ }
+
+- snd_pcm_format_set_silence(runtime->format, buf + offset, size);
++ snd_pcm_format_set_silence(runtime->format, buf + offset, samples);
+
+ return 0;
+ }
+diff --git a/sound/soc/codecs/adau1701.c b/sound/soc/codecs/adau1701.c
+index d71c59cf7bdd..370b742117ef 100644
+--- a/sound/soc/codecs/adau1701.c
++++ b/sound/soc/codecs/adau1701.c
+@@ -230,8 +230,10 @@ static int adau1701_reg_read(void *context, unsigned int reg,
+
+ *value = 0;
+
+- for (i = 0; i < size; i++)
+- *value |= recv_buf[i] << (i * 8);
++ for (i = 0; i < size; i++) {
++ *value <<= 8;
++ *value |= recv_buf[i];
++ }
+
+ return 0;
+ }
+diff --git a/sound/soc/codecs/max98090.c b/sound/soc/codecs/max98090.c
+index f5fccc7a8e89..d97f1ce7ff7d 100644
+--- a/sound/soc/codecs/max98090.c
++++ b/sound/soc/codecs/max98090.c
+@@ -2284,7 +2284,7 @@ static int max98090_probe(struct snd_soc_codec *codec)
+ /* Register for interrupts */
+ dev_dbg(codec->dev, "irq = %d\n", max98090->irq);
+
+- ret = request_threaded_irq(max98090->irq, NULL,
++ ret = devm_request_threaded_irq(codec->dev, max98090->irq, NULL,
+ max98090_interrupt, IRQF_TRIGGER_FALLING | IRQF_ONESHOT,
+ "max98090_interrupt", codec);
+ if (ret < 0) {
+diff --git a/sound/soc/codecs/rt5640.c b/sound/soc/codecs/rt5640.c
+index de80e89b5fd8..70679cf14c83 100644
+--- a/sound/soc/codecs/rt5640.c
++++ b/sound/soc/codecs/rt5640.c
+@@ -2059,6 +2059,7 @@ static struct snd_soc_codec_driver soc_codec_dev_rt5640 = {
+ static const struct regmap_config rt5640_regmap = {
+ .reg_bits = 8,
+ .val_bits = 16,
++ .use_single_rw = true,
+
+ .max_register = RT5640_VENDOR_ID2 + 1 + (ARRAY_SIZE(rt5640_ranges) *
+ RT5640_PR_SPACING),
+diff --git a/sound/soc/codecs/tlv320aic31xx.c b/sound/soc/codecs/tlv320aic31xx.c
+index 23419109ecac..1cdae8ccc61b 100644
+--- a/sound/soc/codecs/tlv320aic31xx.c
++++ b/sound/soc/codecs/tlv320aic31xx.c
+@@ -1178,7 +1178,7 @@ static void aic31xx_pdata_from_of(struct aic31xx_priv *aic31xx)
+ }
+ #endif /* CONFIG_OF */
+
+-static void aic31xx_device_init(struct aic31xx_priv *aic31xx)
++static int aic31xx_device_init(struct aic31xx_priv *aic31xx)
+ {
+ int ret, i;
+
+@@ -1197,7 +1197,7 @@ static void aic31xx_device_init(struct aic31xx_priv *aic31xx)
+ "aic31xx-reset-pin");
+ if (ret < 0) {
+ dev_err(aic31xx->dev, "not able to acquire gpio\n");
+- return;
++ return ret;
+ }
+ }
+
+@@ -1210,6 +1210,7 @@ static void aic31xx_device_init(struct aic31xx_priv *aic31xx)
+ if (ret != 0)
+ dev_err(aic31xx->dev, "Failed to request supplies: %d\n", ret);
+
++ return ret;
+ }
+
+ static int aic31xx_i2c_probe(struct i2c_client *i2c,
+@@ -1239,7 +1240,9 @@ static int aic31xx_i2c_probe(struct i2c_client *i2c,
+
+ aic31xx->pdata.codec_type = id->driver_data;
+
+- aic31xx_device_init(aic31xx);
++ ret = aic31xx_device_init(aic31xx);
++ if (ret)
++ return ret;
+
+ return snd_soc_register_codec(&i2c->dev, &soc_codec_driver_aic31xx,
+ aic31xx_dai_driver,
+diff --git a/sound/soc/codecs/wm8994.c b/sound/soc/codecs/wm8994.c
+index 247b39013fba..9719d3ca8e47 100644
+--- a/sound/soc/codecs/wm8994.c
++++ b/sound/soc/codecs/wm8994.c
+@@ -3505,6 +3505,7 @@ static irqreturn_t wm8994_mic_irq(int irq, void *data)
+ return IRQ_HANDLED;
+ }
+
++/* Should be called with accdet_lock held */
+ static void wm1811_micd_stop(struct snd_soc_codec *codec)
+ {
+ struct wm8994_priv *wm8994 = snd_soc_codec_get_drvdata(codec);
+@@ -3512,14 +3513,10 @@ static void wm1811_micd_stop(struct snd_soc_codec *codec)
+ if (!wm8994->jackdet)
+ return;
+
+- mutex_lock(&wm8994->accdet_lock);
+-
+ snd_soc_update_bits(codec, WM8958_MIC_DETECT_1, WM8958_MICD_ENA, 0);
+
+ wm1811_jackdet_set_mode(codec, WM1811_JACKDET_MODE_JACK);
+
+- mutex_unlock(&wm8994->accdet_lock);
+-
+ if (wm8994->wm8994->pdata.jd_ext_cap)
+ snd_soc_dapm_disable_pin(&codec->dapm,
+ "MICBIAS2");
+@@ -3560,10 +3557,10 @@ static void wm8958_open_circuit_work(struct work_struct *work)
+ open_circuit_work.work);
+ struct device *dev = wm8994->wm8994->dev;
+
+- wm1811_micd_stop(wm8994->hubs.codec);
+-
+ mutex_lock(&wm8994->accdet_lock);
+
++ wm1811_micd_stop(wm8994->hubs.codec);
++
+ dev_dbg(dev, "Reporting open circuit\n");
+
+ wm8994->jack_mic = false;
+diff --git a/sound/soc/codecs/wm_adsp.c b/sound/soc/codecs/wm_adsp.c
+index 060027182dcb..2537725dd53f 100644
+--- a/sound/soc/codecs/wm_adsp.c
++++ b/sound/soc/codecs/wm_adsp.c
+@@ -1758,3 +1758,5 @@ int wm_adsp2_init(struct wm_adsp *adsp, bool dvfs)
+ return 0;
+ }
+ EXPORT_SYMBOL_GPL(wm_adsp2_init);
++
++MODULE_LICENSE("GPL v2");
+diff --git a/sound/soc/intel/sst-baytrail-pcm.c b/sound/soc/intel/sst-baytrail-pcm.c
+index 8eab97368ea7..599401c0c655 100644
+--- a/sound/soc/intel/sst-baytrail-pcm.c
++++ b/sound/soc/intel/sst-baytrail-pcm.c
+@@ -32,7 +32,7 @@ static const struct snd_pcm_hardware sst_byt_pcm_hardware = {
+ SNDRV_PCM_INFO_PAUSE |
+ SNDRV_PCM_INFO_RESUME,
+ .formats = SNDRV_PCM_FMTBIT_S16_LE |
+- SNDRV_PCM_FORMAT_S24_LE,
++ SNDRV_PCM_FMTBIT_S24_LE,
+ .period_bytes_min = 384,
+ .period_bytes_max = 48000,
+ .periods_min = 2,
+diff --git a/sound/soc/intel/sst-haswell-pcm.c b/sound/soc/intel/sst-haswell-pcm.c
+index 058efb17c568..61bf6da4bb02 100644
+--- a/sound/soc/intel/sst-haswell-pcm.c
++++ b/sound/soc/intel/sst-haswell-pcm.c
+@@ -80,7 +80,7 @@ static const struct snd_pcm_hardware hsw_pcm_hardware = {
+ SNDRV_PCM_INFO_PAUSE |
+ SNDRV_PCM_INFO_RESUME |
+ SNDRV_PCM_INFO_NO_PERIOD_WAKEUP,
+- .formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FORMAT_S24_LE |
++ .formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+ SNDRV_PCM_FMTBIT_S32_LE,
+ .period_bytes_min = PAGE_SIZE,
+ .period_bytes_max = (HSW_PCM_PERIODS_MAX / HSW_PCM_PERIODS_MIN) * PAGE_SIZE,
+@@ -400,7 +400,15 @@ static int hsw_pcm_hw_params(struct snd_pcm_substream *substream,
+ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 16);
+ break;
+ case SNDRV_PCM_FORMAT_S24_LE:
+- bits = SST_HSW_DEPTH_24BIT;
++ bits = SST_HSW_DEPTH_32BIT;
++ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 24);
++ break;
++ case SNDRV_PCM_FORMAT_S8:
++ bits = SST_HSW_DEPTH_8BIT;
++ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 8);
++ break;
++ case SNDRV_PCM_FORMAT_S32_LE:
++ bits = SST_HSW_DEPTH_32BIT;
+ sst_hsw_stream_set_valid(hsw, pcm_data->stream, 32);
+ break;
+ default:
+@@ -685,8 +693,9 @@ static int hsw_pcm_new(struct snd_soc_pcm_runtime *rtd)
+ }
+
+ #define HSW_FORMATS \
+- (SNDRV_PCM_FMTBIT_S20_3LE | SNDRV_PCM_FMTBIT_S16_LE |\
+- SNDRV_PCM_FMTBIT_S32_LE)
++ (SNDRV_PCM_FMTBIT_S32_LE | SNDRV_PCM_FMTBIT_S24_LE | \
++ SNDRV_PCM_FMTBIT_S20_3LE | SNDRV_PCM_FMTBIT_S16_LE |\
++ SNDRV_PCM_FMTBIT_S8)
+
+ static struct snd_soc_dai_driver hsw_dais[] = {
+ {
+@@ -696,7 +705,7 @@ static struct snd_soc_dai_driver hsw_dais[] = {
+ .channels_min = 2,
+ .channels_max = 2,
+ .rates = SNDRV_PCM_RATE_48000,
+- .formats = SNDRV_PCM_FMTBIT_S16_LE,
++ .formats = SNDRV_PCM_FMTBIT_S24_LE | SNDRV_PCM_FMTBIT_S16_LE,
+ },
+ },
+ {
+@@ -727,8 +736,8 @@ static struct snd_soc_dai_driver hsw_dais[] = {
+ .stream_name = "Loopback Capture",
+ .channels_min = 2,
+ .channels_max = 2,
+- .rates = SNDRV_PCM_RATE_8000_192000,
+- .formats = HSW_FORMATS,
++ .rates = SNDRV_PCM_RATE_48000,
++ .formats = SNDRV_PCM_FMTBIT_S24_LE | SNDRV_PCM_FMTBIT_S16_LE,
+ },
+ },
+ {
+@@ -737,8 +746,8 @@ static struct snd_soc_dai_driver hsw_dais[] = {
+ .stream_name = "Analog Capture",
+ .channels_min = 2,
+ .channels_max = 2,
+- .rates = SNDRV_PCM_RATE_8000_192000,
+- .formats = HSW_FORMATS,
++ .rates = SNDRV_PCM_RATE_48000,
++ .formats = SNDRV_PCM_FMTBIT_S24_LE | SNDRV_PCM_FMTBIT_S16_LE,
+ },
+ },
+ };
+diff --git a/sound/soc/omap/omap-twl4030.c b/sound/soc/omap/omap-twl4030.c
+index f8a6adc2d81c..4336d1831485 100644
+--- a/sound/soc/omap/omap-twl4030.c
++++ b/sound/soc/omap/omap-twl4030.c
+@@ -260,7 +260,7 @@ static struct snd_soc_dai_link omap_twl4030_dai_links[] = {
+ .stream_name = "TWL4030 Voice",
+ .cpu_dai_name = "omap-mcbsp.3",
+ .codec_dai_name = "twl4030-voice",
+- .platform_name = "omap-mcbsp.2",
++ .platform_name = "omap-mcbsp.3",
+ .codec_name = "twl4030-codec",
+ .dai_fmt = SND_SOC_DAIFMT_DSP_A | SND_SOC_DAIFMT_IB_NF |
+ SND_SOC_DAIFMT_CBM_CFM,
+diff --git a/sound/soc/pxa/pxa-ssp.c b/sound/soc/pxa/pxa-ssp.c
+index 199a8b377553..a8e097433074 100644
+--- a/sound/soc/pxa/pxa-ssp.c
++++ b/sound/soc/pxa/pxa-ssp.c
+@@ -723,7 +723,8 @@ static int pxa_ssp_probe(struct snd_soc_dai *dai)
+ ssp_handle = of_parse_phandle(dev->of_node, "port", 0);
+ if (!ssp_handle) {
+ dev_err(dev, "unable to get 'port' phandle\n");
+- return -ENODEV;
++ ret = -ENODEV;
++ goto err_priv;
+ }
+
+ priv->ssp = pxa_ssp_request_of(ssp_handle, "SoC audio");
+@@ -764,9 +765,7 @@ static int pxa_ssp_remove(struct snd_soc_dai *dai)
+ SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_64000 | \
+ SNDRV_PCM_RATE_88200 | SNDRV_PCM_RATE_96000)
+
+-#define PXA_SSP_FORMATS (SNDRV_PCM_FMTBIT_S16_LE |\
+- SNDRV_PCM_FMTBIT_S24_LE | \
+- SNDRV_PCM_FMTBIT_S32_LE)
++#define PXA_SSP_FORMATS (SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S32_LE)
+
+ static const struct snd_soc_dai_ops pxa_ssp_dai_ops = {
+ .startup = pxa_ssp_startup,
+diff --git a/sound/soc/samsung/i2s.c b/sound/soc/samsung/i2s.c
+index 2ac76fa3e742..5f9b255a8b38 100644
+--- a/sound/soc/samsung/i2s.c
++++ b/sound/soc/samsung/i2s.c
+@@ -920,11 +920,9 @@ static int i2s_suspend(struct snd_soc_dai *dai)
+ {
+ struct i2s_dai *i2s = to_info(dai);
+
+- if (dai->active) {
+- i2s->suspend_i2smod = readl(i2s->addr + I2SMOD);
+- i2s->suspend_i2scon = readl(i2s->addr + I2SCON);
+- i2s->suspend_i2spsr = readl(i2s->addr + I2SPSR);
+- }
++ i2s->suspend_i2smod = readl(i2s->addr + I2SMOD);
++ i2s->suspend_i2scon = readl(i2s->addr + I2SCON);
++ i2s->suspend_i2spsr = readl(i2s->addr + I2SPSR);
+
+ return 0;
+ }
+@@ -933,11 +931,9 @@ static int i2s_resume(struct snd_soc_dai *dai)
+ {
+ struct i2s_dai *i2s = to_info(dai);
+
+- if (dai->active) {
+- writel(i2s->suspend_i2scon, i2s->addr + I2SCON);
+- writel(i2s->suspend_i2smod, i2s->addr + I2SMOD);
+- writel(i2s->suspend_i2spsr, i2s->addr + I2SPSR);
+- }
++ writel(i2s->suspend_i2scon, i2s->addr + I2SCON);
++ writel(i2s->suspend_i2smod, i2s->addr + I2SMOD);
++ writel(i2s->suspend_i2spsr, i2s->addr + I2SPSR);
+
+ return 0;
+ }
+diff --git a/sound/soc/soc-pcm.c b/sound/soc/soc-pcm.c
+index 54d18f22a33e..4ea656770d65 100644
+--- a/sound/soc/soc-pcm.c
++++ b/sound/soc/soc-pcm.c
+@@ -2069,6 +2069,7 @@ int soc_dpcm_runtime_update(struct snd_soc_card *card)
+ dpcm_be_disconnect(fe, SNDRV_PCM_STREAM_PLAYBACK);
+ }
+
++ dpcm_path_put(&list);
+ capture:
+ /* skip if FE doesn't have capture capability */
+ if (!fe->cpu_dai->driver->capture.channels_min)
+diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
+index e66e710cc595..0a8a9db43d34 100644
+--- a/tools/testing/selftests/Makefile
++++ b/tools/testing/selftests/Makefile
+@@ -4,6 +4,7 @@ TARGETS += efivarfs
+ TARGETS += kcmp
+ TARGETS += memory-hotplug
+ TARGETS += mqueue
++TARGETS += mount
+ TARGETS += net
+ TARGETS += ptrace
+ TARGETS += timers
+diff --git a/tools/testing/selftests/mount/Makefile b/tools/testing/selftests/mount/Makefile
+new file mode 100644
+index 000000000000..337d853c2b72
+--- /dev/null
++++ b/tools/testing/selftests/mount/Makefile
+@@ -0,0 +1,17 @@
++# Makefile for mount selftests.
++
++all: unprivileged-remount-test
++
++unprivileged-remount-test: unprivileged-remount-test.c
++ gcc -Wall -O2 unprivileged-remount-test.c -o unprivileged-remount-test
++
++# Allow specific tests to be selected.
++test_unprivileged_remount: unprivileged-remount-test
++ @if [ -f /proc/self/uid_map ] ; then ./unprivileged-remount-test ; fi
++
++run_tests: all test_unprivileged_remount
++
++clean:
++ rm -f unprivileged-remount-test
++
++.PHONY: all test_unprivileged_remount
+diff --git a/tools/testing/selftests/mount/unprivileged-remount-test.c b/tools/testing/selftests/mount/unprivileged-remount-test.c
+new file mode 100644
+index 000000000000..1b3ff2fda4d0
+--- /dev/null
++++ b/tools/testing/selftests/mount/unprivileged-remount-test.c
+@@ -0,0 +1,242 @@
++#define _GNU_SOURCE
++#include <sched.h>
++#include <stdio.h>
++#include <errno.h>
++#include <string.h>
++#include <sys/types.h>
++#include <sys/mount.h>
++#include <sys/wait.h>
++#include <stdlib.h>
++#include <unistd.h>
++#include <fcntl.h>
++#include <grp.h>
++#include <stdbool.h>
++#include <stdarg.h>
++
++#ifndef CLONE_NEWNS
++# define CLONE_NEWNS 0x00020000
++#endif
++#ifndef CLONE_NEWUTS
++# define CLONE_NEWUTS 0x04000000
++#endif
++#ifndef CLONE_NEWIPC
++# define CLONE_NEWIPC 0x08000000
++#endif
++#ifndef CLONE_NEWNET
++# define CLONE_NEWNET 0x40000000
++#endif
++#ifndef CLONE_NEWUSER
++# define CLONE_NEWUSER 0x10000000
++#endif
++#ifndef CLONE_NEWPID
++# define CLONE_NEWPID 0x20000000
++#endif
++
++#ifndef MS_RELATIME
++#define MS_RELATIME (1 << 21)
++#endif
++#ifndef MS_STRICTATIME
++#define MS_STRICTATIME (1 << 24)
++#endif
++
++static void die(char *fmt, ...)
++{
++ va_list ap;
++ va_start(ap, fmt);
++ vfprintf(stderr, fmt, ap);
++ va_end(ap);
++ exit(EXIT_FAILURE);
++}
++
++static void write_file(char *filename, char *fmt, ...)
++{
++ char buf[4096];
++ int fd;
++ ssize_t written;
++ int buf_len;
++ va_list ap;
++
++ va_start(ap, fmt);
++ buf_len = vsnprintf(buf, sizeof(buf), fmt, ap);
++ va_end(ap);
++ if (buf_len < 0) {
++ die("vsnprintf failed: %s\n",
++ strerror(errno));
++ }
++ if (buf_len >= sizeof(buf)) {
++ die("vsnprintf output truncated\n");
++ }
++
++ fd = open(filename, O_WRONLY);
++ if (fd < 0) {
++ die("open of %s failed: %s\n",
++ filename, strerror(errno));
++ }
++ written = write(fd, buf, buf_len);
++ if (written != buf_len) {
++ if (written >= 0) {
++ die("short write to %s\n", filename);
++ } else {
++ die("write to %s failed: %s\n",
++ filename, strerror(errno));
++ }
++ }
++ if (close(fd) != 0) {
++ die("close of %s failed: %s\n",
++ filename, strerror(errno));
++ }
++}
++
++static void create_and_enter_userns(void)
++{
++ uid_t uid;
++ gid_t gid;
++
++ uid = getuid();
++ gid = getgid();
++
++ if (unshare(CLONE_NEWUSER) !=0) {
++ die("unshare(CLONE_NEWUSER) failed: %s\n",
++ strerror(errno));
++ }
++
++ write_file("/proc/self/uid_map", "0 %d 1", uid);
++ write_file("/proc/self/gid_map", "0 %d 1", gid);
++
++ if (setgroups(0, NULL) != 0) {
++ die("setgroups failed: %s\n",
++ strerror(errno));
++ }
++ if (setgid(0) != 0) {
++ die ("setgid(0) failed %s\n",
++ strerror(errno));
++ }
++ if (setuid(0) != 0) {
++ die("setuid(0) failed %s\n",
++ strerror(errno));
++ }
++}
++
++static
++bool test_unpriv_remount(int mount_flags, int remount_flags, int invalid_flags)
++{
++ pid_t child;
++
++ child = fork();
++ if (child == -1) {
++ die("fork failed: %s\n",
++ strerror(errno));
++ }
++ if (child != 0) { /* parent */
++ pid_t pid;
++ int status;
++ pid = waitpid(child, &status, 0);
++ if (pid == -1) {
++ die("waitpid failed: %s\n",
++ strerror(errno));
++ }
++ if (pid != child) {
++ die("waited for %d got %d\n",
++ child, pid);
++ }
++ if (!WIFEXITED(status)) {
++ die("child did not terminate cleanly\n");
++ }
++ return WEXITSTATUS(status) == EXIT_SUCCESS ? true : false;
++ }
++
++ create_and_enter_userns();
++ if (unshare(CLONE_NEWNS) != 0) {
++ die("unshare(CLONE_NEWNS) failed: %s\n",
++ strerror(errno));
++ }
++
++ if (mount("testing", "/tmp", "ramfs", mount_flags, NULL) != 0) {
++ die("mount of /tmp failed: %s\n",
++ strerror(errno));
++ }
++
++ create_and_enter_userns();
++
++ if (unshare(CLONE_NEWNS) != 0) {
++ die("unshare(CLONE_NEWNS) failed: %s\n",
++ strerror(errno));
++ }
++
++ if (mount("/tmp", "/tmp", "none",
++ MS_REMOUNT | MS_BIND | remount_flags, NULL) != 0) {
++ /* system("cat /proc/self/mounts"); */
++ die("remount of /tmp failed: %s\n",
++ strerror(errno));
++ }
++
++ if (mount("/tmp", "/tmp", "none",
++ MS_REMOUNT | MS_BIND | invalid_flags, NULL) == 0) {
++ /* system("cat /proc/self/mounts"); */
++ die("remount of /tmp with invalid flags "
++ "succeeded unexpectedly\n");
++ }
++ exit(EXIT_SUCCESS);
++}
++
++static bool test_unpriv_remount_simple(int mount_flags)
++{
++ return test_unpriv_remount(mount_flags, mount_flags, 0);
++}
++
++static bool test_unpriv_remount_atime(int mount_flags, int invalid_flags)
++{
++ return test_unpriv_remount(mount_flags, mount_flags, invalid_flags);
++}
++
++int main(int argc, char **argv)
++{
++ if (!test_unpriv_remount_simple(MS_RDONLY|MS_NODEV)) {
++ die("MS_RDONLY malfunctions\n");
++ }
++ if (!test_unpriv_remount_simple(MS_NODEV)) {
++ die("MS_NODEV malfunctions\n");
++ }
++ if (!test_unpriv_remount_simple(MS_NOSUID|MS_NODEV)) {
++ die("MS_NOSUID malfunctions\n");
++ }
++ if (!test_unpriv_remount_simple(MS_NOEXEC|MS_NODEV)) {
++ die("MS_NOEXEC malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_RELATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_STRICTATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_STRICTATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_NOATIME|MS_NODEV,
++ MS_STRICTATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_RELATIME|MS_NODIRATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_STRICTATIME|MS_NODIRATIME|MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount_atime(MS_NOATIME|MS_NODIRATIME|MS_NODEV,
++ MS_STRICTATIME|MS_NODEV))
++ {
++ die("MS_RELATIME malfunctions\n");
++ }
++ if (!test_unpriv_remount(MS_STRICTATIME|MS_NODEV, MS_NODEV,
++ MS_NOATIME|MS_NODEV))
++ {
++ die("Default atime malfunctions\n");
++ }
++ return EXIT_SUCCESS;
++}
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-22 23:37 Mike Pagano
From: Mike Pagano @ 2014-09-22 23:37 UTC
To: gentoo-commits
commit: 935e025ffecfe6c163188f4f9725352501bf0a6e
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Mon Sep 22 23:37:15 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Mon Sep 22 23:37:15 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=935e025f
Fix the UDEV auto-selection to add FHANDLE, and remove it from the systemd section. Thanks to Steven Presser. See bug #523126.
---
4567_distro-Gentoo-Kconfig.patch | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
diff --git a/4567_distro-Gentoo-Kconfig.patch b/4567_distro-Gentoo-Kconfig.patch
index 652e2a7..71dbf09 100644
--- a/4567_distro-Gentoo-Kconfig.patch
+++ b/4567_distro-Gentoo-Kconfig.patch
@@ -1,15 +1,15 @@
---- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
-+++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
+--- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
++++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
@@ -8,4 +8,6 @@ config SRCARCH
- string
- option env="SRCARCH"
-
+ string
+ option env="SRCARCH"
+
+source "distro/Kconfig"
+
source "arch/$SRCARCH/Kconfig"
---- 1969-12-31 19:00:00.000000000 -0500
-+++ b/distro/Kconfig 2014-04-02 09:57:03.539218861 -0400
-@@ -0,0 +1,108 @@
+--- /dev/null 2014-09-22 14:19:24.316977284 -0400
++++ distro/Kconfig 2014-09-22 19:30:35.670959281 -0400
+@@ -0,0 +1,109 @@
+menu "Gentoo Linux"
+
+config GENTOO_LINUX
@@ -34,6 +34,8 @@
+ select DEVTMPFS
+ select TMPFS
+
++ select FHANDLE
++
+ select MMU
+ select SHMEM
+
@@ -89,7 +91,6 @@
+ select CGROUPS
+ select EPOLL
+ select FANOTIFY
-+ select FHANDLE
+ select INOTIFY_USER
+ select NET
+ select NET_NS
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-26 19:40 Mike Pagano
From: Mike Pagano @ 2014-09-26 19:40 UTC
To: gentoo-commits
commit: d9d386b72f6c05e68b48912cc93da59331852155
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Fri Sep 26 19:40:17 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Fri Sep 26 19:40:17 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=d9d386b7
Add multipath-tcp patch. Fix distro config.
---
0000_README | 4 +
2500_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 ++++++++++++++++++++++++++
4567_distro-Gentoo-Kconfig.patch | 19 +-
3 files changed, 19243 insertions(+), 10 deletions(-)
diff --git a/0000_README b/0000_README
index 706e53e..d92e6b7 100644
--- a/0000_README
+++ b/0000_README
@@ -58,6 +58,10 @@ Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
+Patch: 2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
+From: http://multipath-tcp.org/
+Desc: Patch for simultaneous use of several IP addresses/interfaces in TCP, for better resource utilization, better throughput and smoother reaction to failures.
+
Patch: 2700_ThinkPad-30-brightness-control-fix.patch
From: Seth Forshee <seth.forshee@canonical.com>
Desc: ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads
diff --git a/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch b/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
new file mode 100644
index 0000000..3000da3
--- /dev/null
+++ b/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
@@ -0,0 +1,19230 @@
+diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
+index 768a0fb67dd6..5a46d91a8df9 100644
+--- a/drivers/infiniband/hw/cxgb4/cm.c
++++ b/drivers/infiniband/hw/cxgb4/cm.c
+@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
+ */
+ memset(&tmp_opt, 0, sizeof(tmp_opt));
+ tcp_clear_options(&tmp_opt);
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
+
+ req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
+ memset(req, 0, sizeof(*req));
+diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
+index 2faef339d8f2..d86c853ffaad 100644
+--- a/include/linux/ipv6.h
++++ b/include/linux/ipv6.h
+@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return inet_sk(__sk)->pinet6;
+ }
+
+-static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
+-{
+- struct request_sock *req = reqsk_alloc(ops);
+-
+- if (req)
+- inet_rsk(req)->pktopts = NULL;
+-
+- return req;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return (struct raw6_sock *)sk;
+@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return NULL;
+ }
+
+-static inline struct inet6_request_sock *
+- inet6_rsk(const struct request_sock *rsk)
+-{
+- return NULL;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return NULL;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..99ea4b0e3693 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
+ bool zero_okay,
+ __sum16 check)
+ {
+- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
+- skb->csum_valid = 1;
++ if (skb_csum_unnecessary(skb)) {
++ return false;
++ } else if (zero_okay && !check) {
++ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ return false;
+ }
+
+diff --git a/include/linux/tcp.h b/include/linux/tcp.h
+index a0513210798f..7bc2e078d6ca 100644
+--- a/include/linux/tcp.h
++++ b/include/linux/tcp.h
+@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
+ /* TCP Fast Open */
+ #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
+ #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
+-#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
++#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
+
+ /* TCP Fast Open Cookie as stored in memory */
+ struct tcp_fastopen_cookie {
+@@ -72,6 +72,51 @@ struct tcp_sack_block {
+ u32 end_seq;
+ };
+
++struct tcp_out_options {
++ u16 options; /* bit field of OPTION_* */
++ u8 ws; /* window scale, 0 to disable */
++ u8 num_sack_blocks;/* number of SACK blocks to include */
++ u8 hash_size; /* bytes in hash_location */
++ u16 mss; /* 0 to disable */
++ __u8 *hash_location; /* temporary pointer, overloaded */
++ __u32 tsval, tsecr; /* need to include OPTION_TS */
++ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
++#ifdef CONFIG_MPTCP
++ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
++ u8 dss_csum:1,
++ add_addr_v4:1,
++ add_addr_v6:1; /* dss-checksum required? */
++
++ union {
++ struct {
++ __u64 sender_key; /* sender's key for mptcp */
++ __u64 receiver_key; /* receiver's key for mptcp */
++ } mp_capable;
++
++ struct {
++ __u64 sender_truncated_mac;
++ __u32 sender_nonce;
++ /* random number of the sender */
++ __u32 token; /* token for mptcp */
++ u8 low_prio:1;
++ } mp_join_syns;
++ };
++
++ struct {
++ struct in_addr addr;
++ u8 addr_id;
++ } add_addr4;
++
++ struct {
++ struct in6_addr addr;
++ u8 addr_id;
++ } add_addr6;
++
++ u16 remove_addrs; /* list of address id */
++ u8 addr_id; /* address id (mp_join or add_address) */
++#endif /* CONFIG_MPTCP */
++};
++
+ /*These are used to set the sack_ok field in struct tcp_options_received */
+ #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
+ #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
+@@ -95,6 +140,9 @@ struct tcp_options_received {
+ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
+ };
+
++struct mptcp_cb;
++struct mptcp_tcp_sock;
++
+ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+ {
+ rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
+
+ struct tcp_request_sock {
+ struct inet_request_sock req;
+-#ifdef CONFIG_TCP_MD5SIG
+- /* Only used by TCP MD5 Signature so far. */
+ const struct tcp_request_sock_ops *af_specific;
+-#endif
+ struct sock *listener; /* needed for TFO */
+ u32 rcv_isn;
+ u32 snt_isn;
+@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
+ return (struct tcp_request_sock *)req;
+ }
+
++struct tcp_md5sig_key;
++
+ struct tcp_sock {
+ /* inet_connection_sock has to be the first member of tcp_sock */
+ struct inet_connection_sock inet_conn;
+@@ -326,6 +373,37 @@ struct tcp_sock {
+ * socket. Used to retransmit SYNACKs etc.
+ */
+ struct request_sock *fastopen_rsk;
++
++ /* MPTCP/TCP-specific callbacks */
++ const struct tcp_sock_ops *ops;
++
++ struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ /* We keep these flags even if CONFIG_MPTCP is not checked, because
++ * it allows checking MPTCP capability just by checking the mpc flag,
++ * rather than adding ifdefs everywhere.
++ */
++ u16 mpc:1, /* Other end is multipath capable */
++ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
++ send_mp_fclose:1,
++ request_mptcp:1, /* Did we send out an MP_CAPABLE?
++ * (this speeds up mptcp_doit() in tcp_recvmsg)
++ */
++ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
++ pf:1, /* Potentially Failed state: when this flag is set, we
++ * stop using the subflow
++ */
++ mp_killed:1, /* Killed with a tcp_done in mptcp? */
++ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
++ is_master_sk,
++ close_it:1, /* Must close socket in mptcp_data_ready? */
++ closing:1;
++ struct mptcp_tcp_sock *mptcp;
++#ifdef CONFIG_MPTCP
++ struct hlist_nulls_node tk_table;
++ u32 mptcp_loc_token;
++ u64 mptcp_loc_key;
++#endif /* CONFIG_MPTCP */
+ };
+
+ enum tsq_flags {
+@@ -337,6 +415,8 @@ enum tsq_flags {
+ TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
+ * tcp_v{4|6}_mtu_reduced()
+ */
++ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
++ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
+ };
+
+ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *tw_md5_key;
+ #endif
++ struct mptcp_tw *mptcp_tw;
+ };
+
+ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
+diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
+index 74af137304be..83f63033897a 100644
+--- a/include/net/inet6_connection_sock.h
++++ b/include/net/inet6_connection_sock.h
+@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
+
+ struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
+ const struct request_sock *req);
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize);
+
+ struct request_sock *inet6_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+diff --git a/include/net/inet_common.h b/include/net/inet_common.h
+index fe7994c48b75..780f229f46a8 100644
+--- a/include/net/inet_common.h
++++ b/include/net/inet_common.h
+@@ -1,6 +1,8 @@
+ #ifndef _INET_COMMON_H
+ #define _INET_COMMON_H
+
++#include <net/sock.h>
++
+ extern const struct proto_ops inet_stream_ops;
+ extern const struct proto_ops inet_dgram_ops;
+
+@@ -13,6 +15,8 @@ struct sock;
+ struct sockaddr;
+ struct socket;
+
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
+ int inet_release(struct socket *sock);
+ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags);
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..f62159e39839 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -30,6 +30,7 @@
+
+ struct inet_bind_bucket;
+ struct tcp_congestion_ops;
++struct tcp_options_received;
+
+ /*
+ * Pointers to address related TCP functions
+@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
+
+ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
+
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize);
++
+ struct request_sock *inet_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+ const __be16 rport,
+diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
+index b1edf17bec01..6a32d8d6b85e 100644
+--- a/include/net/inet_sock.h
++++ b/include/net/inet_sock.h
+@@ -86,10 +86,14 @@ struct inet_request_sock {
+ wscale_ok : 1,
+ ecn_ok : 1,
+ acked : 1,
+- no_srccheck: 1;
++ no_srccheck: 1,
++ mptcp_rqsk : 1,
++ saw_mpc : 1;
+ kmemcheck_bitfield_end(flags);
+- struct ip_options_rcu *opt;
+- struct sk_buff *pktopts;
++ union {
++ struct ip_options_rcu *opt;
++ struct sk_buff *pktopts;
++ };
+ u32 ir_mark;
+ };
+
+diff --git a/include/net/mptcp.h b/include/net/mptcp.h
+new file mode 100644
+index 000000000000..712780fc39e4
+--- /dev/null
++++ b/include/net/mptcp.h
+@@ -0,0 +1,1439 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_H
++#define _MPTCP_H
++
++#include <linux/inetdevice.h>
++#include <linux/ipv6.h>
++#include <linux/list.h>
++#include <linux/net.h>
++#include <linux/netpoll.h>
++#include <linux/skbuff.h>
++#include <linux/socket.h>
++#include <linux/tcp.h>
++#include <linux/kernel.h>
++
++#include <asm/byteorder.h>
++#include <asm/unaligned.h>
++#include <crypto/hash.h>
++#include <net/tcp.h>
++
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ #define ntohll(x) be64_to_cpu(x)
++ #define htonll(x) cpu_to_be64(x)
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ #define ntohll(x) (x)
++ #define htonll(x) (x)
++#endif
++
++struct mptcp_loc4 {
++ u8 loc4_id;
++ u8 low_prio:1;
++ struct in_addr addr;
++};
++
++struct mptcp_rem4 {
++ u8 rem4_id;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct mptcp_loc6 {
++ u8 loc6_id;
++ u8 low_prio:1;
++ struct in6_addr addr;
++};
++
++struct mptcp_rem6 {
++ u8 rem6_id;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_request_sock {
++ struct tcp_request_sock req;
++ /* hlist-nulls entry to the hash-table. Depending on whether this is a
++ * a new MPTCP connection or an additional subflow, the request-socket
++ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
++ */
++ struct hlist_nulls_node hash_entry;
++
++ union {
++ struct {
++ /* Only on initial subflows */
++ u64 mptcp_loc_key;
++ u64 mptcp_rem_key;
++ u32 mptcp_loc_token;
++ };
++
++ struct {
++ /* Only on additional subflows */
++ struct mptcp_cb *mptcp_mpcb;
++ u32 mptcp_rem_nonce;
++ u32 mptcp_loc_nonce;
++ u64 mptcp_hash_tmac;
++ };
++ };
++
++ u8 loc_id;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 dss_csum:1,
++ is_sub:1, /* Is this a new subflow? */
++ low_prio:1, /* Interface set to low-prio? */
++ rcv_low_prio:1;
++};
++
++struct mptcp_options_received {
++ u16 saw_mpc:1,
++ dss_csum:1,
++ drop_me:1,
++
++ is_mp_join:1,
++ join_ack:1,
++
++ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
++ * 0x2 - low-prio set for another subflow
++ */
++ low_prio:1,
++
++ saw_add_addr:2, /* Saw at least one add_addr option:
++ * 0x1: IPv4 - 0x2: IPv6
++ */
++ more_add_addr:1, /* Saw one more add-addr. */
++
++ saw_rem_addr:1, /* Saw at least one rem_addr option */
++ more_rem_addr:1, /* Saw one more rem-addr. */
++
++ mp_fail:1,
++ mp_fclose:1;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 prio_addr_id; /* Address-id in the MP_PRIO */
++
++ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
++ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
++
++ u32 data_ack;
++ u32 data_seq;
++ u16 data_len;
++
++ u32 mptcp_rem_token;/* Remote token */
++
++ /* Key inside the option (from mp_capable or fast_close) */
++ u64 mptcp_key;
++
++ u32 mptcp_recv_nonce;
++ u64 mptcp_recv_tmac;
++ u8 mptcp_recv_mac[20];
++};
++
++struct mptcp_tcp_sock {
++ struct tcp_sock *next; /* Next subflow socket */
++ struct hlist_node cb_list;
++ struct mptcp_options_received rx_opt;
++
++ /* Those three fields record the current mapping */
++ u64 map_data_seq;
++ u32 map_subseq;
++ u16 map_data_len;
++ u16 slave_sk:1,
++ fully_established:1,
++ establish_increased:1,
++ second_packet:1,
++ attached:1,
++ send_mp_fail:1,
++ include_mpc:1,
++ mapping_present:1,
++ map_data_fin:1,
++ low_prio:1, /* use this socket as backup */
++ rcv_low_prio:1, /* Peer sent low-prio option to us */
++ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
++ pre_established:1; /* State between sending 3rd ACK and
++ * receiving the fourth ack of new subflows.
++ */
++
++ /* isn: needed to translate abs to relative subflow seqnums */
++ u32 snt_isn;
++ u32 rcv_isn;
++ u8 path_index;
++ u8 loc_id;
++ u8 rem_id;
++
++#define MPTCP_SCHED_SIZE 4
++ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
++
++ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
++ * skb in the ofo-queue.
++ */
++
++ int init_rcv_wnd;
++ u32 infinite_cutoff_seq;
++ struct delayed_work work;
++ u32 mptcp_loc_nonce;
++ struct tcp_sock *tp; /* Where is my daddy? */
++ u32 last_end_data_seq;
++
++ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
++ struct timer_list mptcp_ack_timer;
++
++ /* HMAC of the third ack */
++ char sender_mac[20];
++};
++
++struct mptcp_tw {
++ struct list_head list;
++ u64 loc_key;
++ u64 rcv_nxt;
++ struct mptcp_cb __rcu *mpcb;
++ u8 meta_tw:1,
++ in_list:1;
++};
++
++#define MPTCP_PM_NAME_MAX 16
++struct mptcp_pm_ops {
++ struct list_head list;
++
++ /* Signal the creation of a new MPTCP-session. */
++ void (*new_session)(const struct sock *meta_sk);
++ void (*release_sock)(struct sock *meta_sk);
++ void (*fully_established)(struct sock *meta_sk);
++ void (*new_remote_address)(struct sock *meta_sk);
++ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio);
++ void (*addr_signal)(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts, struct sk_buff *skb);
++ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id);
++ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
++ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
++ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
++
++ char name[MPTCP_PM_NAME_MAX];
++ struct module *owner;
++};
++
++#define MPTCP_SCHED_NAME_MAX 16
++struct mptcp_sched_ops {
++ struct list_head list;
++
++ struct sock * (*get_subflow)(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test);
++ struct sk_buff * (*next_segment)(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit);
++ void (*init)(struct sock *sk);
++
++ char name[MPTCP_SCHED_NAME_MAX];
++ struct module *owner;
++};
++
++struct mptcp_cb {
++ /* list of sockets in this multipath connection */
++ struct tcp_sock *connection_list;
++ /* list of sockets that need a call to release_cb */
++ struct hlist_head callback_list;
++
++ /* High-order bits of 64-bit sequence numbers */
++ u32 snd_high_order[2];
++ u32 rcv_high_order[2];
++
++ u16 send_infinite_mapping:1,
++ in_time_wait:1,
++ list_rcvd:1, /* XXX TO REMOVE */
++ addr_signal:1, /* Path-manager wants us to call addr_signal */
++ dss_csum:1,
++ server_side:1,
++ infinite_mapping_rcv:1,
++ infinite_mapping_snd:1,
++ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
++ passive_close:1,
++ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
++ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
++
++ /* socket count in this connection */
++ u8 cnt_subflows;
++ u8 cnt_established;
++
++ struct mptcp_sched_ops *sched_ops;
++
++ struct sk_buff_head reinject_queue;
++ /* First cache-line boundary is here minus 8 bytes. But from the
++ * reinject-queue only the next and prev pointers are regularly
++ * accessed. Thus, the whole data-path is on a single cache-line.
++ */
++
++ u64 csum_cutoff_seq;
++
++ /***** Start of fields, used for connection closure */
++ spinlock_t tw_lock;
++ unsigned char mptw_state;
++ u8 dfin_path_index;
++
++ struct list_head tw_list;
++
++ /***** Start of fields, used for subflow establishment and closure */
++ atomic_t mpcb_refcnt;
++
++ /* Mutex needed, because otherwise mptcp_close will complain that the
++ * socket is owned by the user.
++ * E.g., mptcp_sub_close_wq is taking the meta-lock.
++ */
++ struct mutex mpcb_mutex;
++
++ /***** Start of fields, used for subflow establishment */
++ struct sock *meta_sk;
++
++ /* Master socket, also part of the connection_list, this
++ * socket is the one that the application sees.
++ */
++ struct sock *master_sk;
++
++ __u64 mptcp_loc_key;
++ __u64 mptcp_rem_key;
++ __u32 mptcp_loc_token;
++ __u32 mptcp_rem_token;
++
++#define MPTCP_PM_SIZE 608
++ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
++ struct mptcp_pm_ops *pm_ops;
++
++ u32 path_index_bits;
++ /* Next pi to pick up in case a new path becomes available */
++ u8 next_path_index;
++
++ /* Original snd/rcvbuf of the initial subflow.
++ * Used for the new subflows on the server-side to allow correct
++ * autotuning
++ */
++ int orig_sk_rcvbuf;
++ int orig_sk_sndbuf;
++ u32 orig_window_clamp;
++
++ /* Timer for retransmitting SYN/ACK+MP_JOIN */
++ struct timer_list synack_timer;
++};
++
++#define MPTCP_SUB_CAPABLE 0
++#define MPTCP_SUB_LEN_CAPABLE_SYN 12
++#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_CAPABLE_ACK 20
++#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
++
++#define MPTCP_SUB_JOIN 1
++#define MPTCP_SUB_LEN_JOIN_SYN 12
++#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_JOIN_SYNACK 16
++#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
++#define MPTCP_SUB_LEN_JOIN_ACK 24
++#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
++
++#define MPTCP_SUB_DSS 2
++#define MPTCP_SUB_LEN_DSS 4
++#define MPTCP_SUB_LEN_DSS_ALIGN 4
++
++/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
++ * as they are part of the DSS-option.
++ * To get the total length, just add the different options together.
++ */
++#define MPTCP_SUB_LEN_SEQ 10
++#define MPTCP_SUB_LEN_SEQ_CSUM 12
++#define MPTCP_SUB_LEN_SEQ_ALIGN 12
++
++#define MPTCP_SUB_LEN_SEQ_64 14
++#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
++#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
++
++#define MPTCP_SUB_LEN_ACK 4
++#define MPTCP_SUB_LEN_ACK_ALIGN 4
++
++#define MPTCP_SUB_LEN_ACK_64 8
++#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
++
++/* This is the "default" option-length we will send out most often.
++ * MPTCP DSS-header
++ * 32-bit data sequence number
++ * 32-bit data ack
++ *
++ * It is necessary to calculate the effective MSS we will be using when
++ * sending data.
++ */
++#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
++ MPTCP_SUB_LEN_SEQ_ALIGN + \
++ MPTCP_SUB_LEN_ACK_ALIGN)
++
++#define MPTCP_SUB_ADD_ADDR 3
++#define MPTCP_SUB_LEN_ADD_ADDR4 8
++#define MPTCP_SUB_LEN_ADD_ADDR6 20
++#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
++#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
++
++#define MPTCP_SUB_REMOVE_ADDR 4
++#define MPTCP_SUB_LEN_REMOVE_ADDR 4
++
++#define MPTCP_SUB_PRIO 5
++#define MPTCP_SUB_LEN_PRIO 3
++#define MPTCP_SUB_LEN_PRIO_ADDR 4
++#define MPTCP_SUB_LEN_PRIO_ALIGN 4
++
++#define MPTCP_SUB_FAIL 6
++#define MPTCP_SUB_LEN_FAIL 12
++#define MPTCP_SUB_LEN_FAIL_ALIGN 12
++
++#define MPTCP_SUB_FCLOSE 7
++#define MPTCP_SUB_LEN_FCLOSE 12
++#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
++
++
++#define OPTION_MPTCP (1 << 5)
++
++#ifdef CONFIG_MPTCP
++
++/* Used for checking if the mptcp initialization has been successful */
++extern bool mptcp_init_failed;
++
++/* MPTCP options */
++#define OPTION_TYPE_SYN (1 << 0)
++#define OPTION_TYPE_SYNACK (1 << 1)
++#define OPTION_TYPE_ACK (1 << 2)
++#define OPTION_MP_CAPABLE (1 << 3)
++#define OPTION_DATA_ACK (1 << 4)
++#define OPTION_ADD_ADDR (1 << 5)
++#define OPTION_MP_JOIN (1 << 6)
++#define OPTION_MP_FAIL (1 << 7)
++#define OPTION_MP_FCLOSE (1 << 8)
++#define OPTION_REMOVE_ADDR (1 << 9)
++#define OPTION_MP_PRIO (1 << 10)
++
++/* MPTCP flags: both TX and RX */
++#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
++#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
++#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
++/* MPTCP flags: RX only */
++#define MPTCPHDR_ACK 0x08
++#define MPTCPHDR_SEQ64_SET 0x10 /* Did we received a 64-bit seq number? */
++#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
++#define MPTCPHDR_DSS_CSUM 0x40
++#define MPTCPHDR_JOIN 0x80
++/* MPTCP flags: TX only */
++#define MPTCPHDR_INF 0x08
++
++struct mptcp_option {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_capable {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++ __u8 h:1,
++ rsv:5,
++ b:1,
++ a:1;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++ __u8 a:1,
++ b:1,
++ rsv:5,
++ h:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 sender_key;
++ __u64 receiver_key;
++} __attribute__((__packed__));
++
++struct mp_join {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ u32 token;
++ u32 nonce;
++ } syn;
++ struct {
++ __u64 mac;
++ u32 nonce;
++ } synack;
++ struct {
++ __u8 mac[20];
++ } ack;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_dss {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ A:1,
++ a:1,
++ M:1,
++ m:1,
++ F:1,
++ rsv2:3;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:3,
++ F:1,
++ m:1,
++ M:1,
++ a:1,
++ A:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_add_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ipver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ipver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ struct in_addr addr;
++ __be16 port;
++ } v4;
++ struct {
++ struct in6_addr addr;
++ __be16 port;
++ } v6;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_remove_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 rsv:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ /* list of addr_id */
++ __u8 addrs_id;
++};
++
++struct mp_fail {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __be64 data_seq;
++} __attribute__((__packed__));
++
++struct mp_fclose {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 key;
++} __attribute__((__packed__));
++
++struct mp_prio {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++} __attribute__((__packed__));
++
++static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
++{
++ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
++}
++
++#define MPTCP_APP 2
++
++extern int sysctl_mptcp_enabled;
++extern int sysctl_mptcp_checksum;
++extern int sysctl_mptcp_debug;
++extern int sysctl_mptcp_syn_retries;
++
++extern struct workqueue_struct *mptcp_wq;
++
++#define mptcp_debug(fmt, args...) \
++ do { \
++ if (unlikely(sysctl_mptcp_debug)) \
++ pr_err(__FILE__ ": " fmt, ##args); \
++ } while (0)
++
++/* Iterates over all subflows */
++#define mptcp_for_each_tp(mpcb, tp) \
++ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
++
++#define mptcp_for_each_sk(mpcb, sk) \
++ for ((sk) = (struct sock *)(mpcb)->connection_list; \
++ sk; \
++ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
++
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
++ for (__sk = (struct sock *)(__mpcb)->connection_list, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
++ __sk; \
++ __sk = __temp, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
++
++/* Iterates over all bit set to 1 in a bitset */
++#define mptcp_for_each_bit_set(b, i) \
++ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
++
++#define mptcp_for_each_bit_unset(b, i) \
++ mptcp_for_each_bit_set(~b, i)
++
++extern struct lock_class_key meta_key;
++extern struct lock_class_key meta_slock_key;
++extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
++
++/* This is needed to ensure that two subsequent key/nonce-generation result in
++ * different keys/nonces if the IPs and ports are the same.
++ */
++extern u32 mptcp_seed;
++
++#define MPTCP_HASH_SIZE 1024
++
++extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* Lock, protecting the two hash-tables that hold the token. Namely,
++ * mptcp_reqsk_tk_htb and tk_hashtable
++ */
++extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++/* Request-sockets can be hashed in the tk_htb for collision-detection or in
++ * the regular htb for join-connections. We need to define different NULLS
++ * values so that we can correctly detect a request-socket that has been
++ * recycled. See also c25eb3bfb9729.
++ */
++#define MPTCP_REQSK_NULLS_BASE (1U << 29)
++
++
++void mptcp_data_ready(struct sock *sk);
++void mptcp_write_space(struct sock *sk);
++
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk);
++void mptcp_ofo_queue(struct sock *meta_sk);
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags);
++void mptcp_del_sock(struct sock *sk);
++void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
++void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
++void mptcp_update_sndbuf(const struct tcp_sock *tp);
++void mptcp_send_fin(struct sock *meta_sk);
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
++bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt);
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size);
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb);
++void mptcp_close(struct sock *meta_sk, long timeout);
++int mptcp_doit(struct sock *sk);
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev);
++struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt);
++u32 __mptcp_select_window(struct sock *sk);
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++unsigned int mptcp_current_mss(struct sock *meta_sk);
++int mptcp_select_size(const struct sock *meta_sk, bool sg);
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out);
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
++void mptcp_fin(struct sock *meta_sk);
++void mptcp_retransmit_timer(struct sock *meta_sk);
++int mptcp_write_wakeup(struct sock *meta_sk);
++void mptcp_sub_close_wq(struct work_struct *work);
++void mptcp_sub_close(struct sock *sk, unsigned long delay);
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
++void mptcp_fallback_meta_sk(struct sock *meta_sk);
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_ack_handler(unsigned long);
++int mptcp_check_rtt(const struct tcp_sock *tp, int time);
++int mptcp_check_snd_buf(const struct tcp_sock *tp);
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb);
++void __init mptcp_init(void);
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
++void mptcp_destroy_sock(struct sock *sk);
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt);
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed);
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
++void mptcp_time_wait(struct sock *sk, int state, int timeo);
++void mptcp_disconnect(struct sock *sk);
++bool mptcp_should_expand_sndbuf(const struct sock *sk);
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_tsq_flags(struct sock *sk);
++void mptcp_tsq_sub_deferred(struct sock *meta_sk);
++struct mp_join *mptcp_find_join(const struct sk_buff *skb);
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
++void mptcp_hash_remove(struct tcp_sock *meta_tp);
++struct sock *mptcp_hash_find(const struct net *net, const u32 token);
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net);
++void mptcp_reqsk_destructor(struct request_sock *req);
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++int mptcp_check_req(struct sk_buff *skb, struct net *net);
++void mptcp_connect_init(struct sock *sk);
++void mptcp_sub_force_close(struct sock *sk);
++int mptcp_sub_len_remove_addr_align(u16 bitfield);
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb);
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
++void mptcp_init_congestion_control(struct sock *sk);
++
++/* MPTCP-path-manager registration/initialization functions */
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_init_path_manager(struct mptcp_cb *mpcb);
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
++void mptcp_fallback_default(struct mptcp_cb *mpcb);
++void mptcp_get_default_path_manager(char *name);
++int mptcp_set_default_path_manager(const char *name);
++extern struct mptcp_pm_ops mptcp_pm_default;
++
++/* MPTCP-scheduler registration/initialization functions */
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_init_scheduler(struct mptcp_cb *mpcb);
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
++void mptcp_get_default_scheduler(char *name);
++int mptcp_set_default_scheduler(const char *name);
++extern struct mptcp_sched_ops mptcp_sched_default;
++
++static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
++ unsigned long len)
++{
++ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
++ jiffies + len);
++}
++
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
++{
++ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
++}
++
++static inline bool is_mptcp_enabled(const struct sock *sk)
++{
++ if (!sysctl_mptcp_enabled || mptcp_init_failed)
++ return false;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return false;
++
++ return true;
++}
++
++static inline int mptcp_pi_to_flag(int pi)
++{
++ return 1 << (pi - 1);
++}
++
++static inline
++struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
++{
++ return (struct mptcp_request_sock *)req;
++}
++
++static inline
++struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
++{
++ return (struct request_sock *)req;
++}
++
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ struct sock *sk_it;
++
++ if (tcp_sk(sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
++ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
++ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
++ return false;
++ }
++
++ return true;
++}
++
++static inline void mptcp_push_pending_frames(struct sock *meta_sk)
++{
++ /* We check packets out and send-head here. TCP only checks the
++ * send-head. But, MPTCP also checks packets_out, as this is an
++ * indication that we might want to do opportunistic reinjection.
++ */
++ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
++ struct tcp_sock *tp = tcp_sk(meta_sk);
++
++ /* We don't care about the MSS, because it will be set in
++ * mptcp_write_xmit.
++ */
++ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
++ }
++}
++
++static inline void mptcp_send_reset(struct sock *sk)
++{
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++ mptcp_sub_force_close(sk);
++}
++
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
++}
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
++}
++
++/* Is it a data-fin while in infinite mapping mode?
++ * In infinite mode, a subflow-fin is in fact a data-fin.
++ */
++static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
++ const struct tcp_sock *tp)
++{
++ return mptcp_is_data_fin(skb) ||
++ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
++}
++
++static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
++{
++ u64 data_seq_high = (u32)(data_seq >> 32);
++
++ if (mpcb->rcv_high_order[0] == data_seq_high)
++ return 0;
++ else if (mpcb->rcv_high_order[1] == data_seq_high)
++ return MPTCPHDR_SEQ64_INDEX;
++ else
++ return MPTCPHDR_SEQ64_OFO;
++}
++
++/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
++ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
++ */
++static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
++ u32 *data_seq,
++ struct mptcp_cb *mpcb)
++{
++ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
++
++ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ if (mpcb)
++ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
++
++ *data_seq = (u32)data_seq64;
++ ptr++;
++ } else {
++ *data_seq = get_unaligned_be32(ptr);
++ }
++
++ return ptr;
++}
++
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return tcp_sk(sk)->meta_sk;
++}
++
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return tcp_sk(tp->meta_sk);
++}
++
++static inline int is_meta_tp(const struct tcp_sock *tp)
++{
++ return tp->mpcb && mptcp_meta_tp(tp) == tp;
++}
++
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
++ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
++}
++
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
++}
++
++static inline void mptcp_hash_request_remove(struct request_sock *req)
++{
++ int in_softirq = 0;
++
++ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
++ return;
++
++ if (in_softirq()) {
++ spin_lock(&mptcp_reqsk_hlock);
++ in_softirq = 1;
++ } else {
++ spin_lock_bh(&mptcp_reqsk_hlock);
++ }
++
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++
++ if (in_softirq)
++ spin_unlock(&mptcp_reqsk_hlock);
++ else
++ spin_unlock_bh(&mptcp_reqsk_hlock);
++}
++
++static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
++{
++ mopt->saw_mpc = 0;
++ mopt->dss_csum = 0;
++ mopt->drop_me = 0;
++
++ mopt->is_mp_join = 0;
++ mopt->join_ack = 0;
++
++ mopt->saw_low_prio = 0;
++ mopt->low_prio = 0;
++
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline void mptcp_reset_mopt(struct tcp_sock *tp)
++{
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ mopt->saw_low_prio = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->join_ack = 0;
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
++ const struct mptcp_cb *mpcb)
++{
++ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
++ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
++}
++
++static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
++ u32 data_seq_32)
++{
++ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
++}
++
++static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
++{
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_nxt);
++}
++
++static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
++{
++ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
++ }
++}
++
++static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
++ u32 old_rcv_nxt)
++{
++ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
++ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
++ }
++}
++
++static inline int mptcp_sk_can_send(const struct sock *sk)
++{
++ return tcp_passive_fastopen(sk) ||
++ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
++ !tcp_sk(sk)->mptcp->pre_established);
++}
++
++static inline int mptcp_sk_can_recv(const struct sock *sk)
++{
++ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
++}
++
++static inline int mptcp_sk_can_send_ack(const struct sock *sk)
++{
++ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
++ TCPF_CLOSE | TCPF_LISTEN)) &&
++ !tcp_sk(sk)->mptcp->pre_established;
++}
++
++/* Only support GSO if all subflows supports it */
++static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!sk_can_gso(sk))
++ return false;
++ }
++ return true;
++}
++
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!(sk->sk_route_caps & NETIF_F_SG))
++ return false;
++ }
++ return true;
++}
++
++static inline void mptcp_set_rto(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *sk_it;
++ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
++ __u32 max_rto = 0;
++
++ /* We are in recovery-phase on the MPTCP-level. Do not update the
++ * RTO, because this would kill exponential backoff.
++ */
++ if (micsk->icsk_retransmits)
++ return;
++
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send(sk_it) &&
++ inet_csk(sk_it)->icsk_rto > max_rto)
++ max_rto = inet_csk(sk_it)->icsk_rto;
++ }
++ if (max_rto) {
++ micsk->icsk_rto = max_rto << 1;
++
++ /* A successful rto-measurement - reset backoff counter */
++ micsk->icsk_backoff = 0;
++ }
++}
++
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return sysctl_mptcp_syn_retries;
++}
++
++static inline void mptcp_sub_close_passive(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
++
++ /* Only close, if the app did a send-shutdown (passive close), and we
++ * received the data-ack of the data-fin.
++ */
++ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
++ mptcp_sub_close(sk, 0);
++}
++
++static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If data has been acknowledged on the meta-level, fully_established
++ * will have been set before and thus we will not fall back to infinite
++ * mapping.
++ */
++ if (likely(tp->mptcp->fully_established))
++ return false;
++
++ if (!(flag & MPTCP_FLAG_DATA_ACKED))
++ return false;
++
++ /* Don't fallback twice ;) */
++ if (tp->mpcb->infinite_mapping_snd)
++ return false;
++
++ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
++ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
++ __builtin_return_address(0));
++ if (!is_master_tp(tp))
++ return true;
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++
++ return false;
++}
++
++/* Find the first index whose bit in the bit-field == 0 */
++static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
++{
++ u8 base = mpcb->next_path_index;
++ int i;
++
++ /* Start at 1, because 0 is reserved for the meta-sk */
++ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
++ if (i + base < 1)
++ continue;
++ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ i += base;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
++ if (i >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ if (i < 1)
++ continue;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++
++ return 0;
++}
++
++static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
++{
++ return sk->sk_family == AF_INET6 &&
++ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
++}
++
++/* TCP and MPTCP mpc flag-depending functions */
++u16 mptcp_select_window(struct sock *sk);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_tcp_set_rto(struct sock *sk);
++
++/* TCP and MPTCP flag-depending functions */
++bool mptcp_prune_ofo_queue(struct sock *sk);
++
++#else /* CONFIG_MPTCP */
++#define mptcp_debug(fmt, args...) \
++ do { \
++ } while (0)
++
++/* Without MPTCP, we just do one iteration
++ * over the only socket available. This assumes that
++ * the sk/tp arg is the socket in that case.
++ */
++#define mptcp_for_each_sk(mpcb, sk)
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return NULL;
++}
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return NULL;
++}
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return 0;
++}
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
++static inline void mptcp_del_sock(const struct sock *sk) {}
++static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
++static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
++static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
++static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
++ const struct sock *sk) {}
++static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
++static inline void mptcp_set_rto(const struct sock *sk) {}
++static inline void mptcp_send_fin(const struct sock *meta_sk) {}
++static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_syn_options(const struct sock *sk,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++static inline void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++
++static inline void mptcp_established_options(struct sock *sk,
++ struct sk_buff *skb,
++ struct tcp_out_options *opts,
++ unsigned *size) {}
++static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb) {}
++static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
++static inline int mptcp_doit(struct sock *sk)
++{
++ return 0;
++}
++static inline int mptcp_check_req_fastopen(struct sock *child,
++ struct request_sock *req)
++{
++ return 1;
++}
++static inline int mptcp_check_req_master(const struct sock *sk,
++ const struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ return 1;
++}
++static inline struct sock *mptcp_check_req_child(struct sock *sk,
++ struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ return NULL;
++}
++static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ return 0;
++}
++static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ return 0;
++}
++static inline void mptcp_sub_close_passive(struct sock *sk) {}
++static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
++{
++ return false;
++}
++static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
++static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ return 0;
++}
++static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return 0;
++}
++static inline void mptcp_send_reset(const struct sock *sk) {}
++static inline int mptcp_handle_options(struct sock *sk,
++ const struct tcphdr *th,
++ struct sk_buff *skb)
++{
++ return 0;
++}
++static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
++static inline void __init mptcp_init(void) {}
++static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ return 0;
++}
++static inline bool mptcp_sk_can_gso(const struct sock *sk)
++{
++ return false;
++}
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ return false;
++}
++static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
++ u32 mss_now, int large_allowed)
++{
++ return 0;
++}
++static inline void mptcp_destroy_sock(struct sock *sk) {}
++static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
++ struct sock **skptr,
++ struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ return 0;
++}
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ return false;
++}
++static inline int mptcp_init_tw_sock(struct sock *sk,
++ struct tcp_timewait_sock *tw)
++{
++ return 0;
++}
++static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
++static inline void mptcp_disconnect(struct sock *sk) {}
++static inline void mptcp_tsq_flags(struct sock *sk) {}
++static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
++static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
++static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
++static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct tcp_options_received *rx_opt,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb) {}
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_H */
+diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
+new file mode 100644
+index 000000000000..93ad97c77c5a
+--- /dev/null
++++ b/include/net/mptcp_v4.h
+@@ -0,0 +1,67 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef MPTCP_V4_H_
++#define MPTCP_V4_H_
++
++
++#include <linux/in.h>
++#include <linux/skbuff.h>
++#include <net/mptcp.h>
++#include <net/request_sock.h>
++#include <net/sock.h>
++
++extern struct request_sock_ops mptcp_request_sock_ops;
++extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++#ifdef CONFIG_MPTCP
++
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net);
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem);
++int mptcp_pm_v4_init(void);
++void mptcp_pm_v4_undo(void);
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++
++#else
++
++static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
++ const struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* MPTCP_V4_H_ */
+diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
+new file mode 100644
+index 000000000000..49a4f30ccd4d
+--- /dev/null
++++ b/include/net/mptcp_v6.h
+@@ -0,0 +1,69 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_V6_H
++#define _MPTCP_V6_H
++
++#include <linux/in6.h>
++#include <net/if_inet6.h>
++
++#include <net/mptcp.h>
++
++
++#ifdef CONFIG_MPTCP
++extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
++extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
++extern struct request_sock_ops mptcp6_request_sock_ops;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net);
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem);
++int mptcp_pm_v6_init(void);
++void mptcp_pm_v6_undo(void);
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++
++#else /* CONFIG_MPTCP */
++
++#define mptcp_v6_mapped ipv6_mapped
++
++static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_V6_H */
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..bae95a11c531 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -16,6 +16,7 @@
+ #include <net/netns/packet.h>
+ #include <net/netns/ipv4.h>
+ #include <net/netns/ipv6.h>
++#include <net/netns/mptcp.h>
+ #include <net/netns/ieee802154_6lowpan.h>
+ #include <net/netns/sctp.h>
+ #include <net/netns/dccp.h>
+@@ -92,6 +93,9 @@ struct net {
+ #if IS_ENABLED(CONFIG_IPV6)
+ struct netns_ipv6 ipv6;
+ #endif
++#if IS_ENABLED(CONFIG_MPTCP)
++ struct netns_mptcp mptcp;
++#endif
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ struct netns_ieee802154_lowpan ieee802154_lowpan;
+ #endif
+diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
+new file mode 100644
+index 000000000000..bad418b04cc8
+--- /dev/null
++++ b/include/net/netns/mptcp.h
+@@ -0,0 +1,44 @@
++/*
++ * MPTCP implementation - MPTCP namespace
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef __NETNS_MPTCP_H__
++#define __NETNS_MPTCP_H__
++
++#include <linux/compiler.h>
++
++enum {
++ MPTCP_PM_FULLMESH = 0,
++ MPTCP_PM_MAX
++};
++
++struct netns_mptcp {
++ void *path_managers[MPTCP_PM_MAX];
++};
++
++#endif /* __NETNS_MPTCP_H__ */
+diff --git a/include/net/request_sock.h b/include/net/request_sock.h
+index 7f830ff67f08..e79e87a8e1a6 100644
+--- a/include/net/request_sock.h
++++ b/include/net/request_sock.h
+@@ -164,7 +164,7 @@ struct request_sock_queue {
+ };
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries);
++ unsigned int nr_table_entries, gfp_t flags);
+
+ void __reqsk_queue_destroy(struct request_sock_queue *queue);
+ void reqsk_queue_destroy(struct request_sock_queue *queue);
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..0e23cae8861f 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
+
+ int sk_wait_data(struct sock *sk, long *timeo);
+
++/* START - needed for MPTCP */
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
++void sock_lock_init(struct sock *sk);
++
++extern struct lock_class_key af_callback_keys[AF_MAX];
++extern char *const af_family_clock_key_strings[AF_MAX+1];
++
++#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
++/* END - needed for MPTCP */
++
+ struct request_sock_ops;
+ struct timewait_sock_ops;
+ struct inet_hashinfo;
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..ff92e74cd684 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TCPOPT_SACK 5 /* SACK Block */
+ #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
+ #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
++#define TCPOPT_MPTCP 30
+ #define TCPOPT_EXP 254 /* Experimental */
+ /* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TFO_SERVER_WO_SOCKOPT1 0x400
+ #define TFO_SERVER_WO_SOCKOPT2 0x800
+
++/* Flags from tcp_input.c for tcp_ack */
++#define FLAG_DATA 0x01 /* Incoming frame contained data. */
++#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
++#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
++#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
++#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
++#define FLAG_DATA_SACKED 0x20 /* New SACK. */
++#define FLAG_ECE 0x40 /* ECE in this ACK */
++#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update. */
++#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
++#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
++#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
++#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
++#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
++#define MPTCP_FLAG_DATA_ACKED 0x8000
++
++#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
++#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
++#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
++#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
++
+ extern struct inet_timewait_death_row tcp_death_row;
+
+ /* sysctl variables for tcp */
+@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+ #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+ #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
++/**** START - Exports needed for MPTCP ****/
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
++
++struct mptcp_options_received;
++
++void tcp_enter_quickack_mode(struct sock *sk);
++int tcp_close_state(struct sock *sk);
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb);
++int tcp_xmit_probe_skb(struct sock *sk, int urgent);
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask);
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle);
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle);
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss);
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++void __pskb_trim_head(struct sk_buff *skb, int len);
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
++void tcp_reset(struct sock *sk);
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin);
++bool tcp_urg_mode(const struct tcp_sock *tp);
++void tcp_ack_probe(struct sock *sk);
++void tcp_rearm_rto(struct sock *sk);
++int tcp_write_timeout(struct sock *sk);
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set);
++void tcp_write_err(struct sock *sk);
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++
++int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc);
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_reqsk_destructor(struct request_sock *req);
++
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
++void tcp_v6_destroy_sock(struct sock *sk);
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
++void tcp_v6_hash(struct sock *sk);
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb);
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst);
++void tcp_v6_reqsk_destructor(struct request_sock *req);
++
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
++ int large_allowed);
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
++
++void skb_clone_fraglist(struct sk_buff *skb);
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
++
++void inet_twsk_free(struct inet_timewait_sock *tw);
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
++/* These states need RST on ABORT according to RFC793 */
++static inline bool tcp_need_reset(int state)
++{
++ return (1 << state) &
++ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
++ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
++}
++
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
++ int hlen);
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen);
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
++ struct sk_buff *from, bool *fragstolen);
++/**** END - Exports needed for MPTCP ****/
++
+ void tcp_tasklet_init(void);
+
+ void tcp_v4_err(struct sk_buff *skb, u32);
+@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ size_t len, int nonblock, int flags, int *addr_len);
+ void tcp_parse_options(const struct sk_buff *skb,
+ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt_rx,
+ int estab, struct tcp_fastopen_cookie *foc);
+ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+
+@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+
+ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ u16 *mssp);
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
+-#else
+-static inline __u32 cookie_v4_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
+ #endif
+
+ __u32 cookie_init_timestamp(struct request_sock *req);
+@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+ const struct tcphdr *th, u16 *mssp);
+ __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
+ __u16 *mss);
+-#else
+-static inline __u32 cookie_v6_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
+ #endif
+ /* tcp_output.c */
+
+@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+ void tcp_send_loss_probe(struct sock *sk);
+ bool tcp_schedule_loss_probe(struct sock *sk);
+
++u16 tcp_select_window(struct sock *sk);
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++
+ /* tcp_input.c */
+ void tcp_resume_early_retransmit(struct sock *sk);
+ void tcp_rearm_rto(struct sock *sk);
+ void tcp_reset(struct sock *sk);
++void tcp_set_rto(struct sock *sk);
++bool tcp_should_expand_sndbuf(const struct sock *sk);
++bool tcp_prune_ofo_queue(struct sock *sk);
+
+ /* tcp_timer.c */
+ void tcp_init_xmit_timers(struct sock *);
+@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+ */
+ struct tcp_skb_cb {
+ union {
+- struct inet_skb_parm h4;
++ union {
++ struct inet_skb_parm h4;
+ #if IS_ENABLED(CONFIG_IPV6)
+- struct inet6_skb_parm h6;
++ struct inet6_skb_parm h6;
+ #endif
+- } header; /* For incoming frames */
++ } header; /* For incoming frames */
++#ifdef CONFIG_MPTCP
++ union { /* For MPTCP outgoing frames */
++ __u32 path_mask; /* paths that tried to send this skb */
++ __u32 dss[6]; /* DSS options */
++ };
++#endif
++ };
+ __u32 seq; /* Starting sequence number */
+ __u32 end_seq; /* SEQ + FIN + SYN + datalen */
+ __u32 when; /* used to compute rtt's */
++#ifdef CONFIG_MPTCP
++ __u8 mptcp_flags; /* flags for the MPTCP layer */
++ __u8 dss_off; /* Number of 4-byte words until
++ * seq-number */
++#endif
+ __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+
+ __u8 sacked; /* State flags for SACK/FACK. */
+@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+ /* Determine a window scaling and initial window to offer. */
+ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+ __u32 *window_clamp, int wscale_ok,
+- __u8 *rcv_wscale, __u32 init_rcv_wnd);
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
+
+ static inline int tcp_win_from_space(int space)
+ {
+@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+ space - (space>>sysctl_tcp_adv_win_scale);
+ }
+
++#ifdef CONFIG_MPTCP
++extern struct static_key mptcp_static_key;
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return static_key_false(&mptcp_static_key) && tp->mpc;
++}
++#else
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return false;
++}
++#endif
++
+ /* Note: caller must be prepared to deal with negative returns */
+ static inline int tcp_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf -
+ atomic_read(&sk->sk_rmem_alloc));
+ }
+
+ static inline int tcp_full_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf);
+ }
+
+@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+ ireq->wscale_ok = rx_opt->wscale_ok;
+ ireq->acked = 0;
+ ireq->ecn_ok = 0;
++ ireq->mptcp_rqsk = 0;
++ ireq->saw_mpc = 0;
+ ireq->ir_rmt_port = tcp_hdr(skb)->source;
+ ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
+ }
+@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+ void tcp4_proc_exit(void);
+ #endif
+
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb);
++
+ /* TCP af-specific functions */
+ struct tcp_sock_af_ops {
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+ #endif
+ };
+
++/* TCP/MPTCP-specific functions */
++struct tcp_sock_ops {
++ u32 (*__select_window)(struct sock *sk);
++ u16 (*select_window)(struct sock *sk);
++ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++ void (*init_buffer_space)(struct sock *sk);
++ void (*set_rto)(struct sock *sk);
++ bool (*should_expand_sndbuf)(const struct sock *sk);
++ void (*send_fin)(struct sock *sk);
++ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++ void (*send_active_reset)(struct sock *sk, gfp_t priority);
++ int (*write_wakeup)(struct sock *sk);
++ bool (*prune_ofo_queue)(struct sock *sk);
++ void (*retransmit_timer)(struct sock *sk);
++ void (*time_wait)(struct sock *sk, int state, int timeo);
++ void (*cleanup_rbuf)(struct sock *sk, int copied);
++ void (*init_congestion_control)(struct sock *sk);
++};
++extern const struct tcp_sock_ops tcp_specific;
++
+ struct tcp_request_sock_ops {
++ u16 mss_clamp;
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
+ struct request_sock *req);
+@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+ const struct request_sock *req,
+ const struct sk_buff *skb);
+ #endif
++ int (*init_req)(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb);
++#ifdef CONFIG_SYN_COOKIES
++ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
++#endif
++ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict);
++ __u32 (*init_seq)(const struct sk_buff *skb);
++ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
++ const unsigned long timeout);
+ };
+
++#ifdef CONFIG_SYN_COOKIES
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return ops->cookie_init_seq(sk, skb, mss);
++}
++#else
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return 0;
++}
++#endif
++
+ int tcpv4_offload_init(void);
+
+ void tcp_v4_init(void);
+diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
+index 9cf2394f0bcf..c2634b6ed854 100644
+--- a/include/uapi/linux/if.h
++++ b/include/uapi/linux/if.h
+@@ -109,6 +109,9 @@ enum net_device_flags {
+ #define IFF_DORMANT IFF_DORMANT
+ #define IFF_ECHO IFF_ECHO
+
++#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
++#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
++
+ #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
+ IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
+
+diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
+index 3b9718328d8b..487475681d84 100644
+--- a/include/uapi/linux/tcp.h
++++ b/include/uapi/linux/tcp.h
+@@ -112,6 +112,7 @@ enum {
+ #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
+ #define TCP_TIMESTAMP 24
+ #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
++#define MPTCP_ENABLED 26
+
+ struct tcp_repair_opt {
+ __u32 opt_code;
+diff --git a/net/Kconfig b/net/Kconfig
+index d92afe4204d9..96b58593ad5e 100644
+--- a/net/Kconfig
++++ b/net/Kconfig
+@@ -79,6 +79,7 @@ if INET
+ source "net/ipv4/Kconfig"
+ source "net/ipv6/Kconfig"
+ source "net/netlabel/Kconfig"
++source "net/mptcp/Kconfig"
+
+ endif # if INET
+
+diff --git a/net/Makefile b/net/Makefile
+index cbbbe6d657ca..244bac1435b1 100644
+--- a/net/Makefile
++++ b/net/Makefile
+@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
+ obj-$(CONFIG_XFRM) += xfrm/
+ obj-$(CONFIG_UNIX) += unix/
+ obj-$(CONFIG_NET) += ipv6/
++obj-$(CONFIG_MPTCP) += mptcp/
+ obj-$(CONFIG_PACKET) += packet/
+ obj-$(CONFIG_NET_KEY) += key/
+ obj-$(CONFIG_BRIDGE) += bridge/
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..215d2757fbf6 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
+
+ dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
+ IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
+- IFF_AUTOMEDIA)) |
++ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
+ (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
+ IFF_ALLMULTI));
+
+diff --git a/net/core/request_sock.c b/net/core/request_sock.c
+index 467f326126e0..909dfa13f499 100644
+--- a/net/core/request_sock.c
++++ b/net/core/request_sock.c
+@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
+ EXPORT_SYMBOL(sysctl_max_syn_backlog);
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries)
++ unsigned int nr_table_entries,
++ gfp_t flags)
+ {
+ size_t lopt_size = sizeof(struct listen_sock);
+ struct listen_sock *lopt;
+@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
+ nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
+ lopt_size += nr_table_entries * sizeof(struct request_sock *);
+ if (lopt_size > PAGE_SIZE)
+- lopt = vzalloc(lopt_size);
++ lopt = __vmalloc(lopt_size,
++ flags | __GFP_HIGHMEM | __GFP_ZERO,
++ PAGE_KERNEL);
+ else
+- lopt = kzalloc(lopt_size, GFP_KERNEL);
++ lopt = kzalloc(lopt_size, flags);
+ if (lopt == NULL)
+ return -ENOMEM;
+
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..8abc5d60fbe3 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
+ skb_drop_list(&skb_shinfo(skb)->frag_list);
+ }
+
+-static void skb_clone_fraglist(struct sk_buff *skb)
++void skb_clone_fraglist(struct sk_buff *skb)
+ {
+ struct sk_buff *list;
+
+@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
+ skb->inner_mac_header += off;
+ }
+
+-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+ {
+ __copy_skb_header(new, old);
+
+diff --git a/net/core/sock.c b/net/core/sock.c
+index 026e01f70274..359295523177 100644
+--- a/net/core/sock.c
++++ b/net/core/sock.c
+@@ -136,6 +136,11 @@
+
+ #include <trace/events/sock.h>
+
++#ifdef CONFIG_MPTCP
++#include <net/mptcp.h>
++#include <net/inet_common.h>
++#endif
++
+ #ifdef CONFIG_INET
+ #include <net/tcp.h>
+ #endif
+@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
+ "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
+ "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
+ };
+-static const char *const af_family_clock_key_strings[AF_MAX+1] = {
++char *const af_family_clock_key_strings[AF_MAX+1] = {
+ "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
+ "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
+ "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
+@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
+ * sk_callback_lock locking rules are per-address-family,
+ * so split the lock classes by using a per-AF key:
+ */
+-static struct lock_class_key af_callback_keys[AF_MAX];
++struct lock_class_key af_callback_keys[AF_MAX];
+
+ /* Take into consideration the size of the struct sk_buff overhead in the
+ * determination of these values, since that is non-constant across
+@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
+ }
+ }
+
+-#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
+-
+ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
+ {
+ if (sk->sk_flags & flags) {
+@@ -1253,8 +1256,25 @@ lenout:
+ *
+ * (We also register the sk_lock with the lock validator.)
+ */
+-static inline void sock_lock_init(struct sock *sk)
+-{
++void sock_lock_init(struct sock *sk)
++{
++#ifdef CONFIG_MPTCP
++ /* Reclassify the lock-class for subflows */
++ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
++ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
++ &meta_slock_key,
++ "sk_lock-AF_INET-MPTCP",
++ &meta_key);
++
++ /* We don't yet have the MPTCP endpoint,
++ * so we still need inet_sock_destruct.
++ */
++ sk->sk_destruct = inet_sock_destruct;
++ return;
++ }
++#endif
++
+ sock_lock_init_class_and_name(sk,
+ af_family_slock_key_strings[sk->sk_family],
+ af_family_slock_keys + sk->sk_family,
+@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
+ }
+ EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
+
+-static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
+ int family)
+ {
+ struct sock *sk;
+diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
+index 4db3c2a1679c..04cb17d4b0ce 100644
+--- a/net/dccp/ipv6.c
++++ b/net/dccp/ipv6.c
+@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+ goto drop;
+
+- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
++ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
+ if (req == NULL)
+ goto drop;
+
+diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
+index 05c57f0fcabe..630434db0085 100644
+--- a/net/ipv4/Kconfig
++++ b/net/ipv4/Kconfig
+@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+ For further details see:
+ http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+
++config TCP_CONG_COUPLED
++ tristate "MPTCP COUPLED CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Coupled Congestion Control
++ To enable it, just put 'coupled' in tcp_congestion_control
++
++config TCP_CONG_OLIA
++ tristate "MPTCP Opportunistic Linked Increase"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Opportunistic Linked Increase Congestion Control
++ To enable it, just put 'olia' in tcp_congestion_control
++
++config TCP_CONG_WVEGAS
++ tristate "MPTCP WVEGAS CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ wVegas congestion control for MPTCP
++ To enable it, just put 'wvegas' in tcp_congestion_control
++
+ choice
+ prompt "Default TCP congestion control"
+ default DEFAULT_CUBIC
+@@ -584,6 +608,15 @@ choice
+ config DEFAULT_WESTWOOD
+ bool "Westwood" if TCP_CONG_WESTWOOD=y
+
++ config DEFAULT_COUPLED
++ bool "Coupled" if TCP_CONG_COUPLED=y
++
++ config DEFAULT_OLIA
++ bool "Olia" if TCP_CONG_OLIA=y
++
++ config DEFAULT_WVEGAS
++ bool "Wvegas" if TCP_CONG_WVEGAS=y
++
+ config DEFAULT_RENO
+ bool "Reno"
+
+@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+ default "vegas" if DEFAULT_VEGAS
+ default "westwood" if DEFAULT_WESTWOOD
+ default "veno" if DEFAULT_VENO
++ default "coupled" if DEFAULT_COUPLED
++ default "wvegas" if DEFAULT_WVEGAS
+ default "reno" if DEFAULT_RENO
+ default "cubic"
+
+diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
+index d156b3c5f363..4afd6d8d9028 100644
+--- a/net/ipv4/af_inet.c
++++ b/net/ipv4/af_inet.c
+@@ -104,6 +104,7 @@
+ #include <net/ip_fib.h>
+ #include <net/inet_connection_sock.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/ping.h>
+@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
+ * Create an inet socket.
+ */
+
+-static int inet_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct sock *sk;
+ struct inet_protosw *answer;
+@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
+ lock_sock(sk2);
+
+ sock_rps_record_flow(sk2);
++
++ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
++ struct sock *sk_it = sk2;
++
++ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++
++ if (tcp_sk(sk2)->mpcb->master_sk) {
++ sk_it = tcp_sk(sk2)->mpcb->master_sk;
++
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_it->sk_wq = newsock->wq;
++ sk_it->sk_socket = newsock;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++ }
++
+ WARN_ON(!((1 << sk2->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_SYN_RECV |
+ TCPF_CLOSE_WAIT | TCPF_CLOSE)));
+@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
+
+ ip_init();
+
++ /* We must initialize MPTCP before TCP. */
++ mptcp_init();
++
+ tcp_v4_init();
+
+ /* Setup TCP slab cache for open requests. */
+diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
+index 14d02ea905b6..7d734d8af19b 100644
+--- a/net/ipv4/inet_connection_sock.c
++++ b/net/ipv4/inet_connection_sock.c
+@@ -23,6 +23,7 @@
+ #include <net/route.h>
+ #include <net/tcp_states.h>
+ #include <net/xfrm.h>
++#include <net/mptcp.h>
+
+ #ifdef INET_CSK_DEBUG
+ const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
+@@ -465,8 +466,8 @@ no_route:
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
+
+-static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize)
+ {
+ return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
+ }
+@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
+
+ lopt->clock_hand = i;
+
+- if (lopt->qlen)
++ if (lopt->qlen && !is_meta_sk(parent))
+ inet_csk_reset_keepalive_timer(parent, interval);
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
+@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority)
+ {
+- struct sock *newsk = sk_clone_lock(sk, priority);
++ struct sock *newsk;
++
++ newsk = sk_clone_lock(sk, priority);
+
+ if (newsk != NULL) {
+ struct inet_connection_sock *newicsk = inet_csk(newsk);
+@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
++ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
++ GFP_KERNEL);
+
+ if (rc != 0)
+ return rc;
+@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ while ((req = acc_req) != NULL) {
+ struct sock *child = req->sk;
++ bool mutex_taken = false;
+
+ acc_req = req->dl_next;
+
++ if (is_meta_sk(child)) {
++ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
++ mutex_taken = true;
++ }
+ local_bh_disable();
+ bh_lock_sock(child);
+ WARN_ON(sock_owned_by_user(child));
+@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ bh_unlock_sock(child);
+ local_bh_enable();
++ if (mutex_taken)
++ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
+ sock_put(child);
+
+ sk_acceptq_removed(sk);
+diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
+index c86624b36a62..0ff3fe004d62 100644
+--- a/net/ipv4/syncookies.c
++++ b/net/ipv4/syncookies.c
+@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ }
+ EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
+
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mssp)
+ {
+ const struct iphdr *iph = ip_hdr(skb);
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+ /* Try to redo what tcp_v4_send_synack did. */
+ req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(&rt->dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(&rt->dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..2cb89f886d45 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -271,6 +271,7 @@
+
+ #include <net/icmp.h>
+ #include <net/inet_common.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/xfrm.h>
+ #include <net/ip.h>
+@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+ return period;
+ }
+
++const struct tcp_sock_ops tcp_specific = {
++ .__select_window = __tcp_select_window,
++ .select_window = tcp_select_window,
++ .select_initial_window = tcp_select_initial_window,
++ .init_buffer_space = tcp_init_buffer_space,
++ .set_rto = tcp_set_rto,
++ .should_expand_sndbuf = tcp_should_expand_sndbuf,
++ .init_congestion_control = tcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
+ /* Address-family independent initialization for a tcp_sock.
+ *
+ * NOTE: A lot of things set to zero explicitly by call to
+@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+ sk->sk_sndbuf = sysctl_tcp_wmem[1];
+ sk->sk_rcvbuf = sysctl_tcp_rmem[1];
+
++ tp->ops = &tcp_specific;
++
+ local_bh_disable();
+ sock_update_memcg(sk);
+ sk_sockets_allocated_inc(sk);
+@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ int ret;
+
+ sock_rps_record_flow(sk);
++
++#ifdef CONFIG_MPTCP
++ if (mptcp(tcp_sk(sk))) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
+ /*
+ * We can't seek on a socket input
+ */
+@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+ return NULL;
+ }
+
+-static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
+- int large_allowed)
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 xmit_size_goal, old_size_goal;
+@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+ {
+ int mss_now;
+
+- mss_now = tcp_current_mss(sk);
+- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ if (mptcp(tcp_sk(sk))) {
++ mss_now = mptcp_current_mss(sk);
++ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ } else {
++ mss_now = tcp_current_mss(sk);
++ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ }
+
+ return mss_now;
+ }
+@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto out_err;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++
++		/* We must check this with the socket lock held because we iterate
++ * over the subflows.
++ */
++ if (!mptcp_can_sendpage(sk)) {
++ ssize_t ret;
++
++ release_sock(sk);
++ ret = sock_no_sendpage(sk->sk_socket, page, offset,
++ size, flags);
++ lock_sock(sk);
++ return ret;
++ }
++
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ mss_now = tcp_send_mss(sk, &size_goal, flags);
+@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+ {
+ ssize_t res;
+
+- if (!(sk->sk_route_caps & NETIF_F_SG) ||
+- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
++ /* If MPTCP is enabled, we check it later after establishment */
++ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
++ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
+ return sock_no_sendpage(sk->sk_socket, page, offset, size,
+ flags);
+
+@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+ const struct tcp_sock *tp = tcp_sk(sk);
+ int tmp = tp->mss_cache;
+
++ if (mptcp(tp))
++ return mptcp_select_size(sk, sg);
++
+ if (sg) {
+ if (sk_can_gso(sk)) {
+ /* Small frames wont use a full page:
+@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto do_error;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ if (unlikely(tp->repair)) {
+ if (tp->repair_queue == TCP_RECV_QUEUE) {
+ copied = tcp_send_rcvq(sk, msg, size);
+@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
+ goto out_err;
+
+- sg = !!(sk->sk_route_caps & NETIF_F_SG);
++ if (mptcp(tp))
++ sg = mptcp_can_sg(sk);
++ else
++ sg = !!(sk->sk_route_caps & NETIF_F_SG);
+
+ while (--iovlen >= 0) {
+ size_t seglen = iov->iov_len;
+@@ -1183,8 +1251,15 @@ new_segment:
+
+ /*
+ * Check whether we can use HW checksum.
++ *
++ * If dss-csum is enabled, we do not do hw-csum.
++ * In case of non-mptcp we check the
++ * device-capabilities.
++ * In case of mptcp, hw-csum's will be handled
++ * later in mptcp_write_xmit.
+ */
+- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
++ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
++ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ skb_entail(sk, skb);
+@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+
+ /* Optimize, __tcp_select_window() is not cheap. */
+ if (2*rcv_window_now <= tp->window_clamp) {
+- __u32 new_window = __tcp_select_window(sk);
++ __u32 new_window = tp->ops->__select_window(sk);
+
+ /* Send ACK now, if this read freed lots of space
+ * in our buffer. Certainly, new_window is new window.
+@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ /* Clean up data we have read: This will do ACK frames. */
+ if (copied > 0) {
+ tcp_recv_skb(sk, seq, &offset);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ }
+ return copied;
+ }
+@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+
+ lock_sock(sk);
+
++#ifdef CONFIG_MPTCP
++ if (mptcp(tp)) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
++
+ err = -ENOTCONN;
+ if (sk->sk_state == TCP_LISTEN)
+ goto out;
+@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ }
+ }
+
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+ /* Install new reader */
+@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (tp->rcv_wnd == 0 &&
+ !skb_queue_empty(&sk->sk_async_wait_queue)) {
+ tcp_service_net_dma(sk, true);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ } else
+ dma_async_issue_pending(tp->ucopy.dma_chan);
+ }
+@@ -1993,7 +2076,7 @@ skip_copy:
+ */
+
+ /* Clean up data we have read: This will do ACK frames. */
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ release_sock(sk);
+ return copied;
+@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+ /* TCP_CLOSING */ TCP_CLOSING,
+ };
+
+-static int tcp_close_state(struct sock *sk)
++int tcp_close_state(struct sock *sk)
+ {
+ int next = (int)new_state[sk->sk_state];
+ int ns = next & TCP_STATE_MASK;
+@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+ TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
+ /* Clear out any half completed packets. FIN if needed. */
+ if (tcp_close_state(sk))
+- tcp_send_fin(sk);
++ tcp_sk(sk)->ops->send_fin(sk);
+ }
+ }
+ EXPORT_SYMBOL(tcp_shutdown);
+@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+ int data_was_unread = 0;
+ int state;
+
++ if (is_meta_sk(sk)) {
++ mptcp_close(sk, timeout);
++ return;
++ }
++
+ lock_sock(sk);
+ sk->sk_shutdown = SHUTDOWN_MASK;
+
+@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+ /* Unread data was tossed, zap the connection. */
+ NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, sk->sk_allocation);
++ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
+ } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
+ /* Check zero linger _after_ checking for unread data. */
+ sk->sk_prot->disconnect(sk, 0);
+@@ -2247,7 +2335,7 @@ adjudge_to_death:
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (tp->linger2 < 0) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONLINGER);
+ } else {
+@@ -2257,7 +2345,8 @@ adjudge_to_death:
+ inet_csk_reset_keepalive_timer(sk,
+ tmo - TCP_TIMEWAIT_LEN);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
++ tmo);
+ goto out;
+ }
+ }
+@@ -2266,7 +2355,7 @@ adjudge_to_death:
+ sk_mem_reclaim(sk);
+ if (tcp_check_oom(sk, 0)) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONMEMORY);
+ }
+@@ -2291,15 +2380,6 @@ out:
+ }
+ EXPORT_SYMBOL(tcp_close);
+
+-/* These states need RST on ABORT according to RFC793 */
+-
+-static inline bool tcp_need_reset(int state)
+-{
+- return (1 << state) &
+- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
+- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
+-}
+-
+ int tcp_disconnect(struct sock *sk, int flags)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+ /* The last check adjusts for discrepancy of Linux wrt. RFC
+ * states
+ */
+- tcp_send_active_reset(sk, gfp_any());
++ tp->ops->send_active_reset(sk, gfp_any());
+ sk->sk_err = ECONNRESET;
+ } else if (old_state == TCP_SYN_SENT)
+ sk->sk_err = ECONNRESET;
+@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+ if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+ inet_reset_saddr(sk);
+
++ if (is_meta_sk(sk)) {
++ mptcp_disconnect(sk);
++ } else {
++ if (tp->inside_tk_table)
++ mptcp_hash_remove_bh(tp);
++ }
++
+ sk->sk_shutdown = 0;
+ sock_reset_flag(sk, SOCK_DONE);
+ tp->srtt_us = 0;
+@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ break;
+
+ case TCP_DEFER_ACCEPT:
++ /* An established MPTCP-connection (mptcp(tp) only returns true
++ * if the socket is established) should not use DEFER on new
++ * subflows.
++ */
++ if (mptcp(tp))
++ break;
+ /* Translate value in seconds to number of retransmits */
+ icsk->icsk_accept_queue.rskq_defer_accept =
+ secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
+ inet_csk_ack_scheduled(sk)) {
+ icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
+- tcp_cleanup_rbuf(sk, 1);
++ tp->ops->cleanup_rbuf(sk, 1);
+ if (!(val & 1))
+ icsk->icsk_ack.pingpong = 1;
+ }
+@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ tp->notsent_lowat = val;
+ sk->sk_write_space(sk);
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
++ if (val)
++ tp->mptcp_enabled = 1;
++ else
++ tp->mptcp_enabled = 0;
++ } else {
++ err = -EPERM;
++ }
++ break;
++#endif
+ default:
+ err = -ENOPROTOOPT;
+ break;
+@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+ case TCP_NOTSENT_LOWAT:
+ val = tp->notsent_lowat;
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ val = tp->mptcp_enabled;
++ break;
++#endif
+ default:
+ return -ENOPROTOOPT;
+ }
+@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+ if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
+
++ WARN_ON(sk->sk_state == TCP_CLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
++
+ tcp_clear_xmit_timers(sk);
++
+ if (req != NULL)
+ reqsk_fastopen_remove(sk, req, false);
+
+diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
+index 9771563ab564..5c230d96c4c1 100644
+--- a/net/ipv4/tcp_fastopen.c
++++ b/net/ipv4/tcp_fastopen.c
+@@ -7,6 +7,7 @@
+ #include <linux/rculist.h>
+ #include <net/inetpeer.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+
+ int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
+
+@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ {
+ struct tcp_sock *tp;
+ struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
+- struct sock *child;
++ struct sock *child, *meta_sk;
+
+ req->num_retrans = 0;
+ req->num_timeout = 0;
+@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ /* Add the child socket directly into the accept queue */
+ inet_csk_reqsk_queue_add(sk, req, child);
+
+- /* Now finish processing the fastopen child socket. */
+- inet_csk(child)->icsk_af_ops->rebuild_header(child);
+- tcp_init_congestion_control(child);
+- tcp_mtup_init(child);
+- tcp_init_metrics(child);
+- tcp_init_buffer_space(child);
+-
+ /* Queue the data carried in the SYN packet. We need to first
+ * bump skb's refcnt because the caller will attempt to free it.
+ *
+@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ tp->syn_data_acked = 1;
+ }
+ tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++
++ meta_sk = child;
++ if (!mptcp_check_req_fastopen(meta_sk, req)) {
++ child = tcp_sk(meta_sk)->mpcb->master_sk;
++ tp = tcp_sk(child);
++ }
++
++ /* Now finish processing the fastopen child socket. */
++ inet_csk(child)->icsk_af_ops->rebuild_header(child);
++ tp->ops->init_congestion_control(child);
++ tcp_mtup_init(child);
++ tcp_init_metrics(child);
++ tp->ops->init_buffer_space(child);
++
+ sk->sk_data_ready(sk);
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ WARN_ON(req->sk == NULL);
+ return true;
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..3273bb69f387 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -74,6 +74,9 @@
+ #include <linux/ipsec.h>
+ #include <asm/unaligned.h>
+ #include <net/netdma.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
+
+ int sysctl_tcp_timestamps __read_mostly = 1;
+ int sysctl_tcp_window_scaling __read_mostly = 1;
+@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
+ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
+ int sysctl_tcp_early_retrans __read_mostly = 3;
+
+-#define FLAG_DATA 0x01 /* Incoming frame contained data. */
+-#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
+-#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
+-#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
+-#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
+-#define FLAG_DATA_SACKED 0x20 /* New SACK. */
+-#define FLAG_ECE 0x40 /* ECE in this ACK */
+-#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
+-#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
+-#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
+-#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
+-#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
+-#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
+-
+-#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
+-#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
+-#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
+-#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
+-
+ #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
+ #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
+
+@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
+ icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+ }
+
+-static void tcp_enter_quickack_mode(struct sock *sk)
++void tcp_enter_quickack_mode(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ tcp_incr_quickack(sk);
+@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ per_mss = roundup_pow_of_two(per_mss) +
+ SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
+- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ if (mptcp(tp)) {
++ nr_segs = mptcp_check_snd_buf(tp);
++ } else {
++ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
++ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ }
+
+ /* Fast Recovery (RFC 5681 3.2) :
+ * Cubic needs 1.7 factor, rounded to 2 to include
+@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ */
+ sndmem = 2 * nr_segs * per_mss;
+
+- if (sk->sk_sndbuf < sndmem)
++ /* MPTCP: after this sndmem is the new contribution of the
++ * current subflow to the aggregated sndbuf */
++ if (sk->sk_sndbuf < sndmem) {
++ int old_sndbuf = sk->sk_sndbuf;
+ sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
++		/* MPTCP: ok, the subflow sndbuf has grown, reflect
++		 * this in the aggregate buffer. */
++ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
++ mptcp_update_sndbuf(tp);
++ }
+ }
+
+ /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
+@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
+ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
+
+ /* Check #1 */
+- if (tp->rcv_ssthresh < tp->window_clamp &&
+- (int)tp->rcv_ssthresh < tcp_space(sk) &&
++ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
+ !sk_under_memory_pressure(sk)) {
+ int incr;
+
+@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ * will fit to rcvbuf in future.
+ */
+ if (tcp_win_from_space(skb->truesize) <= skb->len)
+- incr = 2 * tp->advmss;
++ incr = 2 * meta_tp->advmss;
+ else
+- incr = __tcp_grow_window(sk, skb);
++ incr = __tcp_grow_window(meta_sk, skb);
+
+ if (incr) {
+ incr = max_t(int, incr, 2 * skb->len);
+- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
+- tp->window_clamp);
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
++ meta_tp->window_clamp);
+ inet_csk(sk)->icsk_ack.quick |= 1;
+ }
+ }
+@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
+ int copied;
+
+ time = tcp_time_stamp - tp->rcvq_space.time;
+- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
++ if (mptcp(tp)) {
++ if (mptcp_check_rtt(tp, time))
++ return;
++ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+ return;
+
+ /* Number of bytes copied to user in last RTT */
+@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
+ /* Calculate rto without backoff. This is the second half of Van Jacobson's
+ * routine referred to above.
+ */
+-static void tcp_set_rto(struct sock *sk)
++void tcp_set_rto(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ /* Old crap is replaced with new one. 8)
+@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
+ int len;
+ int in_sack;
+
+- if (!sk_can_gso(sk))
++ /* For MPTCP we cannot shift skb-data and remove one skb from the
++	 * send-queue, because this will make us lose the DSS-option (which
++ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
++ */
++ if (!sk_can_gso(sk) || mptcp(tp))
+ goto fallback;
+
+ /* Normally R but no L won't result in plain S */
+@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
+ return false;
+
+ tcp_rtt_estimator(sk, seq_rtt_us);
+- tcp_set_rto(sk);
++ tp->ops->set_rto(sk);
+
+ /* RFC6298: only reset backoff on valid RTT measurement. */
+ inet_csk(sk)->icsk_backoff = 0;
+@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
+ }
+
+ /* If we get here, the whole TSO packet has not been acked. */
+-static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 packets_acked;
+@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ */
+ if (!(scb->tcp_flags & TCPHDR_SYN)) {
+ flag |= FLAG_DATA_ACKED;
++ if (mptcp(tp) && mptcp_is_data_seq(skb))
++ flag |= MPTCP_FLAG_DATA_ACKED;
+ } else {
+ flag |= FLAG_SYN_ACKED;
+ tp->retrans_stamp = 0;
+@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ return flag;
+ }
+
+-static void tcp_ack_probe(struct sock *sk)
++void tcp_ack_probe(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
+ /* Check that window update is acceptable.
+ * The function assumes that snd_una<=ack<=snd_next.
+ */
+-static inline bool tcp_may_update_window(const struct tcp_sock *tp,
+- const u32 ack, const u32 ack_seq,
+- const u32 nwin)
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin)
+ {
+ return after(ack, tp->snd_una) ||
+ after(ack_seq, tp->snd_wl1) ||
+@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
+ }
+
+ /* This routine deals with incoming acks, but not outgoing ones. */
+-static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
++static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ sack_rtt_us);
+ acked -= tp->packets_out;
+
++ if (mptcp(tp)) {
++ if (mptcp_fallback_infinite(sk, flag)) {
++ pr_err("%s resetting flow\n", __func__);
++ mptcp_send_reset(sk);
++ goto invalid_ack;
++ }
++
++ mptcp_clean_rtx_infinite(skb, sk);
++ }
++
+ /* Advance cwnd if state allows */
+ if (tcp_may_raise_cwnd(sk, flag))
+ tcp_cong_avoid(sk, ack, acked);
+@@ -3512,8 +3528,9 @@ old_ack:
+ * the fast version below fails.
+ */
+ void tcp_parse_options(const struct sk_buff *skb,
+- struct tcp_options_received *opt_rx, int estab,
+- struct tcp_fastopen_cookie *foc)
++ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt,
++ int estab, struct tcp_fastopen_cookie *foc)
+ {
+ const unsigned char *ptr;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
+ */
+ break;
+ #endif
++ case TCPOPT_MPTCP:
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ break;
+ case TCPOPT_EXP:
+ /* Fast Open option shares code 254 using a
+ * 16 bits magic number. It's valid only in
+@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
+ if (tcp_parse_aligned_timestamp(tp, th))
+ return true;
+ }
+-
+- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
++ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
++ 1, NULL);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst_metric(dst, RTAX_QUICKACK))
+ inet_csk(sk)->icsk_ack.pingpong = 1;
++ if (mptcp(tp))
++ mptcp_sub_close_passive(sk);
+ break;
+
+ case TCP_CLOSE_WAIT:
+@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
+ tcp_set_state(sk, TCP_CLOSING);
+ break;
+ case TCP_FIN_WAIT2:
++ if (mptcp(tp)) {
++ /* The socket will get closed by mptcp_data_ready.
++ * We first have to process all data-sequences.
++ */
++ tp->close_it = 1;
++ break;
++ }
+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
+ tcp_send_ack(sk);
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ break;
+ default:
+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
+@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
+ if (!sock_flag(sk, SOCK_DEAD)) {
+ sk->sk_state_change(sk);
+
++ /* Don't wake up MPTCP-subflows */
++ if (mptcp(tp))
++ return;
++
+ /* Do not send POLL_HUP for half duplex close. */
+ if (sk->sk_shutdown == SHUTDOWN_MASK ||
+ sk->sk_state == TCP_CLOSE)
+@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
+ }
+
+- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
++ /* In case of MPTCP, the segment may be empty if it's a
++ * non-data DATA_FIN. (see beginning of tcp_data_queue)
++ */
++ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
++ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
+ SOCK_DEBUG(sk, "ofo packet was already received\n");
+ __skb_unlink(skb, &tp->out_of_order_queue);
+ __kfree_skb(skb);
+@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
+ }
+ }
+
+-static bool tcp_prune_ofo_queue(struct sock *sk);
+ static int tcp_prune_queue(struct sock *sk);
+
+ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ unsigned int size)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = mptcp_meta_sk(sk);
++
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+ !sk_rmem_schedule(sk, skb, size)) {
+
+@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size)) {
+- if (!tcp_prune_ofo_queue(sk))
++ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size))
+@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns true if caller should free @from instead of queueing it
+ */
+-static bool tcp_try_coalesce(struct sock *sk,
+- struct sk_buff *to,
+- struct sk_buff *from,
+- bool *fragstolen)
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
++ bool *fragstolen)
+ {
+ int delta;
+
+ *fragstolen = false;
+
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ return false;
++
+ if (tcp_hdr(from)->fin)
+ return false;
+
+@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+
+ /* Do skb overlap to previous one? */
+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
++ !(mptcp(tp) && end_seq == seq)) {
+ /* All the bits are present. Drop. */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
+ __kfree_skb(skb);
+@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+ end_seq);
+ break;
+ }
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
++ continue;
+ __skb_unlink(skb1, &tp->out_of_order_queue);
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+ TCP_SKB_CB(skb1)->end_seq);
+@@ -4280,8 +4325,8 @@ end:
+ }
+ }
+
+-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+- bool *fragstolen)
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen)
+ {
+ int eaten;
+ struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
+ int eaten = -1;
+ bool fragstolen = false;
+
+- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
++ /* If no data is present, but a data_fin is in the options, we still
++ * have to call mptcp_queue_skb later on. */
++ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
++ !(mptcp(tp) && mptcp_is_data_fin(skb)))
+ goto drop;
+
+ skb_dst_drop(skb);
+@@ -4389,7 +4437,7 @@ queue_and_out:
+ eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+ }
+ tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+- if (skb->len)
++ if (skb->len || mptcp_is_data_fin(skb))
+ tcp_event_data_recv(sk, skb);
+ if (th->fin)
+ tcp_fin(sk);
+@@ -4411,7 +4459,11 @@ queue_and_out:
+
+ if (eaten > 0)
+ kfree_skb_partial(skb, fragstolen);
+- if (!sock_flag(sk, SOCK_DEAD))
++ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
++ /* MPTCP: we always have to call data_ready, because
++ * we may be about to receive a data-fin, which still
++ * must get queued.
++ */
+ sk->sk_data_ready(sk);
+ return;
+ }
+@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
+ next = skb_queue_next(list, skb);
+
+ __skb_unlink(skb, list);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
+ __kfree_skb(skb);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
+
+@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
+ * Purge the out-of-order queue.
+ * Return true if queue was pruned.
+ */
+-static bool tcp_prune_ofo_queue(struct sock *sk)
++bool tcp_prune_ofo_queue(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool res = false;
+@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
+ /* Collapsing did not help, destructive actions follow.
+ * This must not ever occur. */
+
+- tcp_prune_ofo_queue(sk);
++ tp->ops->prune_ofo_queue(sk);
+
+ if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+ return 0;
+@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
+ return -1;
+ }
+
+-static bool tcp_should_expand_sndbuf(const struct sock *sk)
++/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
++ * As additional protections, we do not touch cwnd in retransmission phases,
++ * and if application hit its sndbuf limit recently.
++ */
++void tcp_cwnd_application_limited(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
++ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
++ /* Limited by application or receiver window. */
++ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
++ u32 win_used = max(tp->snd_cwnd_used, init_win);
++ if (win_used < tp->snd_cwnd) {
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
++ }
++ tp->snd_cwnd_used = 0;
++ }
++ tp->snd_cwnd_stamp = tcp_time_stamp;
++}
++
++bool tcp_should_expand_sndbuf(const struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+- if (tcp_should_expand_sndbuf(sk)) {
++ if (tp->ops->should_expand_sndbuf(sk)) {
+ tcp_sndbuf_expand(sk);
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
+ {
+ if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
+ sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
+- if (sk->sk_socket &&
+- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
++ if (mptcp(tcp_sk(sk)) ||
++ (sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
+ tcp_new_space(sk);
+ }
+ }
+@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+ /* ... and right edge of window advances far enough.
+ * (tcp_recvmsg() will send ACK otherwise). Or...
+ */
+- __tcp_select_window(sk) >= tp->rcv_wnd) ||
++ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
+ /* We ACK each frame or... */
+ tcp_in_quickack_mode(sk) ||
+ /* We have out of order data. */
+@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
++ /* MPTCP urgent data is not yet supported */
++ if (mptcp(tp))
++ return;
++
+ /* Check if we get a new urgent pointer - normally not. */
+ if (th->urg)
+ tcp_check_urg(sk, th);
+@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
+ }
+
+ #ifdef CONFIG_NET_DMA
+-static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
+- int hlen)
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ int chunk = skb->len - hlen;
+@@ -5052,9 +5132,15 @@ syn_challenge:
+ goto discard;
+ }
+
++ /* If valid: post process the received MPTCP options. */
++ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
++ goto discard;
++
+ return true;
+
+ discard:
++ if (mptcp(tp))
++ mptcp_reset_mopt(tp);
+ __kfree_skb(skb);
+ return false;
+ }
+@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+
+ tp->rx_opt.saw_tstamp = 0;
+
++ /* MPTCP: force slowpath. */
++ if (mptcp(tp))
++ goto slow_path;
++
+ /* pred_flags is 0xS?10 << 16 + snd_wnd
+ * if header_prediction is to be made
+ * 'S' will always be tp->tcp_header_len >> 2
+@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
+ }
+ if (copied_early)
+- tcp_cleanup_rbuf(sk, skb->len);
++ tp->ops->cleanup_rbuf(sk, skb->len);
+ }
+ if (!eaten) {
+ if (tcp_checksum_complete_user(sk, skb))
+@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
+
+ tcp_init_metrics(sk);
+
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ /* Prevent spurious tcp_cwnd_restart() on first data
+ * packet.
+ */
+ tp->lsndtime = tcp_time_stamp;
+
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+
+ if (sock_flag(sk, SOCK_KEEPOPEN))
+ inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
+@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+ /* Get original SYNACK MSS value if user MSS sets mss_clamp */
+ tcp_clear_options(&opt);
+ opt.user_mss = opt.mss_clamp = 0;
+- tcp_parse_options(synack, &opt, 0, NULL);
++ tcp_parse_options(synack, &opt, NULL, 0, NULL);
+ mss = opt.mss_clamp;
+ }
+
+@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+
+ tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
+
+- if (data) { /* Retransmit unacked data in SYN */
++	/* In the MPTCP case, we do not rely on "retransmit", but instead on
++	 * "transmit", because if the fastopen data is not acked, the retransmission
++	 * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
++	 */
++ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
+ tcp_for_write_queue_from(data, sk) {
+ if (data == tcp_send_head(sk) ||
+ __tcp_retransmit_skb(sk, data))
+@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_fastopen_cookie foc = { .len = -1 };
+ int saved_clamp = tp->rx_opt.mss_clamp;
++ struct mptcp_options_received mopt;
++ mptcp_init_mp_opt(&mopt);
+
+- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
++ tcp_parse_options(skb, &tp->rx_opt,
++ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
+ tcp_ack(sk, skb, FLAG_SLOWPATH);
+
++ if (tp->request_mptcp || mptcp(tp)) {
++ int ret;
++ ret = mptcp_rcv_synsent_state_process(sk, &sk,
++ skb, &mopt);
++
++ /* May have changed if we support MPTCP */
++ tp = tcp_sk(sk);
++ icsk = inet_csk(sk);
++
++ if (ret == 1)
++ goto reset_and_undo;
++ if (ret == 2)
++ goto discard;
++ }
++
++ if (mptcp(tp) && !is_master_tp(tp)) {
++		/* Timer for repeating the ACK until an answer
++		 * arrives. Used only when establishing an additional
++		 * subflow inside an MPTCP connection.
++		 */
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ }
++
+ /* Ok.. it's good. Set up sequence numbers and
+ * move to established.
+ */
+@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ if (tcp_is_sack(tp) && sysctl_tcp_fack)
+ tcp_enable_fack(tp);
+
+@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_rcv_fastopen_synack(sk, skb, &foc))
+ return -1;
+
+- if (sk->sk_write_pending ||
++ /* With MPTCP we cannot send data on the third ack due to the
++ * lack of option-space to combine with an MP_CAPABLE.
++ */
++ if (!mptcp(tp) && (sk->sk_write_pending ||
+ icsk->icsk_accept_queue.rskq_defer_accept ||
+- icsk->icsk_ack.pingpong) {
++ icsk->icsk_ack.pingpong)) {
+ /* Save one ACK. Data will be ready after
+ * several ticks, if write_pending is set.
+ *
+@@ -5536,6 +5665,7 @@ discard:
+ tcp_paws_reject(&tp->rx_opt, 0))
+ goto discard_and_undo;
+
++ /* TODO - check this here for MPTCP */
+ if (th->syn) {
+ /* We see SYN without ACK. It is attempt of
+ * simultaneous connect with crossed SYNs.
+@@ -5552,6 +5682,11 @@ discard:
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+
+@@ -5610,6 +5745,7 @@ reset_and_undo:
+
+ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ const struct tcphdr *th, unsigned int len)
++ __releases(&sk->sk_lock.slock)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_SYN_SENT:
+ queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
++ if (is_meta_sk(sk)) {
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ tp = tcp_sk(sk);
++
++ /* Need to call it here, because it will announce new
++ * addresses, which can only be done after the third ack
++ * of the 3-way handshake.
++ */
++ mptcp_update_metasocket(sk, tp->meta_sk);
++ }
+ if (queued >= 0)
+ return queued;
+
+@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_urg(sk, skb, th);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
++ if (mptcp(tp) && is_master_tp(tp))
++ bh_unlock_sock(sk);
+ return 0;
+ }
+
+@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ synack_stamp = tp->lsndtime;
+ /* Make sure socket is routed, for correct metrics. */
+ icsk->icsk_af_ops->rebuild_header(sk);
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ tcp_mtup_init(sk);
+ tp->copied_seq = tp->rcv_nxt;
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+ }
+ smp_mb();
+ tcp_set_state(sk, TCP_ESTABLISHED);
+@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ if (tp->rx_opt.tstamp_ok)
+ tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
++ if (mptcp(tp))
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
+
+ if (req) {
+ /* Re-arm the timer because data may have been sent out.
+@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ tcp_initialize_rcv_mss(sk);
+ tcp_fast_path_on(tp);
++ /* Send an ACK when establishing a new
++ * MPTCP subflow, i.e. using an MP_JOIN
++ * subtype.
++ */
++ if (mptcp(tp) && !is_master_tp(tp))
++ tcp_send_ack(sk);
+ break;
+
+ case TCP_FIN_WAIT1: {
+@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tmo = tcp_fin_time(sk);
+ if (tmo > TCP_TIMEWAIT_LEN) {
+ inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
+- } else if (th->fin || sock_owned_by_user(sk)) {
++ } else if (th->fin || mptcp_is_data_fin(skb) ||
++ sock_owned_by_user(sk)) {
+ /* Bad case. We could lose such FIN otherwise.
+ * It is not a big problem, but it looks confusing
+ * and not so rare event. We still can lose it now,
+@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ inet_csk_reset_keepalive_timer(sk, tmo);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto discard;
+ }
+ break;
+@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_CLOSING:
+ if (tp->snd_una == tp->write_seq) {
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ goto discard;
+ }
+ break;
+@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ goto discard;
+ }
+ break;
++ case TCP_CLOSE:
++ if (tp->mp_killed)
++ goto discard;
+ }
+
+ /* step 6: check the URG bit */
+@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ if (sk->sk_shutdown & RCV_SHUTDOWN) {
+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
+- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp(tp)) {
++			/* In the MPTCP case, the reset is handled by
++			 * mptcp_rcv_state_process().
++			 */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+ tcp_reset(sk);
+ return 1;
+@@ -5877,3 +6041,154 @@ discard:
+ return 0;
+ }
+ EXPORT_SYMBOL(tcp_rcv_state_process);
++
++static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ if (family == AF_INET)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
++ &ireq->ir_rmt_addr, port);
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (family == AF_INET6)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
++ &ireq->ir_v6_rmt_addr, port);
++#endif
++}
++
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_options_received tmp_opt;
++ struct request_sock *req;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct dst_entry *dst = NULL;
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false, fastopen;
++ struct flowi fl;
++ struct tcp_fastopen_cookie foc = { .len = -1 };
++ int err;
++
++
++ /* TW buckets are converted to open requests without
++ * limitations, they conserve resources and peer is
++ * evidently real one.
++ */
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++
++ /* Accept backlog is full. If we have already queued enough
++ * of warm entries in syn queue, drop request. It is better than
++ * clogging syn queue with openreqs with exponentially increasing
++ * timeout.
++ */
++ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
++ goto drop;
++ }
++
++ req = inet_reqsk_alloc(rsk_ops);
++ if (!req)
++ goto drop;
++
++ tcp_rsk(req)->af_specific = af_ops;
++
++ tcp_clear_options(&tmp_opt);
++ tmp_opt.mss_clamp = af_ops->mss_clamp;
++ tmp_opt.user_mss = tp->rx_opt.user_mss;
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
++
++ if (want_cookie && !tmp_opt.saw_tstamp)
++ tcp_clear_options(&tmp_opt);
++
++ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
++ tcp_openreq_init(req, &tmp_opt, skb);
++
++ if (af_ops->init_req(req, sk, skb))
++ goto drop_and_free;
++
++ if (security_inet_conn_request(sk, skb, req))
++ goto drop_and_free;
++
++ if (!want_cookie || tmp_opt.tstamp_ok)
++ TCP_ECN_create_request(req, skb, sock_net(sk));
++
++ if (want_cookie) {
++ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
++ req->cookie_ts = tmp_opt.tstamp_ok;
++ } else if (!isn) {
++ /* VJ's idea. We save last timestamp seen
++ * from the destination in peer table, when entering
++ * state TIME-WAIT, and check against it before
++ * accepting new connection request.
++ *
++ * If "isn" is not zero, this request hit alive
++ * timewait bucket, so that all the necessary checks
++ * are made in the function processing timewait state.
++ */
++ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
++ bool strict;
++
++ dst = af_ops->route_req(sk, &fl, req, &strict);
++ if (dst && strict &&
++ !tcp_peer_is_proven(req, dst, true)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
++ goto drop_and_release;
++ }
++ }
++ /* Kill the following clause, if you dislike this way. */
++ else if (!sysctl_tcp_syncookies &&
++ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
++ (sysctl_max_syn_backlog >> 2)) &&
++ !tcp_peer_is_proven(req, dst, false)) {
++ /* Without syncookies last quarter of
++ * backlog is filled with destinations,
++ * proven to be alive.
++ * It means that we continue to communicate
++ * to destinations, already remembered
++ * to the moment of synflood.
++ */
++ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
++ rsk_ops->family);
++ goto drop_and_release;
++ }
++
++ isn = af_ops->init_seq(skb);
++ }
++ if (!dst) {
++ dst = af_ops->route_req(sk, &fl, req, NULL);
++ if (!dst)
++ goto drop_and_free;
++ }
++
++ tcp_rsk(req)->snt_isn = isn;
++ tcp_openreq_init_rwin(req, sk, dst);
++ fastopen = !want_cookie &&
++ tcp_try_fastopen(sk, skb, req, &foc, dst);
++ err = af_ops->send_synack(sk, dst, &fl, req,
++ skb_get_queue_mapping(skb), &foc);
++ if (!fastopen) {
++ if (err || want_cookie)
++ goto drop_and_free;
++
++ tcp_rsk(req)->listener = NULL;
++ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
++ }
++
++ return 0;
++
++drop_and_release:
++ dst_release(dst);
++drop_and_free:
++ reqsk_free(req);
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++EXPORT_SYMBOL(tcp_conn_request);
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..c77017f600f1 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -67,6 +67,8 @@
+ #include <net/icmp.h>
+ #include <net/inet_hashtables.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/transp_v6.h>
+ #include <net/ipv6.h>
+ #include <net/inet_common.h>
+@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
+ struct inet_hashinfo tcp_hashinfo;
+ EXPORT_SYMBOL(tcp_hashinfo);
+
+-static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr,
+@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ struct inet_sock *inet;
+ const int type = icmp_hdr(icmp_skb)->type;
+ const int code = icmp_hdr(icmp_skb)->code;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ struct sk_buff *skb;
+ struct request_sock *fastopen;
+ __u32 seq, snd_una;
+@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ return;
+ }
+
+- bh_lock_sock(sk);
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
+ /* If too many ICMPs get dropped on busy
+ * servers this needs to be solved differently.
+ * We do take care of PMTU discovery (RFC1191) special case :
+ * we can receive locally generated ICMP messages while socket is held.
+ */
+- if (sock_owned_by_user(sk)) {
++ if (sock_owned_by_user(meta_sk)) {
+ if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+ }
+@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ icsk = inet_csk(sk);
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ goto out;
+
+ tp->mtu_info = info;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_v4_mtu_reduced(sk);
+ } else {
+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+ goto out;
+ }
+@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ !icsk->icsk_backoff || fastopen)
+ break;
+
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ break;
+
+ icsk->icsk_backoff--;
+@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet_csk_search_req(sk, &prev, th->dest,
+@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+
+ sk->sk_error_report(sk);
+@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ */
+
+ inet = inet_sk(sk);
+- if (!sock_owned_by_user(sk) && inet->recverr) {
++ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else { /* Only an error on timeout */
+@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
+ * Exception: precedence violation. We do not implement it in any case.
+ */
+
+-static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -702,10 +711,10 @@ release_sk1:
+ outside socket context is ugly, certainly. What can I do?
+ */
+
+-static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key,
+- int reply_flags, u8 tos)
++ int reply_flags, u8 tos, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ #ifdef CONFIG_TCP_MD5SIG
+ + (TCPOLEN_MD5SIG_ALIGNED >> 2)
+ #endif
++#ifdef CONFIG_MPTCP
++ + ((MPTCP_SUB_LEN_DSS >> 2) +
++ (MPTCP_SUB_LEN_ACK >> 2))
++#endif
+ ];
+ } rep;
+ struct ip_reply_arg arg;
+@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ ip_hdr(skb)->daddr, &rep.th);
+ }
+ #endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ int offset = (tsecr) ? 3 : 0;
++ /* Construction of 32-bit data_ack */
++ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ rep.opt[offset] = htonl(data_ack);
++
++ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++ rep.th.doff = arg.iov[0].iov_len / 4;
++ }
++#endif /* CONFIG_MPTCP */
++
+ arg.flags = reply_flags;
+ arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr, /* XXX */
+@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
++
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+
+ tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent,
+ tw->tw_bound_dev_if,
+ tcp_twsk_md5_key(tcptw),
+ tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- tw->tw_tos
++ tw->tw_tos, mptcp
+ );
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
++ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
+ tcp_time_stamp,
+ req->ts_recent,
+ 0,
+ tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
+ AF_INET),
+ inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- ip_hdr(skb)->tos);
++ ip_hdr(skb)->tos, 0);
+ }
+
+ /*
+@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+ * This still operates on a request_sock only, not on a big
+ * socket.
+ */
+-static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ const struct inet_request_sock *ireq = inet_rsk(req);
+ struct flowi4 fl4;
+@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+ return err;
+ }
+
+-static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
+-{
+- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
+-
+- if (!res) {
+- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+- }
+- return res;
+-}
+-
+ /*
+ * IPv4 request_sock destructor.
+ */
+-static void tcp_v4_reqsk_destructor(struct request_sock *req)
++void tcp_v4_reqsk_destructor(struct request_sock *req)
+ {
+ kfree(inet_rsk(req)->opt);
+ }
+@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
+ /*
+ * Save and compile IPv4 options into the request_sock if needed.
+ */
+-static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
+ {
+ const struct ip_options *opt = &(IPCB(skb)->opt);
+ struct ip_options_rcu *dopt = NULL;
+@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+
+ #endif
+
++static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
++ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
++ ireq->no_srccheck = inet_sk(sk)->transparent;
++ ireq->opt = tcp_v4_save_options(skb);
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
++
++ if (strict) {
++ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
++ *strict = true;
++ else
++ *strict = false;
++ }
++
++ return dst;
++}
++
+ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
+ .family = PF_INET,
+ .obj_size = sizeof(struct tcp_request_sock),
+- .rtx_syn_ack = tcp_v4_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v4_reqsk_send_ack,
+ .destructor = tcp_v4_reqsk_destructor,
+ .send_reset = tcp_v4_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
++ .mss_clamp = TCP_MSS_DEFAULT,
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+ .md5_lookup = tcp_v4_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v4_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v4_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v4_init_sequence,
++#endif
++ .route_req = tcp_v4_route_req,
++ .init_seq = tcp_v4_init_sequence,
++ .send_synack = tcp_v4_send_synack,
++ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
++};
+
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct tcp_sock *tp = tcp_sk(sk);
+- struct dst_entry *dst = NULL;
+- __be32 saddr = ip_hdr(skb)->saddr;
+- __be32 daddr = ip_hdr(skb)->daddr;
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- bool want_cookie = false, fastopen;
+- struct flowi4 fl4;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- int err;
+-
+ /* Never answer to SYNs send to broadcast or multicast */
+ if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+ goto drop;
+
+- /* TW buckets are converted to open requests without
+- * limitations, they conserve resources and peer is
+- * evidently real one.
+- */
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- /* Accept backlog is full. If we have already queued enough
+- * of warm entries in syn queue, drop request. It is better than
+- * clogging syn queue with openreqs with exponentially increasing
+- * timeout.
+- */
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet_reqsk_alloc(&tcp_request_sock_ops);
+- if (!req)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
+-
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
++ return tcp_conn_request(&tcp_request_sock_ops,
++ &tcp_request_sock_ipv4_ops, sk, skb);
+
+- ireq = inet_rsk(req);
+- ireq->ir_loc_addr = daddr;
+- ireq->ir_rmt_addr = saddr;
+- ireq->no_srccheck = inet_sk(sk)->transparent;
+- ireq->opt = tcp_v4_save_options(skb);
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_free;
+-
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- if (want_cookie) {
+- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- } else if (!isn) {
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
+- fl4.daddr == saddr) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
+- &saddr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v4_init_sequence(skb);
+- }
+- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v4_send_synack(sk, dst, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_rsk(req)->listener = NULL;
+- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+-
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0;
+@@ -1497,7 +1433,7 @@ put_and_exit:
+ }
+ EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
+
+-static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcphdr *th = tcp_hdr(skb);
+ const struct iphdr *iph = ip_hdr(skb);
+@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++			/* Don't lock the meta-sk again. It was already
++			 * locked before mptcp_v4_do_rcv().
++			 */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
++
+ }
+ inet_twsk_put(inet_twsk(nsk));
+ return NULL;
+@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v4_do_rcv(sk, skb);
++
+ if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
+ struct dst_entry *dst = sk->sk_rx_dst;
+
+@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
+ } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
+ wake_up_interruptible_sync_poll(sk_sleep(sk),
+ POLLIN | POLLRDNORM | POLLRDBAND);
+- if (!inet_csk_ack_scheduled(sk))
++ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
+ (3 * tcp_rto_min(sk)) / 4,
+ TCP_RTO_MAX);
+@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ {
+ const struct iphdr *iph;
+ const struct tcphdr *th;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff * 4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1759,11 +1729,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1771,16 +1751,16 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v4_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+
+@@ -1835,6 +1815,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
+@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
+
+ tcp_cleanup_congestion_control(sk);
+
++ if (mptcp(tp))
++ mptcp_destroy_sock(sk);
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++
+ /* Cleanup up the write buffer. */
+ tcp_write_queue_purge(sk);
+
+@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
+ }
+ #endif /* CONFIG_PROC_FS */
+
++#ifdef CONFIG_MPTCP
++static void tcp_v4_clear_sk(struct sock *sk, int size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++	/* We do not want to clear the tk_table field, because of RCU lookups. */
++ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
++
++ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
++}
++#endif
++
+ struct proto tcp_prot = {
+ .name = "TCP",
+ .owner = THIS_MODULE,
+@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
+ .destroy_cgroup = tcp_destroy_cgroup,
+ .proto_cgroup = tcp_proto_cgroup,
+ #endif
++#ifdef CONFIG_MPTCP
++ .clear_sk = tcp_v4_clear_sk,
++#endif
+ };
+ EXPORT_SYMBOL(tcp_prot);
+
+diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
+index e68e0d4af6c9..ae6946857dff 100644
+--- a/net/ipv4/tcp_minisocks.c
++++ b/net/ipv4/tcp_minisocks.c
+@@ -18,11 +18,13 @@
+ * Jorge Cwik, <jorge@laser.satlink.net>
+ */
+
++#include <linux/kconfig.h>
+ #include <linux/mm.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/sysctl.h>
+ #include <linux/workqueue.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/inet_common.h>
+ #include <net/xfrm.h>
+@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ struct tcp_options_received tmp_opt;
+ struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
+ bool paws_reject = false;
++ struct mptcp_options_received mopt;
+
+ tmp_opt.saw_tstamp = 0;
+ if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ mptcp_init_mp_opt(&mopt);
++
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
+@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
+ paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
+ }
++
++ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
++ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
++ goto kill_with_rst;
++ }
+ }
+
+ if (tw->tw_substate == TCP_FIN_WAIT2) {
+@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ if (!th->ack ||
+ !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
+ TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
++ /* If mptcp_is_data_fin() returns true, we are sure that
++ * mopt has been initialized - otherwise it would not
++ * be a DATA_FIN.
++ */
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
++ mptcp_is_data_fin(skb) &&
++ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
++ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
++ return TCP_TW_ACK;
++
+ inet_twsk_put(tw);
+ return TCP_TW_SUCCESS;
+ }
+@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+ tcptw->tw_ts_offset = tp->tsoffset;
+
++ if (mptcp(tp)) {
++ if (mptcp_init_tw_sock(sk, tcptw)) {
++ inet_twsk_free(tw);
++ goto exit;
++ }
++ } else {
++ tcptw->mptcp_tw = NULL;
++ }
++
+ #if IS_ENABLED(CONFIG_IPV6)
+ if (tw->tw_family == PF_INET6) {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
+ }
+
++exit:
+ tcp_update_metrics(sk);
+ tcp_done(sk);
+ }
+
+ void tcp_twsk_destructor(struct sock *sk)
+ {
+-#ifdef CONFIG_TCP_MD5SIG
+ struct tcp_timewait_sock *twsk = tcp_twsk(sk);
+
++ if (twsk->mptcp_tw)
++ mptcp_twsk_destructor(twsk);
++#ifdef CONFIG_TCP_MD5SIG
+ if (twsk->tw_md5_key)
+ kfree_rcu(twsk->tw_md5_key, rcu);
+ #endif
+@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
+ req->window_clamp = tcp_full_space(sk);
+
+ /* tcp_full_space because it is guaranteed to be the first packet */
+- tcp_select_initial_window(tcp_full_space(sk),
+- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
++ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
+ &req->rcv_wnd,
+ &req->window_clamp,
+ ireq->wscale_ok,
+ &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ dst_metric(dst, RTAX_INITRWND), sk);
+ ireq->rcv_wscale = rcv_wscale;
+ }
+ EXPORT_SYMBOL(tcp_openreq_init_rwin);
+@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
+ newtp->rx_opt.ts_recent_stamp = 0;
+ newtp->tcp_header_len = sizeof(struct tcphdr);
+ }
++ if (ireq->saw_mpc)
++ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
+ newtp->tsoffset = 0;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->md5sig_info = NULL; /*XXX*/
+@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ bool fastopen)
+ {
+ struct tcp_options_received tmp_opt;
++ struct mptcp_options_received mopt;
+ struct sock *child;
+ const struct tcphdr *th = tcp_hdr(skb);
+ __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
+ bool paws_reject = false;
+
+- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
++ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
+
+ tmp_opt.saw_tstamp = 0;
++
++ mptcp_init_mp_opt(&mopt);
++
+ if (th->doff > (sizeof(struct tcphdr)>>2)) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.ts_recent = req->ts_recent;
+@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ *
+ * Reset timer after retransmitting SYNACK, similar to
+ * the idea of fast retransmit in recovery.
++ *
++ * Fall back to TCP if MP_CAPABLE is not set.
+ */
++
++ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
++ inet_rsk(req)->saw_mpc = false;
++
++
+ if (!inet_rtx_syn_ack(sk, req))
+ req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
+ TCP_RTO_MAX) + jiffies;
+@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ * socket is created, wait for troubles.
+ */
+ child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
++
+ if (child == NULL)
+ goto listen_overflow;
+
++ if (!is_meta_sk(sk)) {
++ int ret = mptcp_check_req_master(sk, child, req, prev);
++ if (ret < 0)
++ goto listen_overflow;
++
++ /* MPTCP-supported */
++ if (!ret)
++ return tcp_sk(child)->mpcb->master_sk;
++ } else {
++ return mptcp_check_req_child(sk, child, req, prev, &mopt);
++ }
+ inet_csk_reqsk_queue_unlink(sk, req, prev);
+ inet_csk_reqsk_queue_removed(sk, req);
+
+@@ -746,7 +804,17 @@ embryonic_reset:
+ tcp_reset(sk);
+ }
+ if (!fastopen) {
+- inet_csk_reqsk_queue_drop(sk, req, prev);
++ if (is_meta_sk(sk)) {
++ /* We want to avoid stopping the keepalive-timer and so
++ * avoid ending up in inet_csk_reqsk_queue_removed ...
++ */
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
++ mptcp_delete_synack_timer(sk);
++ reqsk_free(req);
++ } else {
++ inet_csk_reqsk_queue_drop(sk, req, prev);
++ }
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ }
+ return NULL;
+@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ {
+ int ret = 0;
+ int state = child->sk_state;
++ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
+
+- if (!sock_owned_by_user(child)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
+ skb->len);
+ /* Wakeup parent, send SIGIO */
+@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ * in main socket hash table and lock on listening
+ * socket does not protect us more.
+ */
+- __sk_add_backlog(child, skb);
++ if (mptcp(tcp_sk(child)))
++ skb->sk = child;
++ __sk_add_backlog(meta_sk, skb);
+ }
+
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ return ret;
+ }
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..efd31b6c5784 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -36,6 +36,12 @@
+
+ #define pr_fmt(fmt) "TCP: " fmt
+
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++#include <net/ipv6.h>
+ #include <net/tcp.h>
+
+ #include <linux/compiler.h>
+@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+ unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+ EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+- int push_one, gfp_t gfp);
+-
+ /* Account for new data that has been sent to the network. */
+-static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
+ void tcp_select_initial_window(int __space, __u32 mss,
+ __u32 *rcv_wnd, __u32 *window_clamp,
+ int wscale_ok, __u8 *rcv_wscale,
+- __u32 init_rcv_wnd)
++ __u32 init_rcv_wnd, const struct sock *sk)
+ {
+ unsigned int space = (__space < 0 ? 0 : __space);
+
+@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
+ * value can be stuffed directly into th->window for an outgoing
+ * frame.
+ */
+-static u16 tcp_select_window(struct sock *sk)
++u16 tcp_select_window(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 old_win = tp->rcv_wnd;
+- u32 cur_win = tcp_receive_window(tp);
+- u32 new_win = __tcp_select_window(sk);
++ /* The window must never shrink at the meta-level. At the subflow level
++ * we have to allow this. Otherwise we may announce a window too large
++ * for the current meta-level sk_rcvbuf.
++ */
++ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
++ u32 new_win = tp->ops->__select_window(sk);
+
+ /* Never shrink the offered window */
+ if (new_win < cur_win) {
+@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ }
++
+ tp->rcv_wnd = new_win;
+ tp->rcv_wup = tp->rcv_nxt;
+
+@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
+ /* Constructs common control bits of non-data skb. If SYN/FIN is present,
+ * auto increment end seqno.
+ */
+-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ TCP_SKB_CB(skb)->end_seq = seq;
+ }
+
+-static inline bool tcp_urg_mode(const struct tcp_sock *tp)
++bool tcp_urg_mode(const struct tcp_sock *tp)
+ {
+ return tp->snd_una != tp->snd_up;
+ }
+@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
+ #define OPTION_MD5 (1 << 2)
+ #define OPTION_WSCALE (1 << 3)
+ #define OPTION_FAST_OPEN_COOKIE (1 << 8)
+-
+-struct tcp_out_options {
+- u16 options; /* bit field of OPTION_* */
+- u16 mss; /* 0 to disable */
+- u8 ws; /* window scale, 0 to disable */
+- u8 num_sack_blocks; /* number of SACK blocks to include */
+- u8 hash_size; /* bytes in hash_location */
+- __u8 *hash_location; /* temporary pointer, overloaded */
+- __u32 tsval, tsecr; /* need to include OPTION_TS */
+- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
+-};
++/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
+
+ /* Write previously computed TCP options to the packet.
+ *
+@@ -430,7 +428,7 @@ struct tcp_out_options {
+ * (but it may well be that other scenarios fail similarly).
+ */
+ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+- struct tcp_out_options *opts)
++ struct tcp_out_options *opts, struct sk_buff *skb)
+ {
+ u16 options = opts->options; /* mungable copy */
+
+@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+ }
+ ptr += (foc->len + 3) >> 2;
+ }
++
++ if (unlikely(OPTION_MPTCP & opts->options))
++ mptcp_options_write(ptr, tp, opts, skb);
+ }
+
+ /* Compute TCP options for SYN packets. This is not the final
+@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
+ if (unlikely(!(OPTION_TS & opts->options)))
+ remaining -= TCPOLEN_SACKPERM_ALIGNED;
+ }
++ if (tp->request_mptcp || mptcp(tp))
++ mptcp_syn_options(sk, opts, &remaining);
+
+ if (fastopen && fastopen->cookie.len >= 0) {
+ u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
+ }
+ }
+
++ if (ireq->saw_mpc)
++ mptcp_synack_options(req, opts, &remaining);
++
+ return MAX_TCP_OPTION_SPACE - remaining;
+ }
+
+@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
+ opts->tsecr = tp->rx_opt.ts_recent;
+ size += TCPOLEN_TSTAMP_ALIGNED;
+ }
++ if (mptcp(tp))
++ mptcp_established_options(sk, skb, opts, &size);
+
+ eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
+ if (unlikely(eff_sacks)) {
+- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+- opts->num_sack_blocks =
+- min_t(unsigned int, eff_sacks,
+- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+- TCPOLEN_SACK_PERBLOCK);
+- size += TCPOLEN_SACK_BASE_ALIGNED +
+- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
++ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
++ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
++ opts->num_sack_blocks = 0;
++ else
++ opts->num_sack_blocks =
++ min_t(unsigned int, eff_sacks,
++ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
++ TCPOLEN_SACK_PERBLOCK);
++ if (opts->num_sack_blocks)
++ size += TCPOLEN_SACK_BASE_ALIGNED +
++ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ }
+
+ return size;
+@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
+ if ((1 << sk->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
+ TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
+- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+- 0, GFP_ATOMIC);
++ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
++ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
+ }
+ /*
+ * One tasklet per cpu tries to send more skbs.
+@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
+ unsigned long flags;
+ struct list_head *q, *n;
+ struct tcp_sock *tp;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+
+ local_irq_save(flags);
+ list_splice_init(&tsq->head, &list);
+@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
+ list_del(&tp->tsq_node);
+
+ sk = (struct sock *)tp;
+- bh_lock_sock(sk);
++ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ bh_lock_sock(meta_sk);
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_tsq_handler(sk);
++ if (mptcp(tp))
++ tcp_tsq_handler(meta_sk);
+ } else {
++ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
++ goto exit;
++
+ /* defer the work to tcp_release_cb() */
+ set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
++
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++exit:
++ bh_unlock_sock(meta_sk);
+
+ clear_bit(TSQ_QUEUED, &tp->tsq_flags);
+ sk_free(sk);
+@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
+ #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
+ (1UL << TCP_WRITE_TIMER_DEFERRED) | \
+ (1UL << TCP_DELACK_TIMER_DEFERRED) | \
+- (1UL << TCP_MTU_REDUCED_DEFERRED))
++ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
++ (1UL << MPTCP_PATH_MANAGER) | \
++ (1UL << MPTCP_SUB_DEFERRED))
++
+ /**
+ * tcp_release_cb - tcp release_sock() callback
+ * @sk: socket
+@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
+ sk->sk_prot->mtu_reduced(sk);
+ __sock_put(sk);
+ }
++ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
++ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
++ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
++ __sock_put(sk);
++ }
++ if (flags & (1UL << MPTCP_SUB_DEFERRED))
++ mptcp_tsq_sub_deferred(sk);
+ }
+ EXPORT_SYMBOL(tcp_release_cb);
+
+@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
+ * We are working here with either a clone of the original
+ * SKB, or a fresh unique copy made by the retransmit engine.
+ */
+-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+- gfp_t gfp_mask)
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask)
+ {
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+ struct inet_sock *inet;
+@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ */
+ th->window = htons(min(tp->rcv_wnd, 65535U));
+ } else {
+- th->window = htons(tcp_select_window(sk));
++ th->window = htons(tp->ops->select_window(sk));
+ }
+ th->check = 0;
+ th->urg_ptr = 0;
+@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ }
+ }
+
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
+ TCP_ECN_send(sk, skb, tcp_header_size);
+
+@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
+ * otherwise socket can stall.
+ */
+-static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ }
+
+ /* Initialize TSO segments for a packet. */
+-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ /* Make sure we own this skb before messing gso_size/gso_segs */
+ WARN_ON_ONCE(skb_cloned(skb));
+
+- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
++ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
++ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
+ /* Avoid the costly divide in the normal
+ * non-TSO case.
+ */
+@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
+ /* Pcount in the middle of the write queue got changed, we need to do various
+ * tweaks to fix counters
+ */
+-static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
+ * eventually). The difference is that pulled data not copied, but
+ * immediately discarded.
+ */
+-static void __pskb_trim_head(struct sk_buff *skb, int len)
++void __pskb_trim_head(struct sk_buff *skb, int len)
+ {
+ struct skb_shared_info *shinfo;
+ int i, k, eat;
+@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
+ /* Remove acked data from a packet in the transmit queue. */
+ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ {
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
++ return mptcp_trim_head(sk, skb, len);
++
+ if (skb_unclone(skb, GFP_ATOMIC))
+ return -ENOMEM;
+
+@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ if (tcp_skb_pcount(skb) > 1)
+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
+
++#ifdef CONFIG_MPTCP
++ /* Some data got acked - we assume that the seq-number reached the dest.
++ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
++ * Only remove the SEQ if the call does not come from a meta retransmit.
++ */
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
++#endif
++
+ return 0;
+ }
+
+@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
+
+ return mss_now;
+ }
++EXPORT_SYMBOL(tcp_current_mss);
+
+ /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
+ * As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
+ * But we can avoid doing the divide again given we already have
+ * skb_pcount = skb->len / mss_now
+ */
+-static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+- const struct sk_buff *skb)
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb)
+ {
+ if (skb->len < tcp_skb_pcount(skb) * mss_now)
+ tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
+ (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
+ }
+ /* Returns the portion of skb which can be sent right away */
+-static unsigned int tcp_mss_split_point(const struct sock *sk,
+- const struct sk_buff *skb,
+- unsigned int mss_now,
+- unsigned int max_segs,
+- int nonagle)
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u32 partial, needed, window, max_len;
+@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
+ /* Can at least one segment of SKB be sent right now, according to the
+ * congestion window rules? If so, return how many segments are allowed.
+ */
+-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb)
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
++ const struct sk_buff *skb)
+ {
+ u32 in_flight, cwnd;
+
+ /* Don't be strict about the congestion window for the final FIN. */
+- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
++ if (skb &&
++ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
+ tcp_skb_pcount(skb) == 1)
+ return 1;
+
+@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+ * This must be invoked the first time we consider transmitting
+ * SKB onto the wire.
+ */
+-static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ int tso_segs = tcp_skb_pcount(skb);
+
+@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ /* Return true if the Nagle test allows this packet to be
+ * sent now.
+ */
+-static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
+- unsigned int cur_mss, int nonagle)
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle)
+ {
+ /* Nagle rule does not apply to frames, which sit in the middle of the
+ * write_queue (they have no chances to get new data).
+@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ return true;
+
+ /* Don't use the nagle rule for urgent data (or for the final FIN). */
+- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
++ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
++ mptcp_is_data_fin(skb))
+ return true;
+
+ if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
+@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ }
+
+ /* Does at least the first segment of SKB fit into the send window? */
+-static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb,
+- unsigned int cur_mss)
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss)
+ {
+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
+
+@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
+ u32 send_win, cong_win, limit, in_flight;
+ int win_divisor;
+
+- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
++ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
+ goto send_now;
+
+ if (icsk->icsk_ca_state != TCP_CA_Open)
+@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
+ * Returns true, if no segments are in flight and we have queued segments,
+ * but cannot send anything now because of SWS or another problem.
+ */
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ int push_one, gfp_t gfp)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+
+ sent_pkts = 0;
+
+- if (!push_one) {
++ /* PMTU probing is not yet supported with MPTCP. It should be possible
++ * by exiting the loop inside tcp_mtu_probe early, making sure that
++ * only a single DSS-mapping gets probed.
++ */
++ if (!push_one && !mptcp(tp)) {
+ /* Do MTU probing. */
+ result = tcp_mtu_probe(sk);
+ if (!result) {
+@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+ int err = -1;
+
+ if (tcp_send_head(sk) != NULL) {
+- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
++ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
++ GFP_ATOMIC);
+ goto rearm_timer;
+ }
+
+@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+ if (unlikely(sk->sk_state == TCP_CLOSE))
+ return;
+
+- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
+- sk_gfp_atomic(sk, GFP_ATOMIC)))
++ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
++ sk_gfp_atomic(sk, GFP_ATOMIC)))
+ tcp_check_probe_timer(sk);
+ }
+
+@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+
+ BUG_ON(!skb || skb->len < mss_now);
+
+- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
++ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
++ sk->sk_allocation);
+ }
+
+ /* This function returns the amount that we can raise the
+@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
+ return;
+
++ /* Currently not supported for MPTCP - but it should be possible */
++ if (mptcp(tp))
++ return;
++
+ tcp_for_write_queue_from_safe(skb, tmp, sk) {
+ if (!tcp_can_collapse(sk, skb))
+ break;
+@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+
+ /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
+ th->window = htons(min(req->rcv_wnd, 65535U));
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ th->doff = (tcp_header_size >> 2);
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
+
+@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+ (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
+ tp->window_clamp = tcp_full_space(sk);
+
+- tcp_select_initial_window(tcp_full_space(sk),
+- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
+- &tp->rcv_wnd,
+- &tp->window_clamp,
+- sysctl_tcp_window_scaling,
+- &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
++ &tp->rcv_wnd,
++ &tp->window_clamp,
++ sysctl_tcp_window_scaling,
++ &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ tp->rx_opt.rcv_wscale = rcv_wscale;
+ tp->rcv_ssthresh = tp->rcv_wnd;
+@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+ inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_retransmits = 0;
+ tcp_clear_retrans(tp);
++
++#ifdef CONFIG_MPTCP
++ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
++ if (is_master_tp(tp)) {
++ tp->request_mptcp = 1;
++ mptcp_connect_init(sk);
++ } else if (tp->mptcp) {
++ struct inet_sock *inet = inet_sk(sk);
++
++ tp->mptcp->snt_isn = tp->write_seq;
++ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
++
++ /* Set nonce for new subflows */
++ if (sk->sk_family == AF_INET)
++ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
++ inet->inet_saddr,
++ inet->inet_daddr,
++ inet->inet_sport,
++ inet->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
++ inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ inet->inet_sport,
++ inet->inet_dport);
++#endif
++ }
++ }
++#endif
+ }
+
+ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+ TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
+ }
++EXPORT_SYMBOL(tcp_send_ack);
+
+ /* This routine sends a packet with an out of date sequence
+ * number. It assumes the other end will try to ack it.
+@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+ * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
+ * out-of-date with SND.UNA-1 to probe window.
+ */
+-static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
++int tcp_xmit_probe_skb(struct sock *sk, int urgent)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
+@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+ struct tcp_sock *tp = tcp_sk(sk);
+ int err;
+
+- err = tcp_write_wakeup(sk);
++ err = tp->ops->write_wakeup(sk);
+
+ if (tp->packets_out || !tcp_send_head(sk)) {
+ /* Cancel probe timer, if it is not required. */
+@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+ TCP_RTO_MAX);
+ }
+ }
++
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
++{
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
++ int res;
++
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
++ if (!res) {
++ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
++ }
++ return res;
++}
++EXPORT_SYMBOL(tcp_rtx_synack);
+diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
+index 286227abed10..966b873cbf3e 100644
+--- a/net/ipv4/tcp_timer.c
++++ b/net/ipv4/tcp_timer.c
+@@ -20,6 +20,7 @@
+
+ #include <linux/module.h>
+ #include <linux/gfp.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+
+ int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
+@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+ int sysctl_tcp_orphan_retries __read_mostly;
+ int sysctl_tcp_thin_linear_timeouts __read_mostly;
+
+-static void tcp_write_err(struct sock *sk)
++void tcp_write_err(struct sock *sk)
+ {
+ sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
+ sk->sk_error_report(sk);
+@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
+ (!tp->snd_wnd && !tp->packets_out))
+ do_reset = 1;
+ if (do_reset)
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_done(sk);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
+ return 1;
+@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+ * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
+ * syn_set flag is set.
+ */
+-static bool retransmits_timed_out(struct sock *sk,
+- unsigned int boundary,
+- unsigned int timeout,
+- bool syn_set)
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set)
+ {
+ unsigned int linear_backoff_thresh, start_ts;
+ unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
+ }
+
+ /* A write timeout has occurred. Process the after effects. */
+-static int tcp_write_timeout(struct sock *sk)
++int tcp_write_timeout(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
+ }
+ retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+ syn_set = true;
++ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
++ if (tcp_sk(sk)->request_mptcp &&
++ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
++ tcp_sk(sk)->request_mptcp = 0;
+ } else {
+ if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
+ /* Black hole detection */
+@@ -251,18 +254,22 @@ out:
+ static void tcp_delack_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_delack_timer_handler(sk);
+ } else {
+ inet_csk(sk)->icsk_ack.blocked = 1;
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -479,6 +486,10 @@ out_reset_timer:
+ __sk_dst_reset(sk);
+
+ out:;
++ if (mptcp(tp)) {
++ mptcp_reinject_data(sk, 1);
++ mptcp_set_rto(sk);
++ }
+ }
+
+ void tcp_write_timer_handler(struct sock *sk)
+@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
+ break;
+ case ICSK_TIME_RETRANS:
+ icsk->icsk_pending = 0;
+- tcp_retransmit_timer(sk);
++ tcp_sk(sk)->ops->retransmit_timer(sk);
+ break;
+ case ICSK_TIME_PROBE0:
+ icsk->icsk_pending = 0;
+@@ -520,16 +531,19 @@ out:
+ static void tcp_write_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_write_timer_handler(sk);
+ } else {
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
+ struct sock *sk = (struct sock *) data;
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+ u32 elapsed;
+
+ /* Only process if socket is not in use. */
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
+ /* Try again later. */
+ inet_csk_reset_keepalive_timer (sk, HZ/20);
+ goto out;
+@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
+ goto out;
+ }
+
++ if (tp->send_mp_fclose) {
++ /* MUST do this before tcp_write_timeout, because retrans_stamp
++ * may have been set to 0 in another part while we are
++ * retransmitting MP_FASTCLOSE. Then, we would crash, because
++ * retransmits_timed_out accesses the meta-write-queue.
++ *
++ * We make sure that the timestamp is != 0.
++ */
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk))
++ goto out;
++
++ tcp_send_ack(sk);
++ icsk->icsk_retransmits++;
++
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ elapsed = icsk->icsk_rto;
++ goto resched;
++ }
++
+ if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
+ if (tp->linger2 >= 0) {
+ const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
+
+ if (tmo > 0) {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto out;
+ }
+ }
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ goto death;
+ }
+
+@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
+ icsk->icsk_probes_out > 0) ||
+ (icsk->icsk_user_timeout == 0 &&
+ icsk->icsk_probes_out >= keepalive_probes(tp))) {
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_write_err(sk);
+ goto out;
+ }
+- if (tcp_write_wakeup(sk) <= 0) {
++ if (tp->ops->write_wakeup(sk) <= 0) {
+ icsk->icsk_probes_out++;
+ elapsed = keepalive_intvl_when(tp);
+ } else {
+@@ -642,7 +679,7 @@ death:
+ tcp_done(sk);
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..7139c2973fd2 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
+
+ kfree_rcu(ifp, rcu);
+ }
++EXPORT_SYMBOL(inet6_ifa_finish_destroy);
+
+ static void
+ ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
+index 7cb4392690dd..7057afbca4df 100644
+--- a/net/ipv6/af_inet6.c
++++ b/net/ipv6/af_inet6.c
+@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
+ return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
+ }
+
+-static int inet6_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct inet_sock *inet;
+ struct ipv6_pinfo *np;
+diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
+index a245e5ddffbd..99c892b8992d 100644
+--- a/net/ipv6/inet6_connection_sock.c
++++ b/net/ipv6/inet6_connection_sock.c
+@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
+ /*
+ * request_sock (formerly open request) hash tables.
+ */
+-static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize)
+ {
+ u32 c;
+
+diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
+index edb58aff4ae7..ea4d9fda0927 100644
+--- a/net/ipv6/ipv6_sockglue.c
++++ b/net/ipv6/ipv6_sockglue.c
+@@ -48,6 +48,8 @@
+ #include <net/addrconf.h>
+ #include <net/inet_common.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/xfrm.h>
+@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+ sock_prot_inuse_add(net, &tcp_prot, 1);
+ local_bh_enable();
+ sk->sk_prot = &tcp_prot;
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+ sk->sk_socket->ops = &inet_stream_ops;
+ sk->sk_family = PF_INET;
+ tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
+diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
+index a822b880689b..b2b38869d795 100644
+--- a/net/ipv6/syncookies.c
++++ b/net/ipv6/syncookies.c
+@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+
+ ret = NULL;
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
++ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
+ if (!req)
+ goto out;
+
+@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+ }
+
+ req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..fda94d71666e 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -63,6 +63,8 @@
+ #include <net/inet_common.h>
+ #include <net/secure_seq.h>
+ #include <net/tcp_memcontrol.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
+ #include <net/busy_poll.h>
+
+ #include <linux/proc_fs.h>
+@@ -71,12 +73,6 @@
+ #include <linux/crypto.h>
+ #include <linux/scatterlist.h>
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req);
+-
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
+-
+ static const struct inet_connection_sock_af_ops ipv6_mapped;
+ static const struct inet_connection_sock_af_ops ipv6_specific;
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+ }
+ #endif
+
+-static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct dst_entry *dst = skb_dst(skb);
+ const struct rt6_info *rt = (const struct rt6_info *)dst;
+@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+ }
+
+-static void tcp_v6_hash(struct sock *sk)
++void tcp_v6_hash(struct sock *sk)
+ {
+ if (sk->sk_state != TCP_CLOSE) {
+- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
++ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
++ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
+ tcp_prot.hash(sk);
+ return;
+ }
+@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
+ }
+ }
+
+-static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
+ ipv6_hdr(skb)->saddr.s6_addr32,
+@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ tcp_hdr(skb)->source);
+ }
+
+-static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ int addr_len)
+ {
+ struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
+@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ sin.sin_port = usin->sin6_port;
+ sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
+
+- icsk->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_mapped;
+ sk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+
+ if (err) {
+ icsk->icsk_ext_hdr_len = exthdrlen;
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+ sk->sk_backlog_rcv = tcp_v6_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_specific;
+@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+ const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
+ struct ipv6_pinfo *np;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ int err;
+ struct tcp_sock *tp;
+ struct request_sock *fastopen;
+@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ return;
+ }
+
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+
+ if (sk->sk_state == TCP_CLOSE)
+@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+
+ tp->mtu_info = ntohl(info);
+- if (!sock_owned_by_user(sk))
++ if (!sock_owned_by_user(meta_sk))
+ tcp_v6_mtu_reduced(sk);
+- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
++ else {
++ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
+ &tp->tsq_flags))
+- sock_hold(sk);
++ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
++ }
+ goto out;
+ }
+
+@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
+@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
+
+@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- if (!sock_owned_by_user(sk) && np->recverr) {
++ if (!sock_owned_by_user(meta_sk) && np->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else
+ sk->sk_err_soft = err;
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+
+-static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct flowi6 *fl6,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ struct inet_request_sock *ireq = inet_rsk(req);
+ struct ipv6_pinfo *np = inet6_sk(sk);
++ struct flowi6 *fl6 = &fl->u.ip6;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+
+@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+ skb_set_queue_mapping(skb, queue_mapping);
+ err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
+ err = net_xmit_eval(err);
++ if (!tcp_rsk(req)->snt_synack && !err)
++ tcp_rsk(req)->snt_synack = tcp_time_stamp;
+ }
+
+ done:
+ return err;
+ }
+
+-static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ {
+- struct flowi6 fl6;
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
+ int res;
+
+- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
+ if (!res) {
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ return res;
+ }
+
+-static void tcp_v6_reqsk_destructor(struct request_sock *req)
++void tcp_v6_reqsk_destructor(struct request_sock *req)
+ {
+ kfree_skb(inet_rsk(req)->pktopts);
+ }
+@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+ }
+ #endif
+
++static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++ struct ipv6_pinfo *np = inet6_sk(sk);
++
++ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
++ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
++
++ ireq->ir_iif = sk->sk_bound_dev_if;
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ /* So that link locals have meaning */
++ if (!sk->sk_bound_dev_if &&
++ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
++ ireq->ir_iif = inet6_iif(skb);
++
++ if (!TCP_SKB_CB(skb)->when &&
++ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
++ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
++ np->rxopt.bits.rxohlim || np->repflow)) {
++ atomic_inc(&skb->users);
++ ireq->pktopts = skb;
++ }
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ if (strict)
++ *strict = true;
++ return inet6_csk_route_req(sk, &fl->u.ip6, req);
++}
++
+ struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
+ .family = AF_INET6,
+ .obj_size = sizeof(struct tcp6_request_sock),
+- .rtx_syn_ack = tcp_v6_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v6_reqsk_send_ack,
+ .destructor = tcp_v6_reqsk_destructor,
+ .send_reset = tcp_v6_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
++ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
++ sizeof(struct ipv6hdr),
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
+ .md5_lookup = tcp_v6_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v6_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v6_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v6_init_sequence,
++#endif
++ .route_req = tcp_v6_route_req,
++ .init_seq = tcp_v6_init_sequence,
++ .send_synack = tcp_v6_send_synack,
++ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
++};
+
+-static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+- u32 tsval, u32 tsecr, int oif,
+- struct tcp_md5sig_key *key, int rst, u8 tclass,
+- u32 label)
++static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
++ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
++ int oif, struct tcp_md5sig_key *key, int rst,
++ u8 tclass, u32 label, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct tcphdr *t1;
+@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ if (key)
+ tot_len += TCPOLEN_MD5SIG_ALIGNED;
+ #endif
+-
++#ifdef CONFIG_MPTCP
++ if (mptcp)
++ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++#endif
+ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
+ GFP_ATOMIC);
+ if (buff == NULL)
+@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ tcp_v6_md5_hash_hdr((__u8 *)topt, key,
+ &ipv6_hdr(skb)->saddr,
+ &ipv6_hdr(skb)->daddr, t1);
++ topt += 4;
++ }
++#endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ /* Construction of 32-bit data_ack */
++ *topt++ = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ *topt++ = htonl(data_ack);
+ }
+ #endif
+
+@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ kfree_skb(buff);
+ }
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ u32 seq = 0, ack_seq = 0;
+@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ (th->doff << 2);
+
+ oif = sk ? sk->sk_bound_dev_if : 0;
+- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
++ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
+
+ #ifdef CONFIG_TCP_MD5SIG
+ release_sk1:
+@@ -902,45 +983,52 @@ release_sk1:
+ #endif
+ }
+
+-static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key, u8 tclass,
+- u32 label)
++ u32 label, int mptcp)
+ {
+- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
+- label);
++ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
++ key, 0, tclass, label, mptcp);
+ }
+
+ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
+
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+ tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
+- tw->tw_tclass, (tw->tw_flowlabel << 12));
++ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt,
++ tcp_rsk(req)->rcv_nxt, 0,
+ req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
+ tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
+- 0, 0);
++ 0, 0, 0);
+ }
+
+
+-static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct request_sock *req, **prev;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v6_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
+ }
+ inet_twsk_put(inet_twsk(nsk));
+@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ return sk;
+ }
+
+-/* FIXME: this is substantially similar to the ipv4 code.
+- * Can some kind of merge be done? -- erics
+- */
+-static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct ipv6_pinfo *np = inet6_sk(sk);
+- struct tcp_sock *tp = tcp_sk(sk);
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- struct dst_entry *dst = NULL;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- bool want_cookie = false, fastopen;
+- struct flowi6 fl6;
+- int err;
+-
+ if (skb->protocol == htons(ETH_P_IP))
+ return tcp_v4_conn_request(sk, skb);
+
+ if (!ipv6_unicast_destination(skb))
+ goto drop;
+
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
+- if (req == NULL)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
++ return tcp_conn_request(&tcp6_request_sock_ops,
++ &tcp_request_sock_ipv6_ops, sk, skb);
+
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
+-
+- ireq = inet_rsk(req);
+- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
+- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- ireq->ir_iif = sk->sk_bound_dev_if;
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- /* So that link locals have meaning */
+- if (!sk->sk_bound_dev_if &&
+- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
+- ireq->ir_iif = inet6_iif(skb);
+-
+- if (!isn) {
+- if (ipv6_opt_accepted(sk, skb) ||
+- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
+- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
+- np->repflow) {
+- atomic_inc(&skb->users);
+- ireq->pktopts = skb;
+- }
+-
+- if (want_cookie) {
+- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- goto have_isn;
+- }
+-
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
+- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v6_init_sequence(skb);
+- }
+-have_isn:
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_release;
+-
+- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v6_send_synack(sk, dst, &fl6, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->listener = NULL;
+- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0; /* don't send reset */
+ }
+
+-static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req,
+- struct dst_entry *dst)
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst)
+ {
+ struct inet_request_sock *ireq;
+ struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
+@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+
+ newsk->sk_v6_rcv_saddr = newnp->saddr;
+
+- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(newsk))
++ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
+ newsk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -1329,7 +1292,7 @@ out:
+ * This is because we cannot sleep with the original spinlock
+ * held.
+ */
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct tcp_sock *tp;
+@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v6_do_rcv(sk, skb);
++
+ if (sk_filter(sk, skb))
+ goto discard;
+
+@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ {
+ const struct tcphdr *th;
+ const struct ipv6hdr *hdr;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff*4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1529,11 +1520,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1541,16 +1542,17 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v6_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+ return ret ? -1 : 0;
+@@ -1607,6 +1609,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
+ }
+ }
+
+-static struct timewait_sock_ops tcp6_timewait_sock_ops = {
++struct timewait_sock_ops tcp6_timewait_sock_ops = {
+ .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
+ .twsk_unique = tcp_twsk_unique,
+ .twsk_destructor = tcp_twsk_destructor,
+@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
+@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
+ return 0;
+ }
+
+-static void tcp_v6_destroy_sock(struct sock *sk)
++void tcp_v6_destroy_sock(struct sock *sk)
+ {
+ tcp_v4_destroy_sock(sk);
+ inet6_destroy_sock(sk);
+@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
+ static void tcp_v6_clear_sk(struct sock *sk, int size)
+ {
+ struct inet_sock *inet = inet_sk(sk);
++#ifdef CONFIG_MPTCP
++ struct tcp_sock *tp = tcp_sk(sk);
++ /* size_tk_table goes from the end of tk_table to the end of sk */
++ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
++ sizeof(tp->tk_table);
++#endif
+
+ /* we do not want to clear pinet6 field, because of RCU lookups */
+ sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
+
+ size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
++
++#ifdef CONFIG_MPTCP
++ /* We zero out only from pinet6 to tk_table */
++ size -= size_tk_table + sizeof(tp->tk_table);
++#endif
+ memset(&inet->pinet6 + 1, 0, size);
++
++#ifdef CONFIG_MPTCP
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
++#endif
++
+ }
+
+ struct proto tcpv6_prot = {
+diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
+new file mode 100644
+index 000000000000..cdfc03adabf8
+--- /dev/null
++++ b/net/mptcp/Kconfig
+@@ -0,0 +1,115 @@
++#
++# MPTCP configuration
++#
++config MPTCP
++ bool "MPTCP protocol"
++ depends on (IPV6=y || IPV6=n)
++ ---help---
++ This replaces the normal TCP stack with a Multipath TCP stack,
++ able to use several paths at once.
++
++menuconfig MPTCP_PM_ADVANCED
++ bool "MPTCP: advanced path-manager control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different path-managers. You should choose 'Y' here,
++ because otherwise you will not actively create new MPTCP-subflows.
++
++if MPTCP_PM_ADVANCED
++
++config MPTCP_FULLMESH
++ tristate "MPTCP Full-Mesh Path-Manager"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create a full-mesh among all IP-addresses.
++
++config MPTCP_NDIFFPORTS
++ tristate "MPTCP ndiff-ports"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create multiple subflows between the same
++ pair of IP-addresses, modifying the source-port. You can set the number
++ of subflows via the mptcp_ndiffports-sysctl.
++
++config MPTCP_BINDER
++ tristate "MPTCP Binder"
++ depends on (MPTCP=y)
++ ---help---
++ This path-management module works like ndiffports, and adds the sysctl
++ option to set the gateway (and/or path to) per each additional subflow
++ via Loose Source Routing (IPv4 only).
++
++choice
++ prompt "Default MPTCP Path-Manager"
++ default DEFAULT
++ help
++ Select the Path-Manager of your choice
++
++ config DEFAULT_FULLMESH
++ bool "Full mesh" if MPTCP_FULLMESH=y
++
++ config DEFAULT_NDIFFPORTS
++ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
++
++ config DEFAULT_BINDER
++ bool "binder" if MPTCP_BINDER=y
++
++ config DEFAULT_DUMMY
++ bool "Default"
++
++endchoice
++
++endif
++
++config DEFAULT_MPTCP_PM
++ string
++ default "default" if DEFAULT_DUMMY
++ default "fullmesh" if DEFAULT_FULLMESH
++ default "ndiffports" if DEFAULT_NDIFFPORTS
++ default "binder" if DEFAULT_BINDER
++ default "default"
++
++menuconfig MPTCP_SCHED_ADVANCED
++ bool "MPTCP: advanced scheduler control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different schedulers. You should choose 'Y' here,
++ if you want to choose a different scheduler than the default one.
++
++if MPTCP_SCHED_ADVANCED
++
++config MPTCP_ROUNDROBIN
++ tristate "MPTCP Round-Robin"
++ depends on (MPTCP=y)
++ ---help---
++	  This is a very simple round-robin scheduler. It probably has bad
++	  performance, but it might be interesting for researchers.
++
++choice
++ prompt "Default MPTCP Scheduler"
++ default DEFAULT
++ help
++ Select the Scheduler of your choice
++
++ config DEFAULT_SCHEDULER
++ bool "Default"
++ ---help---
++ This is the default scheduler, sending first on the subflow
++ with the lowest RTT.
++
++ config DEFAULT_ROUNDROBIN
++ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
++ ---help---
++	    This is the round-robin scheduler, sending in a round-robin
++	    fashion.
++
++endchoice
++endif
++
++config DEFAULT_MPTCP_SCHED
++ string
++ depends on (MPTCP=y)
++ default "default" if DEFAULT_SCHEDULER
++ default "roundrobin" if DEFAULT_ROUNDROBIN
++ default "default"
++
+diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
+new file mode 100644
+index 000000000000..35561a7012e3
+--- /dev/null
++++ b/net/mptcp/Makefile
+@@ -0,0 +1,20 @@
++#
++## Makefile for MultiPath TCP support code.
++#
++#
++
++obj-$(CONFIG_MPTCP) += mptcp.o
++
++mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
++ mptcp_output.o mptcp_input.o mptcp_sched.o
++
++obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
++obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
++obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
++obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
++obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
++
++mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
++
+diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
+new file mode 100644
+index 000000000000..95d8da560715
+--- /dev/null
++++ b/net/mptcp/mptcp_binder.c
+@@ -0,0 +1,487 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#include <linux/route.h>
++#include <linux/inet.h>
++#include <linux/mroute.h>
++#include <linux/spinlock_types.h>
++#include <net/inet_ecn.h>
++#include <net/route.h>
++#include <net/xfrm.h>
++#include <net/compat.h>
++#include <linux/slab.h>
++
++#define MPTCP_GW_MAX_LISTS 10
++#define MPTCP_GW_LIST_MAX_LEN 6
++#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
++ MPTCP_GW_MAX_LISTS)
++
++struct mptcp_gw_list {
++ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
++ u8 len[MPTCP_GW_MAX_LISTS];
++};
++
++struct binder_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++
++ /* Prevent multiple sub-sockets concurrently iterating over sockets */
++ spinlock_t *flow_lock;
++};
++
++static struct mptcp_gw_list *mptcp_gws;
++static rwlock_t mptcp_gws_lock;
++
++static int mptcp_binder_ndiffports __read_mostly = 1;
++
++static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
++
++static int mptcp_get_avail_list_ipv4(struct sock *sk)
++{
++ int i, j, list_taken, opt_ret, opt_len;
++ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
++
++ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
++ if (mptcp_gws->len[i] == 0)
++ goto error;
++
++ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
++ list_taken = 0;
++
++ /* Loop through all sub-sockets in this connection */
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
++
++ /* Reset length and options buffer, then retrieve
++ * from socket
++ */
++ opt_len = MAX_IPOPTLEN;
++ memset(opt, 0, MAX_IPOPTLEN);
++ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
++ IP_OPTIONS, opt, &opt_len);
++ if (opt_ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, opt_ret);
++ goto error;
++ }
++
++ /* If socket has no options, it has no stake in this list */
++ if (opt_len <= 0)
++ continue;
++
++ /* Iterate options buffer */
++ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
++ if (*opt_ptr == IPOPT_LSRR) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
++ goto sock_lsrr;
++ }
++ }
++ continue;
++
++sock_lsrr:
++ /* Pointer to the 2nd to last address */
++ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
++
++ /* Addresses start 3 bytes after type offset */
++ opt_ptr += 3;
++ j = 0;
++
++ /* Different length lists cannot be the same */
++ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
++ continue;
++
++ /* Iterate if we are still inside options list
++ * and sysctl list
++ */
++ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
++ /* If there is a different address, this list must
++ * not be set on this socket
++ */
++ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
++ break;
++
++ /* Jump 4 bytes to next address */
++ opt_ptr += 4;
++ j++;
++ }
++
++ /* Reached the end without a differing address, lists
++ * are therefore identical.
++ */
++ if (j == mptcp_gws->len[i]) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
++ list_taken = 1;
++ break;
++ }
++ }
++
++ /* Free list found if not taken by a socket */
++ if (!list_taken) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
++ break;
++ }
++ }
++
++ if (i >= MPTCP_GW_MAX_LISTS)
++ goto error;
++
++ return i;
++error:
++ return -1;
++}
++
++/* The list of addresses is parsed each time a new connection is opened,
++ * to make sure it's up to date. In case of error, all the lists are
++ * marked as unavailable and the subflow's fingerprint is set to 0.
++ */
++static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
++{
++ int i, j, ret;
++ unsigned char opt[MAX_IPOPTLEN] = {0};
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
++
++ /* Read lock: multiple sockets can read LSRR addresses at the same
++ * time, but writes are done in mutual exclusion.
++ * Spin lock: must search for free list for one socket at a time, or
++ * multiple sockets could take the same list.
++ */
++ read_lock(&mptcp_gws_lock);
++ spin_lock(fmp->flow_lock);
++
++ i = mptcp_get_avail_list_ipv4(sk);
++
++ /* Execution enters here only if a free path is found.
++ */
++ if (i >= 0) {
++ opt[0] = IPOPT_NOP;
++ opt[1] = IPOPT_LSRR;
++ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
++ (mptcp_gws->len[i] + 1) + 3;
++ opt[3] = IPOPT_MINOFF;
++ for (j = 0; j < mptcp_gws->len[i]; ++j)
++ memcpy(opt + 4 +
++ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
++ &mptcp_gws->list[i][j].s_addr,
++ sizeof(mptcp_gws->list[i][0].s_addr));
++ /* Final destination must be part of IP_OPTIONS parameter. */
++ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
++ sizeof(addr.s_addr));
++
++ /* setsockopt must be inside the lock, otherwise another
++ * subflow could fail to see that we have taken a list.
++ */
++ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
++ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
++ * (mptcp_gws->len[i] + 1));
++
++ if (ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, ret);
++ }
++ }
++
++ spin_unlock(fmp->flow_lock);
++ read_unlock(&mptcp_gws_lock);
++
++ return;
++}
++
++/* Parses gateways string for a list of paths to different
++ * gateways, and stores them for use with the Loose Source Routing (LSRR)
++ * socket option. Each list must have "," separated addresses, and the lists
++ * themselves must be separated by "-". Returns -1 in case one or more of the
++ * addresses is not a valid ipv4/6 address.
++ */
++static int mptcp_parse_gateway_ipv4(char *gateways)
++{
++ int i, j, k, ret;
++ char *tmp_string = NULL;
++ struct in_addr tmp_addr;
++
++ tmp_string = kzalloc(16, GFP_KERNEL);
++ if (tmp_string == NULL)
++ return -ENOMEM;
++
++ write_lock(&mptcp_gws_lock);
++
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++
++	/* A temporary string is used since in4_pton needs a null-terminated string
++ * but we do not want to modify the sysctl for obvious reasons.
++ * i will iterate over the SYSCTL string, j will iterate over the
++ * temporary string where each IP is copied into, k will iterate over
++ * the IPs in each list.
++ */
++ for (i = j = k = 0;
++ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
++ ++i) {
++ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
++ /* If the temp IP is empty and the current list is
++ * empty, we are done.
++ */
++ if (j == 0 && mptcp_gws->len[k] == 0)
++ break;
++
++ /* Terminate the temp IP string, then if it is
++ * non-empty parse the IP and copy it.
++ */
++ tmp_string[j] = '\0';
++ if (j > 0) {
++ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
++
++ ret = in4_pton(tmp_string, strlen(tmp_string),
++ (u8 *)&tmp_addr.s_addr, '\0',
++ NULL);
++
++ if (ret) {
++ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
++ ret,
++ &tmp_addr.s_addr);
++ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
++ &tmp_addr.s_addr,
++ sizeof(tmp_addr.s_addr));
++ mptcp_gws->len[k]++;
++ j = 0;
++ tmp_string[j] = '\0';
++ /* Since we can't impose a limit to
++ * what the user can input, make sure
++ * there are not too many IPs in the
++ * SYSCTL string.
++ */
++ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
++ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
++ k,
++ MPTCP_GW_LIST_MAX_LEN);
++ goto error;
++ }
++ } else {
++ goto error;
++ }
++ }
++
++ if (gateways[i] == '-' || gateways[i] == '\0')
++ ++k;
++ } else {
++ tmp_string[j] = gateways[i];
++ ++j;
++ }
++ }
++
++ /* Number of flows is number of gateway lists plus master flow */
++ mptcp_binder_ndiffports = k+1;
++
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++
++ return 0;
++
++error:
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++ return -1;
++}
++
++/**
++ * Create all new subflows, by calling mptcp_initX_subsockets
++ *
++ * This function uses a goto next_subflow so the lock can be released between
++ * subflow creations, giving other processes a chance to do some work on the
++ * socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct binder_priv *pm_priv = container_of(work,
++ struct binder_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (mptcp_binder_ndiffports > iter &&
++ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void binder_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
++ static DEFINE_SPINLOCK(flow_lock);
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(meta_sk)) {
++ mptcp_fallback_default(mpcb);
++ return;
++ }
++#endif
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ fmp->flow_lock = &flow_lock;
++}
++
++static void binder_create_subflows(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++/* Callback functions, executed when sysctl mptcp.mptcp_gateways is updated.
++ * Inspired from proc_tcp_congestion_control().
++ */
++static int proc_mptcp_gateways(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ int ret;
++ ctl_table tbl = {
++ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
++ };
++
++ if (write) {
++ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
++ if (tbl.data == NULL)
++ return -1;
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (ret == 0) {
++ ret = mptcp_parse_gateway_ipv4(tbl.data);
++ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
++ }
++ kfree(tbl.data);
++ } else {
++ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
++ }
++
++
++ return ret;
++}
++
++static struct mptcp_pm_ops binder __read_mostly = {
++ .new_session = binder_new_session,
++ .fully_established = binder_create_subflows,
++ .get_local_id = binder_get_local_id,
++ .init_subsocket_v4 = mptcp_v4_add_lsrr,
++ .name = "binder",
++ .owner = THIS_MODULE,
++};
++
++static struct ctl_table binder_table[] = {
++ {
++ .procname = "mptcp_binder_gateways",
++ .data = &sysctl_mptcp_binder_gateways,
++ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_mptcp_gateways
++ },
++ { }
++};
++
++struct ctl_table_header *mptcp_sysctl_binder;
++
++/* General initialization of MPTCP_PM */
++static int __init binder_register(void)
++{
++ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
++ if (!mptcp_gws)
++ return -ENOMEM;
++
++ rwlock_init(&mptcp_gws_lock);
++
++ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
++
++ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
++ binder_table);
++ if (!mptcp_sysctl_binder)
++ goto sysctl_fail;
++
++ if (mptcp_register_path_manager(&binder))
++ goto pm_failed;
++
++ return 0;
++
++pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++sysctl_fail:
++ kfree(mptcp_gws);
++
++ return -1;
++}
++
++static void binder_unregister(void)
++{
++ mptcp_unregister_path_manager(&binder);
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++ kfree(mptcp_gws);
++}
++
++module_init(binder_register);
++module_exit(binder_unregister);
++
++MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("BINDER MPTCP");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
+new file mode 100644
+index 000000000000..5d761164eb85
+--- /dev/null
++++ b/net/mptcp/mptcp_coupled.c
+@@ -0,0 +1,270 @@
++/*
++ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* Scaling is done in the numerator with alpha_scale_num and in the denominator
++ * with alpha_scale_den.
++ *
++ * To downscale, we just need to use alpha_scale.
++ *
++ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
++ */
++static int alpha_scale_den = 10;
++static int alpha_scale_num = 32;
++static int alpha_scale = 12;
++
++struct mptcp_ccc {
++ u64 alpha;
++ bool forced_update;
++};
++
++static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
++}
++
++static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
++}
++
++static inline u64 mptcp_ccc_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_ccc_recalc_alpha(const struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ const struct sock *sub_sk;
++ int best_cwnd = 0, best_rtt = 0, can_send = 0;
++ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior
++ * (set alpha to 1)
++ */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Do regular alpha-calculation for multiple subflows */
++
++ /* Find the max numerator of the alpha-calculation */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++ /* We need to look for the path, that provides the max-value.
++ * Integer-overflow is not possible here, because
++ * tmp will be in u64.
++ */
++ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
++
++ if (tmp >= max_numerator) {
++ max_numerator = tmp;
++ best_cwnd = sub_tp->snd_cwnd;
++ best_rtt = sub_tp->srtt_us;
++ }
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ /* Calculate the denominator */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ sum_denominator += div_u64(
++ mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_den) * best_rtt,
++ sub_tp->srtt_us);
++ }
++ sum_denominator *= sum_denominator;
++ if (unlikely(!sum_denominator)) {
++ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
++ __func__, mpcb->cnt_established);
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++			pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
++ __func__, sub_tp->mptcp->path_index,
++ sub_sk->sk_state, sub_tp->srtt_us,
++ sub_tp->snd_cwnd);
++ }
++ }
++
++ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
++
++ if (unlikely(!alpha))
++ alpha = 1;
++
++exit:
++ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
++}
++
++static void mptcp_ccc_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
++ }
++ /* If we do not mptcp, behave like reno: return */
++}
++
++static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_LOSS)
++ mptcp_ccc_recalc_alpha(sk);
++}
++
++static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(mptcp_meta_sk(sk), 1);
++}
++
++static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_ccc_recalc_alpha(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_ccc_recalc_alpha(sk);
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ }
++
++ if (mpcb->cnt_established > 1) {
++ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
++
++ /* This may happen, if at the initialization, the mpcb
++ * was not yet attached to the sock, and thus
++ * initializing alpha failed.
++ */
++ if (unlikely(!alpha))
++ alpha = 1;
++
++ snd_cwnd = (int) div_u64 ((u64) mptcp_ccc_scale(1, alpha_scale),
++ alpha);
++
++ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
++ * Thus, we select here the max value.
++ */
++ if (snd_cwnd < tp->snd_cwnd)
++ snd_cwnd = tp->snd_cwnd;
++ } else {
++ snd_cwnd = tp->snd_cwnd;
++ }
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_ccc_recalc_alpha(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_ccc = {
++ .init = mptcp_ccc_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_ccc_cong_avoid,
++ .cwnd_event = mptcp_ccc_cwnd_event,
++ .set_state = mptcp_ccc_set_state,
++ .owner = THIS_MODULE,
++ .name = "lia",
++};
++
++static int __init mptcp_ccc_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_ccc);
++}
++
++static void __exit mptcp_ccc_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_ccc);
++}
++
++module_init(mptcp_ccc_register);
++module_exit(mptcp_ccc_unregister);
++
++MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
+new file mode 100644
+index 000000000000..28dfa0479f5e
+--- /dev/null
++++ b/net/mptcp/mptcp_ctrl.c
+@@ -0,0 +1,2401 @@
++/*
++ * MPTCP implementation - MPTCP-control
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/inet_common.h>
++#include <net/inet6_hashtables.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/ip6_route.h>
++#include <net/mptcp_v6.h>
++#endif
++#include <net/sock.h>
++#include <net/tcp.h>
++#include <net/tcp_states.h>
++#include <net/transp_v6.h>
++#include <net/xfrm.h>
++
++#include <linux/cryptohash.h>
++#include <linux/kconfig.h>
++#include <linux/module.h>
++#include <linux/netpoll.h>
++#include <linux/list.h>
++#include <linux/jhash.h>
++#include <linux/tcp.h>
++#include <linux/net.h>
++#include <linux/in.h>
++#include <linux/random.h>
++#include <linux/inetdevice.h>
++#include <linux/workqueue.h>
++#include <linux/atomic.h>
++#include <linux/sysctl.h>
++
++static struct kmem_cache *mptcp_sock_cache __read_mostly;
++static struct kmem_cache *mptcp_cb_cache __read_mostly;
++static struct kmem_cache *mptcp_tw_cache __read_mostly;
++
++int sysctl_mptcp_enabled __read_mostly = 1;
++int sysctl_mptcp_checksum __read_mostly = 1;
++int sysctl_mptcp_debug __read_mostly;
++EXPORT_SYMBOL(sysctl_mptcp_debug);
++int sysctl_mptcp_syn_retries __read_mostly = 3;
++
++bool mptcp_init_failed __read_mostly;
++
++struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
++EXPORT_SYMBOL(mptcp_static_key);
++
++static int proc_mptcp_path_manager(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_PM_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_path_manager(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_path_manager(val);
++ return ret;
++}
++
++static int proc_mptcp_scheduler(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_SCHED_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_scheduler(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_scheduler(val);
++ return ret;
++}
++
++static struct ctl_table mptcp_table[] = {
++ {
++ .procname = "mptcp_enabled",
++ .data = &sysctl_mptcp_enabled,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_checksum",
++ .data = &sysctl_mptcp_checksum,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_debug",
++ .data = &sysctl_mptcp_debug,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_syn_retries",
++ .data = &sysctl_mptcp_syn_retries,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_path_manager",
++ .mode = 0644,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ .proc_handler = proc_mptcp_path_manager,
++ },
++ {
++ .procname = "mptcp_scheduler",
++ .mode = 0644,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ .proc_handler = proc_mptcp_scheduler,
++ },
++ { }
++};
++
++static inline u32 mptcp_hash_tk(u32 token)
++{
++ return token % MPTCP_HASH_SIZE;
++}
++
++struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++EXPORT_SYMBOL(tk_hashtable);
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* The following hash table is used to avoid collision of token */
++static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++static bool mptcp_reqsk_find_tk(const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct mptcp_request_sock *mtreqsk;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
++ &mptcp_reqsk_tk_htb[hash], hash_entry) {
++ if (token == mtreqsk->mptcp_loc_token)
++ return true;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++
++ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
++ &mptcp_reqsk_tk_htb[hash]);
++}
++
++static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++void mptcp_reqsk_destructor(struct request_sock *req)
++{
++ if (!mptcp_rsk(req)->is_sub) {
++ if (in_softirq()) {
++ mptcp_reqsk_remove_tk(req);
++ } else {
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++ }
++ } else {
++ mptcp_hash_request_remove(req);
++ }
++}
++
++static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
++ meta_tp->inside_tk_table = 1;
++}
++
++static bool mptcp_find_token(u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
++ if (token == meta_tp->mptcp_loc_token)
++ return true;
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_set_key_reqsk(struct request_sock *req,
++ const struct sk_buff *skb)
++{
++ const struct inet_request_sock *ireq = inet_rsk(req);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#endif
++ }
++
++ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
++}
++
++/* New MPTCP-connection request, prepare a new token for the meta-socket that
++ * will be created in mptcp_check_req_master(), and store the received token.
++ */
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ inet_rsk(req)->saw_mpc = 1;
++
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_reqsk(req, skb);
++ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
++ mptcp_find_token(mtreq->mptcp_loc_token));
++
++ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ mtreq->mptcp_rem_key = mopt->mptcp_key;
++}
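The do/while loop above keeps re-deriving a key until the resulting token collides with neither the request-socket table nor the established-socket table. A toy userspace sketch of that retry pattern (the "hash" and table here are stand-ins, not the kernel's SHA-1 derivation or nulls-lists):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for key -> token derivation (kernel uses SHA-1). */
static uint32_t toy_token_from_key(uint64_t key)
{
    return (uint32_t)(key * 2654435761u);
}

#define TABLE_SZ 8
static uint32_t table[TABLE_SZ];
static int table_len;

static bool token_in_use(uint32_t tok)
{
    for (int i = 0; i < table_len; i++)
        if (table[i] == tok)
            return true;
    return false;
}

/* Mirrors the loop in mptcp_reqsk_new_mptcp(): derive a token from a
 * fresh key, retry while it collides, then insert the winner. */
static uint32_t pick_unique_token(uint64_t *key)
{
    uint32_t tok;

    do {
        (*key)++;                     /* stand-in for picking a new key */
        tok = toy_token_from_key(*key);
    } while (token_in_use(tok));

    table[table_len++] = tok;
    return tok;
}
```

The kernel version runs this under `mptcp_tk_hashlock` so the uniqueness check and the insertion are atomic with respect to other connection setups.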
++
++static void mptcp_set_key_sk(const struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_sock *isk = inet_sk(sk);
++
++ if (sk->sk_family == AF_INET)
++ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
++ isk->inet_daddr,
++ isk->inet_sport,
++ isk->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ isk->inet_sport,
++ isk->inet_dport);
++#endif
++
++ mptcp_key_sha1(tp->mptcp_loc_key,
++ &tp->mptcp_loc_token, NULL);
++}
++
++void mptcp_connect_init(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_sk(sk);
++ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
++ mptcp_find_token(tp->mptcp_loc_token));
++
++ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++/**
++ * This function takes a reference on the meta-socket it returns.
++ * It is the responsibility of the caller to drop that reference
++ * when releasing the structure.
++ */
++struct sock *mptcp_hash_find(const struct net *net, const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
++ tk_table) {
++ meta_sk = (struct sock *)meta_tp;
++ if (token == meta_tp->mptcp_loc_token &&
++ net_eq(net, sock_net(meta_sk))) {
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ goto out;
++ if (unlikely(token != meta_tp->mptcp_loc_token ||
++ !net_eq(net, sock_net(meta_sk)))) {
++ sock_gen_put(meta_sk);
++ goto begin;
++ }
++ goto found;
++ }
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++out:
++ meta_sk = NULL;
++found:
++ rcu_read_unlock();
++ return meta_sk;
++}
++
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
++{
++ /* remove from the token hashtable */
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++void mptcp_hash_remove(struct tcp_sock *meta_tp)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
++ u32 min_time = 0, last_active = 0;
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u32 elapsed;
++
++ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
++ continue;
++
++ elapsed = keepalive_time_elapsed(tp);
++
++ /* We take the one with the lowest RTT within a reasonable
++ * (meta-RTO)-timeframe
++ */
++ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
++ if (!min_time || tp->srtt_us < min_time) {
++ min_time = tp->srtt_us;
++ rttsk = sk;
++ }
++ continue;
++ }
++
++ /* Otherwise, we just take the most recent active */
++ if (!rttsk && (!last_active || elapsed < last_active)) {
++ last_active = elapsed;
++ lastsk = sk;
++ }
++ }
++
++ if (rttsk)
++ return rttsk;
++
++ return lastsk;
++}
++EXPORT_SYMBOL(mptcp_select_ack_sock);
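The selection policy above can be restated without any socket plumbing: prefer the lowest-RTT subflow that was active within the meta-RTO window; only if none qualifies, fall back to the most recently active one. A self-contained sketch under those assumptions (struct fields are illustrative):

```c
#include <stdint.h>

struct sub {
    uint32_t srtt_us;  /* smoothed RTT, microseconds */
    uint32_t elapsed;  /* time since last activity */
};

/* Mirrors mptcp_select_ack_sock()'s policy. Returns an index into subs,
 * or -1 if the array is empty. */
static int select_ack_sub(const struct sub *subs, int n, uint32_t meta_rto)
{
    int rtt_idx = -1, last_idx = -1;
    uint32_t min_rtt = 0, last_active = 0;

    for (int i = 0; i < n; i++) {
        /* Recently active: compete on RTT. */
        if (subs[i].elapsed < meta_rto) {
            if (rtt_idx < 0 || subs[i].srtt_us < min_rtt) {
                min_rtt = subs[i].srtt_us;
                rtt_idx = i;
            }
            continue;
        }
        /* Fallback: most recently active of the stale ones. */
        if (rtt_idx < 0 && (last_idx < 0 || subs[i].elapsed < last_active)) {
            last_active = subs[i].elapsed;
            last_idx = i;
        }
    }
    return rtt_idx >= 0 ? rtt_idx : last_idx;
}
```

Note the fallback candidate is only tracked while no RTT candidate exists, matching the `!rttsk` guard in the kernel loop.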
++
++static void mptcp_sock_def_error_report(struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (!sock_flag(sk, SOCK_DEAD))
++ mptcp_sub_close(sk, 0);
++
++ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping) {
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ meta_sk->sk_err = sk->sk_err;
++ meta_sk->sk_err_soft = sk->sk_err_soft;
++
++ if (!sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_error_report(meta_sk);
++
++ tcp_done(meta_sk);
++ }
++
++ sk->sk_err = 0;
++ return;
++}
++
++static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
++{
++ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
++ mptcp_cleanup_path_manager(mpcb);
++ mptcp_cleanup_scheduler(mpcb);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ }
++}
++
++static void mptcp_sock_destruct(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ inet_sock_destruct(sk);
++
++ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
++ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
++
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ tp->mptcp = NULL;
++
++ /* Taken when mpcb pointer was set */
++ sock_put(mptcp_meta_sk(sk));
++ mptcp_mpcb_put(tp->mpcb);
++ } else {
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct mptcp_tw *mptw;
++
++ /* The mpcb is disappearing - we can make the final
++ * update to the rcv_nxt of the time-wait-sock and remove
++ * its reference to the mpcb.
++ */
++ spin_lock_bh(&mpcb->tw_lock);
++ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
++ list_del_rcu(&mptw->list);
++ mptw->in_list = 0;
++ mptcp_mpcb_put(mpcb);
++ rcu_assign_pointer(mptw->mpcb, NULL);
++ }
++ spin_unlock_bh(&mpcb->tw_lock);
++
++ mptcp_mpcb_put(mpcb);
++
++ mptcp_debug("%s destroying meta-sk\n", __func__);
++ }
++
++ WARN_ON(!static_key_false(&mptcp_static_key));
++ /* Must be the last call, because is_meta_sk() above still needs the
++ * static key
++ */
++ static_key_slow_dec(&mptcp_static_key);
++}
++
++void mptcp_destroy_sock(struct sock *sk)
++{
++ if (is_meta_sk(sk)) {
++ struct sock *sk_it, *tmpsk;
++
++ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
++ mptcp_purge_ofo_queue(tcp_sk(sk));
++
++ /* We have to close all remaining subflows. Normally, they
++ * should all be about to get closed. But, if the kernel is
++ * forcing a closure (e.g., tcp_write_err), the subflows might
++ * not have been closed properly (as we are waiting for the
++ * DATA_ACK of the DATA_FIN).
++ */
++ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
++ /* Already called tcp_close - waiting for graceful
++ * closure, or we are retransmitting fast-close on
++ * the subflow. The reset (or timeout) will kill the
++ * subflow.
++ */
++ if (tcp_sk(sk_it)->closing ||
++ tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ /* Allow the delayed work first to prevent time-wait state */
++ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
++ continue;
++
++ mptcp_sub_close(sk_it, 0);
++ }
++
++ mptcp_delete_synack_timer(sk);
++ } else {
++ mptcp_del_sock(sk);
++ }
++}
++
++static void mptcp_set_state(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* Meta is not yet established - wake up the application */
++ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
++ sk->sk_state == TCP_ESTABLISHED) {
++ tcp_set_state(meta_sk, TCP_ESTABLISHED);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
++ }
++ }
++
++ if (sk->sk_state == TCP_ESTABLISHED) {
++ tcp_sk(sk)->mptcp->establish_increased = 1;
++ tcp_sk(sk)->mpcb->cnt_established++;
++ }
++}
++
++void mptcp_init_congestion_control(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
++ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
++
++ /* If the application didn't set a congestion control to use,
++ * fall back to the default one.
++ */
++ if (ca == &tcp_init_congestion_ops)
++ goto use_default;
++
++ /* Use the same congestion control as set by the user. If the
++ * module is not available fallback to the default one.
++ */
++ if (!try_module_get(ca->owner)) {
++ pr_warn("%s: fallback to the system default CC\n", __func__);
++ goto use_default;
++ }
++
++ icsk->icsk_ca_ops = ca;
++ if (icsk->icsk_ca_ops->init)
++ icsk->icsk_ca_ops->init(sk);
++
++ return;
++
++use_default:
++ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
++ tcp_init_congestion_control(sk);
++}
++
++u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
++u32 mptcp_seed = 0;
++
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
++ u8 input[64];
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Initialize input with appropriate padding */
++ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
++ * is explicitly set too
++ */
++ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
++ input[8] = 0x80; /* Padding: First bit after message = 1 */
++ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
++
++ sha_init(mptcp_hashed_key);
++ sha_transform(mptcp_hashed_key, input, workspace);
++
++ for (i = 0; i < 5; i++)
++ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
++
++ if (token)
++ *token = mptcp_hashed_key[0];
++ if (idsn)
++ *idsn = *((u64 *)&mptcp_hashed_key[3]);
++}
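The hand-rolled padding above follows the SHA-1 block layout: 8 message bytes, a `0x80` terminator bit, zero fill, and the 64-bit big-endian bit-length of the message in the last bytes (64 bits = `0x40`). A userspace sketch of just the block construction (no hashing):

```c
#include <stdint.h>
#include <string.h>

/* Build the single 64-byte SHA-1 input block for an 8-byte key, the way
 * mptcp_key_sha1() lays it out by hand. */
static void build_sha1_block(uint64_t key, uint8_t out[64])
{
    memset(out, 0, 64);
    memcpy(out, &key, sizeof(key)); /* 8 message bytes at the start */
    out[8]  = 0x80;                 /* padding: first bit after message = 1 */
    out[63] = 0x40;                 /* length field: message is 64 bits */
}
```

Because the message plus padding fits in one 512-bit block, a single `sha_transform()` call suffices, which is why the kernel code avoids the generic crypto API here.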
++
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u8 input[128]; /* 2 512-bit blocks */
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Generate key xored with ipad */
++ memset(input, 0x36, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], rand_1, 4);
++ memcpy(&input[68], rand_2, 4);
++ input[72] = 0x80; /* Padding: First bit after message = 1 */
++ memset(&input[73], 0, 53);
++
++ /* Padding: Length of the message = 512 + 64 bits */
++ input[126] = 0x02;
++ input[127] = 0x40;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++
++ /* Prepare second part of hmac */
++ memset(input, 0x5C, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], hash_out, 20);
++ input[84] = 0x80;
++ memset(&input[85], 0, 41);
++
++ /* Padding: Length of the message = 512 + 160 bits */
++ input[126] = 0x02;
++ input[127] = 0xA0;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++}
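The two hard-coded length fields above (`0x02 0x40` and `0x02 0xA0`) are the big-endian bit counts of each HMAC pass: one full 512-bit key block plus the message tail. A small sketch showing where those constants come from:

```c
#include <stdint.h>

/* SHA-1 length field for a message consisting of one full 512-bit block
 * (the xored key) plus msg_bits of payload, as in mptcp_hmac_sha1(). */
static void sha1_len_field(uint32_t msg_bits, uint8_t *hi, uint8_t *lo)
{
    uint32_t total = 512 + msg_bits;

    *hi = (uint8_t)((total >> 8) & 0xff);
    *lo = (uint8_t)(total & 0xff);
}
```

The inner hash covers 64 message bits (two 32-bit nonces), the outer one 160 bits (the 20-byte inner digest), giving 576 and 672 total bits respectively.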
++
++static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
++{
++ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
++ * ======
++ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
++ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
++ * TCP_NODELAY, TCP_CORK
++ *
++ * Socket-options handled in this function here
++ * ======
++ * TCP_DEFER_ACCEPT
++ * SO_KEEPALIVE
++ *
++ * Socket-options on the todo-list
++ * ======
++ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
++ * across other devices. - what about the api-draft?
++ * SO_DEBUG
++ * SO_REUSEADDR - probably we don't care about this
++ * SO_DONTROUTE, SO_BROADCAST
++ * SO_OOBINLINE
++ * SO_LINGER
++ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
++ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
++ * SO_RXQ_OVFL
++ * TCP_COOKIE_TRANSACTIONS
++ * TCP_MAXSEG
++ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
++ * in mptcp_retransmit_timer. AND we need to check what is
++ * about the subsockets.
++ * TCP_LINGER2
++ * TCP_WINDOW_CLAMP
++ * TCP_USER_TIMEOUT
++ * TCP_MD5SIG
++ *
++ * Socket-options of no concern for the meta-socket (but for the subsocket)
++ * ======
++ * SO_PRIORITY
++ * SO_MARK
++ * TCP_CONGESTION
++ * TCP_SYNCNT
++ * TCP_QUICKACK
++ */
++
++ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
++ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ keepalive_time_when(tcp_sk(meta_sk)));
++ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(master_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(master_sk)->recverr = 0;
++}
++
++static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
++{
++ /* IP_TOS also goes to the subflow. */
++ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
++ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
++ sub_sk->sk_priority = meta_sk->sk_priority;
++ sk_dst_reset(sub_sk);
++ }
++
++ /* Inherit SO_REUSEADDR */
++ sub_sk->sk_reuse = meta_sk->sk_reuse;
++
++ /* Inherit snd/rcv-buffer locks */
++ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
++
++ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
++ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
++ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(sub_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(sub_sk)->recverr = 0;
++}
++
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ /* skb->sk may be NULL if we receive a packet immediately after the
++ * SYN/ACK + MP_CAPABLE.
++ */
++ struct sock *sk = skb->sk ? skb->sk : meta_sk;
++ int ret = 0;
++
++ skb->sk = NULL;
++
++ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
++ kfree_skb(skb);
++ return 0;
++ }
++
++ if (sk->sk_family == AF_INET)
++ ret = tcp_v4_do_rcv(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ ret = tcp_v6_do_rcv(sk, skb);
++#endif
++
++ sock_put(sk);
++ return ret;
++}
++
++struct lock_class_key meta_key;
++struct lock_class_key meta_slock_key;
++
++static void mptcp_synack_timer_handler(unsigned long data)
++{
++ struct sock *meta_sk = (struct sock *) data;
++ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
++
++ /* Only process if socket is not in use. */
++ bh_lock_sock(meta_sk);
++
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later. */
++ mptcp_reset_synack_timer(meta_sk, HZ/20);
++ goto out;
++ }
++
++ /* May happen if the queue got destructed in mptcp_close */
++ if (!lopt)
++ goto out;
++
++ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
++ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
++
++ if (lopt->qlen)
++ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
++
++out:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++}
++
++static const struct tcp_sock_ops mptcp_meta_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = mptcp_send_fin,
++ .write_xmit = mptcp_write_xmit,
++ .send_active_reset = mptcp_send_active_reset,
++ .write_wakeup = mptcp_write_wakeup,
++ .prune_ofo_queue = mptcp_prune_ofo_queue,
++ .retransmit_timer = mptcp_retransmit_timer,
++ .time_wait = mptcp_time_wait,
++ .cleanup_rbuf = mptcp_cleanup_rbuf,
++};
++
++static const struct tcp_sock_ops mptcp_sub_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
++static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct mptcp_cb *mpcb;
++ struct sock *master_sk;
++ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
++ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
++ u64 idsn;
++
++ dst_release(meta_sk->sk_rx_dst);
++ meta_sk->sk_rx_dst = NULL;
++ /* This flag is set to announce sock_lock_init to
++ * reclassify the lock-class of the master socket.
++ */
++ meta_tp->is_master_sk = 1;
++ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
++ meta_tp->is_master_sk = 0;
++ if (!master_sk)
++ return -ENOBUFS;
++
++ master_tp = tcp_sk(master_sk);
++ master_icsk = inet_csk(master_sk);
++
++ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
++ if (!mpcb) {
++ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
++ * All the rest is set to 0 thanks to __GFP_ZERO above.
++ */
++ atomic_set(&master_sk->sk_wmem_alloc, 1);
++ sk_free(master_sk);
++ return -ENOBUFS;
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->ipv6_mc_list = NULL;
++ newnp->ipv6_ac_list = NULL;
++ newnp->ipv6_fl_list = NULL;
++ newnp->opt = NULL;
++ newnp->pktoptions = NULL;
++ (void)xchg(&newnp->rxpmtu, NULL);
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->hop_limit = -1;
++ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
++ newnp->mc_loop = 1;
++ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
++ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
++ }
++#endif
++
++ meta_tp->mptcp = NULL;
++
++ /* Store the keys and generate the peer's token */
++ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
++ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
++
++ /* Generate Initial data-sequence-numbers */
++ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->snd_high_order[0] = idsn >> 32;
++ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
++
++ meta_tp->write_seq = (u32)idsn;
++ meta_tp->snd_sml = meta_tp->write_seq;
++ meta_tp->snd_una = meta_tp->write_seq;
++ meta_tp->snd_nxt = meta_tp->write_seq;
++ meta_tp->pushed_seq = meta_tp->write_seq;
++ meta_tp->snd_up = meta_tp->write_seq;
++
++ mpcb->mptcp_rem_key = remote_key;
++ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->rcv_high_order[0] = idsn >> 32;
++ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
++ meta_tp->copied_seq = (u32) idsn;
++ meta_tp->rcv_nxt = (u32) idsn;
++ meta_tp->rcv_wup = (u32) idsn;
++
++ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
++ meta_tp->snd_wnd = window;
++ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
++
++ meta_tp->packets_out = 0;
++ meta_icsk->icsk_probes_out = 0;
++
++ /* Set mptcp-pointers */
++ master_tp->mpcb = mpcb;
++ master_tp->meta_sk = meta_sk;
++ meta_tp->mpcb = mpcb;
++ meta_tp->meta_sk = meta_sk;
++ mpcb->meta_sk = meta_sk;
++ mpcb->master_sk = master_sk;
++
++ meta_tp->was_meta_sk = 0;
++
++ /* Initialize the queues */
++ skb_queue_head_init(&mpcb->reinject_queue);
++ skb_queue_head_init(&master_tp->out_of_order_queue);
++ tcp_prequeue_init(master_tp);
++ INIT_LIST_HEAD(&master_tp->tsq_node);
++
++ master_tp->tsq_flags = 0;
++
++ mutex_init(&mpcb->mpcb_mutex);
++
++ /* Init the accept_queue structure, we support a queue of 32 pending
++ * connections, it does not need to be huge, since we only store here
++ * pending subflow creations.
++ */
++ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
++ inet_put_port(master_sk);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ sk_free(master_sk);
++ return -ENOMEM;
++ }
++
++ /* Redefine function-pointers as the meta-sk is now fully ready */
++ static_key_slow_inc(&mptcp_static_key);
++ meta_tp->mpc = 1;
++ meta_tp->ops = &mptcp_meta_specific;
++
++ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
++ meta_sk->sk_destruct = mptcp_sock_destruct;
++
++ /* Meta-level retransmit timer */
++ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
++
++ tcp_init_xmit_timers(master_sk);
++ /* Has been set for sending out the SYN */
++ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
++
++ if (!meta_tp->inside_tk_table) {
++ /* Adding the meta_tp in the token hashtable - coming from server-side */
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++
++ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
++
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ }
++ master_tp->inside_tk_table = 0;
++
++ /* Init time-wait stuff */
++ INIT_LIST_HEAD(&mpcb->tw_list);
++ spin_lock_init(&mpcb->tw_lock);
++
++ INIT_HLIST_HEAD(&mpcb->callback_list);
++
++ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
++
++ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
++ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
++ mpcb->orig_window_clamp = meta_tp->window_clamp;
++
++ /* The meta is directly linked - set refcnt to 1 */
++ atomic_set(&mpcb->mpcb_refcnt, 1);
++
++ mptcp_init_path_manager(mpcb);
++ mptcp_init_scheduler(mpcb);
++
++ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
++ (unsigned long)meta_sk);
++
++ mptcp_debug("%s: created mpcb with token %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ return 0;
++}
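The IDSN handling above splits a 64-bit initial data-sequence number into a 32-bit starting sequence plus the "high order" words used to extend 32-bit sequence arithmetic back to 64 bits. A sketch of the send-side split (field names are illustrative):

```c
#include <stdint.h>

/* Mirrors how mptcp_alloc_mpcb() seeds the send side: IDSN + 1 becomes
 * the 32-bit write_seq, and snd_high_order holds the current and
 * previous upper-32-bit words. */
static void split_idsn(uint64_t idsn, uint32_t *seq, uint32_t high[2])
{
    idsn += 1;
    high[0] = (uint32_t)(idsn >> 32);
    high[1] = high[0] - 1;      /* previous wrap of the upper word */
    *seq = (uint32_t)idsn;
}
```

The receive side in the patch does the mirror image (`rcv_high_order[1] = rcv_high_order[0] + 1`), since the peer's sequence space advances toward the next wrap rather than away from the last one.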
++
++void mptcp_fallback_meta_sk(struct sock *meta_sk)
++{
++ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
++ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
++}
++
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
++ if (!tp->mptcp)
++ return -ENOMEM;
++
++ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
++ /* No more space for more subflows? */
++ if (!tp->mptcp->path_index) {
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ return -EPERM;
++ }
++
++ INIT_HLIST_NODE(&tp->mptcp->cb_list);
++
++ tp->mptcp->tp = tp;
++ tp->mpcb = mpcb;
++ tp->meta_sk = meta_sk;
++
++ static_key_slow_inc(&mptcp_static_key);
++ tp->mpc = 1;
++ tp->ops = &mptcp_sub_specific;
++
++ tp->mptcp->loc_id = loc_id;
++ tp->mptcp->rem_id = rem_id;
++ if (mpcb->sched_ops->init)
++ mpcb->sched_ops->init(sk);
++
++ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
++ * included in mptcp_del_sock(), because the mpcb must remain alive
++ * until the last subsocket is completely destroyed.
++ */
++ sock_hold(meta_sk);
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tp->mptcp->next = mpcb->connection_list;
++ mpcb->connection_list = tp;
++ tp->mptcp->attached = 1;
++
++ mpcb->cnt_subflows++;
++ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
++ &meta_sk->sk_rmem_alloc);
++
++ mptcp_sub_inherit_sockopts(meta_sk, sk);
++ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
++
++ /* As we successfully allocated the mptcp_tcp_sock, we have to
++ * change the function-pointers here (for sk_destruct to work correctly)
++ */
++ sk->sk_error_report = mptcp_sock_def_error_report;
++ sk->sk_data_ready = mptcp_data_ready;
++ sk->sk_write_space = mptcp_write_space;
++ sk->sk_state_change = mptcp_set_state;
++ sk->sk_destruct = mptcp_sock_destruct;
++
++ if (sk->sk_family == AF_INET)
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index,
++ &((struct inet_sock *)tp)->inet_saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &((struct inet_sock *)tp)->inet_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &sk->sk_v6_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#endif
++
++ return 0;
++}
++
++void mptcp_del_sock(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
++ struct mptcp_cb *mpcb;
++
++ if (!tp->mptcp || !tp->mptcp->attached)
++ return;
++
++ mpcb = tp->mpcb;
++ tp_prev = mpcb->connection_list;
++
++ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
++ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ sk->sk_state, is_meta_sk(sk));
++
++ if (tp_prev == tp) {
++ mpcb->connection_list = tp->mptcp->next;
++ } else {
++ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
++ if (tp_prev->mptcp->next == tp) {
++ tp_prev->mptcp->next = tp->mptcp->next;
++ break;
++ }
++ }
++ }
++ mpcb->cnt_subflows--;
++ if (tp->mptcp->establish_increased)
++ mpcb->cnt_established--;
++
++ tp->mptcp->next = NULL;
++ tp->mptcp->attached = 0;
++ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
++
++ if (!skb_queue_empty(&sk->sk_write_queue))
++ mptcp_reinject_data(sk, 0);
++
++ if (is_master_tp(tp))
++ mpcb->master_sk = NULL;
++ else if (tp->mptcp->pre_established)
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++
++ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
++}
++
++/* Updates the metasocket ULID/port data, based on the given sock.
++ * The argument sock must be the sock accessible to the application.
++ * In this function, we update the meta socket info, based on the changes
++ * in the application socket (bind, address allocation, ...)
++ */
++void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
++{
++ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
++ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
++
++ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
++}
++
++/* Clean up the receive buffer for full frames taken by the user,
++ * then send an ACK if necessary. COPIED is the number of bytes
++ * tcp_recvmsg has given to the user so far, it speeds up the
++ * calculation of whether or not we must ACK for the sake of
++ * a window update.
++ */
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk;
++ __u32 rcv_window_now = 0;
++
++ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
++ rcv_window_now = tcp_receive_window(meta_tp);
++
++ if (2 * rcv_window_now > meta_tp->window_clamp)
++ rcv_window_now = 0;
++ }
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (!mptcp_sk_can_send_ack(sk))
++ continue;
++
++ if (!inet_csk_ack_scheduled(sk))
++ goto second_part;
++ /* Delayed ACKs frequently hit locked sockets during bulk
++ * receive.
++ */
++ if (icsk->icsk_ack.blocked ||
++ /* Once-per-two-segments ACK was not sent by tcp_input.c */
++ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
++ /* If this read emptied read buffer, we send ACK, if
++ * connection is not bidirectional, user drained
++ * receive buffer and there was a small segment
++ * in queue.
++ */
++ (copied > 0 &&
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
++ !icsk->icsk_ack.pingpong)) &&
++ !atomic_read(&meta_sk->sk_rmem_alloc))) {
++ tcp_send_ack(sk);
++ continue;
++ }
++
++second_part:
++ /* This here is the second part of tcp_cleanup_rbuf */
++ if (rcv_window_now) {
++ __u32 new_window = tp->ops->__select_window(sk);
++
++ /* Send ACK now, if this read freed lots of space
++ * in our buffer. new_window is the window we could
++ * advertise now; do so if it is not less than the
++ * current one.
++ * "Lots" means "at least twice" here.
++ */
++ if (new_window && new_window >= 2 * rcv_window_now)
++ tcp_send_ack(sk);
++ }
++ }
++}
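The window-update decision in the second part above reduces to a simple predicate: only wake the peer with an ACK if the freshly computed window at least doubled the one currently advertised. As a standalone sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the "lots of space" test in mptcp_cleanup_rbuf(): advertise
 * only when the new window is nonzero and at least twice the current
 * receive window (and a window update is warranted at all). */
static bool should_ack_for_window(uint32_t new_window, uint32_t rcv_window_now)
{
    return rcv_window_now && new_window && new_window >= 2 * rcv_window_now;
}
```

The `rcv_window_now` guard corresponds to the early check in the function: if the current window already exceeds half the clamp, no update is needed on any subflow.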
++
++static int mptcp_sub_send_fin(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *skb = tcp_write_queue_tail(sk);
++ int mss_now;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = tcp_current_mss(sk);
++
++ if (tcp_send_head(sk) != NULL) {
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ tp->write_seq++;
++ } else {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (!skb)
++ return 1;
++
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
++ tcp_init_nondata_skb(skb, tp->write_seq,
++ TCPHDR_ACK | TCPHDR_FIN);
++ tcp_queue_skb(sk, skb);
++ }
++ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
++
++ return 0;
++}
++
++void mptcp_sub_close_wq(struct work_struct *work)
++{
++ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
++ struct sock *sk = (struct sock *)tp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ mutex_lock(&tp->mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ if (sock_flag(sk, SOCK_DEAD))
++ goto exit;
++
++ /* We come from tcp_disconnect. We are sure that meta_sk is set */
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ goto exit;
++ }
++
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&tp->mpcb->mpcb_mutex);
++ sock_put(sk);
++}
++
++void mptcp_sub_close(struct sock *sk, unsigned long delay)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
++
++ /* We are already closing - e.g., call from sock_def_error_report upon
++ * tcp_disconnect in tcp_close.
++ */
++ if (tp->closing)
++ return;
++
++ /* Work already scheduled? */
++ if (work_pending(&work->work)) {
++ /* Work present - who will be first? */
++ if (jiffies + delay > work->timer.expires)
++ return;
++
++ /* Try canceling - if it fails, work will be executed soon */
++ if (!cancel_delayed_work(work))
++ return;
++ sock_put(sk);
++ }
++
++ if (!delay) {
++ unsigned char old_state = sk->sk_state;
++
++ /* If we are in user-context we can directly do the closing
++ * procedure. No need to schedule a work-queue.
++ */
++ if (!in_softirq()) {
++ if (sock_flag(sk, SOCK_DEAD))
++ return;
++
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ return;
++ }
++
++ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
++ sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++ return;
++ }
++
++ /* We send the FIN directly, because it may take quite a long
++ * time until the work-queue gets scheduled...
++ *
++ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
++ * the old state so that tcp_close will finally send the FIN
++ * in user-context.
++ */
++ if (!sk->sk_err && old_state != TCP_CLOSE &&
++ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
++ if (old_state == TCP_ESTABLISHED)
++ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
++ sk->sk_state = old_state;
++ }
++ }
++
++ sock_hold(sk);
++ queue_delayed_work(mptcp_wq, work, delay);
++}
++
++void mptcp_sub_force_close(struct sock *sk)
++{
++ /* The below tcp_done may have freed the socket, if it is already dead.
++ * Thus, we are not allowed to access it afterwards. That's why
++ * we have to store the dead-state in this local variable.
++ */
++ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
++
++ tcp_sk(sk)->mp_killed = 1;
++
++ if (sk->sk_state != TCP_CLOSE)
++ tcp_done(sk);
++
++ if (!sock_is_dead)
++ mptcp_sub_close(sk, 0);
++}
++EXPORT_SYMBOL(mptcp_sub_force_close);
++
++/* Update the mpcb send window, based on the contributions
++ * of each subflow
++ */
++void mptcp_update_sndbuf(const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk, *sk;
++ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ new_sndbuf += sk->sk_sndbuf;
++
++ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
++ new_sndbuf = sysctl_tcp_wmem[2];
++ break;
++ }
++ }
++ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
++
++ /* The subflow's call to sk_write_space in tcp_new_space ends up in
++ * mptcp_write_space.
++ * It has nothing to do with waking up the application.
++ * So, we do it here.
++ */
++ if (old_sndbuf != meta_sk->sk_sndbuf)
++ meta_sk->sk_write_space(meta_sk);
++}
++
++void mptcp_close(struct sock *meta_sk, long timeout)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk_it, *tmpsk;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ int data_was_unread = 0;
++ int state;
++
++ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock(meta_sk);
++
++ if (meta_tp->inside_tk_table) {
++ /* Detach the mpcb from the token hashtable */
++ mptcp_hash_remove_bh(meta_tp);
++ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
++ }
++
++ meta_sk->sk_shutdown = SHUTDOWN_MASK;
++ /* We need to flush the recv. buffs. We do this only on the
++ * descriptor close, not protocol-sourced closes, because the
++ * reader process may not have drained the data yet!
++ */
++ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
++ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
++ tcp_hdr(skb)->fin;
++ data_was_unread += len;
++ __kfree_skb(skb);
++ }
++
++ sk_mem_reclaim(meta_sk);
++
++ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
++ if (meta_sk->sk_state == TCP_CLOSE) {
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++ mptcp_sub_close(sk_it, 0);
++ }
++ goto adjudge_to_death;
++ }
++
++ if (data_was_unread) {
++ /* Unread data was tossed, zap the connection. */
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
++ meta_sk->sk_allocation);
++ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
++ /* Check zero linger _after_ checking for unread data. */
++ meta_sk->sk_prot->disconnect(meta_sk, 0);
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ } else if (tcp_close_state(meta_sk)) {
++ mptcp_send_fin(meta_sk);
++ } else if (meta_tp->snd_una == meta_tp->write_seq) {
++ /* The DATA_FIN has been sent and acknowledged
++ * (e.g., by sk_shutdown). Close all the other subflows
++ */
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ unsigned long delay = 0;
++ /* If we are the passive closer, don't trigger the
++ * subflow FIN until the subflow has been FINed
++ * by the peer - thus we add a delay
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++
++ sk_stream_wait_close(meta_sk, timeout);
++
++adjudge_to_death:
++ state = meta_sk->sk_state;
++ sock_hold(meta_sk);
++ sock_orphan(meta_sk);
++
++ /* socket will be freed after mptcp_close - we have to prevent
++ * access from the subflows.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ /* Similar to sock_orphan, but we don't set it DEAD, because
++ * the callbacks are still set and must be called.
++ */
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_set_socket(sk_it, NULL);
++ sk_it->sk_wq = NULL;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++
++ /* It is the last release_sock in its life. It will remove backlog. */
++ release_sock(meta_sk);
++
++ /* Now socket is owned by kernel and we acquire BH lock
++ * to finish close. No need to check for user refs.
++ */
++ local_bh_disable();
++ bh_lock_sock(meta_sk);
++ WARN_ON(sock_owned_by_user(meta_sk));
++
++ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
++
++ /* Have we already been destroyed by a softirq or backlog? */
++ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
++ goto out;
++
++ /* This is a (useful) BSD violating of the RFC. There is a
++ * problem with TCP as specified in that the other end could
++ * keep a socket open forever with no application left this end.
++ * We use a 3 minute timeout (about the same as BSD) then kill
++ * our end. If they send after that then tough - BUT: long enough
++ * that we won't make the old 4*rto = almost no time - whoops
++ * reset mistake.
++ *
++ * Nope, it was not mistake. It is really desired behaviour
++ * f.e. on http servers, when such sockets are useless, but
++ * consume significant resources. Let's do it with special
++ * linger2 option. --ANK
++ */
++
++ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
++ if (meta_tp->linger2 < 0) {
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONLINGER);
++ } else {
++ const int tmo = tcp_fin_time(meta_sk);
++
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ tmo - TCP_TIMEWAIT_LEN);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
++ tmo);
++ goto out;
++ }
++ }
++ }
++ if (meta_sk->sk_state != TCP_CLOSE) {
++ sk_mem_reclaim(meta_sk);
++ if (tcp_too_many_orphans(meta_sk, 0)) {
++ if (net_ratelimit())
++ pr_info("MPTCP: too many orphaned sockets\n");
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONMEMORY);
++ }
++ }
++
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ inet_csk_destroy_sock(meta_sk);
++ /* Otherwise, socket is reprieved until protocol close. */
++
++out:
++ bh_unlock_sock(meta_sk);
++ local_bh_enable();
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk); /* Taken by sock_hold */
++}
++
++void mptcp_disconnect(struct sock *sk)
++{
++ struct sock *subsk, *tmpsk;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ mptcp_delete_synack_timer(sk);
++
++ __skb_queue_purge(&tp->mpcb->reinject_queue);
++
++ if (tp->inside_tk_table) {
++ mptcp_hash_remove_bh(tp);
++ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
++ }
++
++ local_bh_disable();
++ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
++ /* The socket will get removed from the subsocket-list
++ * and made non-mptcp by setting mpc to 0.
++ *
++ * This is necessary, because tcp_disconnect assumes
++ * that the connection is completely dead afterwards.
++ * Thus we need to do a mptcp_del_sock. Due to this call
++ * we have to make it non-mptcp.
++ *
++ * We have to lock the socket, because we set mpc to 0.
++ * An incoming packet would take the subsocket's lock
++ * and go on into the receive-path.
++ * This would be a race.
++ */
++
++ bh_lock_sock(subsk);
++ mptcp_del_sock(subsk);
++ tcp_sk(subsk)->mpc = 0;
++ tcp_sk(subsk)->ops = &tcp_specific;
++ mptcp_sub_force_close(subsk);
++ bh_unlock_sock(subsk);
++ }
++ local_bh_enable();
++
++ tp->was_meta_sk = 1;
++ tp->mpc = 0;
++ tp->ops = &tcp_specific;
++}
++
++
++/* Returns 1 if we should enable MPTCP for that socket. */
++int mptcp_doit(struct sock *sk)
++{
++ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return 0;
++
++ /* Socket may already be established (e.g., called from tcp_recvmsg) */
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
++ return 1;
++
++ /* Don't do mptcp over loopback */
++ if (sk->sk_family == AF_INET &&
++ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
++ return 0;
++#if IS_ENABLED(CONFIG_IPV6)
++ if (sk->sk_family == AF_INET6 &&
++ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
++ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
++ return 0;
++#endif
++ if (mptcp_v6_is_v4_mapped(sk) &&
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
++ return 0;
++
++#ifdef CONFIG_TCP_MD5SIG
++ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
++ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
++ return 0;
++#endif
++
++ return 1;
++}
++
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct tcp_sock *master_tp;
++ struct sock *master_sk;
++
++ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
++ goto err_alloc_mpcb;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++ master_tp = tcp_sk(master_sk);
++
++ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
++ goto err_add_sock;
++
++ if (__inet_inherit_port(meta_sk, master_sk) < 0)
++ goto err_add_sock;
++
++ meta_sk->sk_prot->unhash(meta_sk);
++
++ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
++ __inet_hash_nolisten(master_sk, NULL);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ __inet6_hash(master_sk, NULL);
++#endif
++
++ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
++
++ return 0;
++
++err_add_sock:
++ mptcp_fallback_meta_sk(meta_sk);
++
++ inet_csk_prepare_forced_close(master_sk);
++ tcp_done(master_sk);
++ inet_csk_prepare_forced_close(meta_sk);
++ tcp_done(meta_sk);
++
++err_alloc_mpcb:
++ return -ENOBUFS;
++}
++
++static int __mptcp_check_req_master(struct sock *child,
++ struct request_sock *req)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct sock *meta_sk = child;
++ struct mptcp_cb *mpcb;
++ struct mptcp_request_sock *mtreq;
++
++ /* Never contained an MP_CAPABLE */
++ if (!inet_rsk(req)->mptcp_rqsk)
++ return 1;
++
++ if (!inet_rsk(req)->saw_mpc) {
++ /* Fallback to regular TCP, because we saw one SYN without
++ * MP_CAPABLE. In tcp_check_req we continue the regular path.
++ * But, the socket has been added to the reqsk_tk_htb, so we
++ * must still remove it.
++ */
++ mptcp_reqsk_remove_tk(req);
++ return 1;
++ }
++
++ /* Just set these values to pass them to mptcp_alloc_mpcb */
++ mtreq = mptcp_rsk(req);
++ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
++ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
++
++ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
++ child_tp->snd_wnd))
++ return -ENOBUFS;
++
++ child = tcp_sk(child)->mpcb->master_sk;
++ child_tp = tcp_sk(child);
++ mpcb = child_tp->mpcb;
++
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++
++ mpcb->dss_csum = mtreq->dss_csum;
++ mpcb->server_side = 1;
++
++ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
++ mptcp_update_metasocket(child, meta_sk);
++
++ /* Needs to be done here additionally, because when accepting a
++ * new connection we pass by __reqsk_free and not reqsk_free.
++ */
++ mptcp_reqsk_remove_tk(req);
++
++ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
++ sock_put(meta_sk);
++
++ return 0;
++}
++
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
++{
++ struct sock *meta_sk = child, *master_sk;
++ struct sk_buff *skb;
++ u32 new_mapping;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++
++ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
++ * pre-MPTCP data in the receive queue.
++ */
++ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
++ tcp_rsk(req)->rcv_isn - 1;
++
++ /* Map subflow sequence number to data sequence numbers. We need to map
++ * these data to [IDSN - len - 1, IDSN[.
++ */
++ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
++
++ /* There should be only one skb: the SYN + data. */
++ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* With fastopen we change the semantics of the relative subflow
++ * sequence numbers to deal with middleboxes that could add/remove
++ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
++ * instead of the regular TCP ISN.
++ */
++ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
++
++ /* We need to update copied_seq of the master_sk to account for the
++ * already moved data to the meta receive queue.
++ */
++ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
++
++ /* Handled by the master_sk */
++ tcp_sk(meta_sk)->fastopen_rsk = NULL;
++
++ return 0;
++}
++
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ struct sock *meta_sk = child;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ inet_csk_reqsk_queue_removed(sk, req);
++ inet_csk_reqsk_queue_add(sk, req, meta_sk);
++
++ return 0;
++}
++
++struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ u8 hash_mac_check[20];
++
++ child_tp->inside_tk_table = 0;
++
++ if (!mopt->join_ack)
++ goto teardown;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mtreq->mptcp_rem_nonce,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++
++ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
++ goto teardown;
++
++ /* Point it to the same struct socket and wq as the meta_sk */
++ sk_set_socket(child, meta_sk->sk_socket);
++ child->sk_wq = meta_sk->sk_wq;
++
++ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
++ /* Has been inherited, but now child_tp->mptcp is NULL */
++ child_tp->mpc = 0;
++ child_tp->ops = &tcp_specific;
++
++ /* TODO when we support acking the third ack for new subflows,
++ * we should silently discard this third ack, by returning NULL.
++ *
++ * Maybe, at the retransmission we will have enough memory to
++ * fully add the socket to the meta-sk.
++ */
++ goto teardown;
++ }
++
++ /* The child is a clone of the meta socket, we must now reset
++ * some of the fields
++ */
++ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
++
++ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
++ * use the original values instead of the bloated up ones from the
++ * clone.
++ */
++ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
++ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
++
++ child_tp->mptcp->slave_sk = 1;
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
++
++ child_tp->tsq_flags = 0;
++
++ /* Subflows do not use the accept queue, as they
++ * are attached immediately to the mpcb.
++ */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ return child;
++
++teardown:
++ /* Drop this request - sock creation failed. */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ inet_csk_prepare_forced_close(child);
++ tcp_done(child);
++ return meta_sk;
++}
++
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
++{
++ struct mptcp_tw *mptw;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ /* A subsocket in tw can only receive data. So, if we are in
++ * infinite-receive, then we should not reply with a data-ack or act
++ * upon general MPTCP-signaling. We prevent this by simply not creating
++ * the mptcp_tw_sock.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tw->mptcp_tw = NULL;
++ return 0;
++ }
++
++ /* Alloc MPTCP-tw-sock */
++ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
++ if (!mptw)
++ return -ENOBUFS;
++
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tw->mptcp_tw = mptw;
++ mptw->loc_key = mpcb->mptcp_loc_key;
++ mptw->meta_tw = mpcb->in_time_wait;
++ if (mptw->meta_tw) {
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
++ if (mpcb->mptw_state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_assign_pointer(mptw->mpcb, mpcb);
++
++ spin_lock(&mpcb->tw_lock);
++ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
++ mptw->in_list = 1;
++ spin_unlock(&mpcb->tw_lock);
++
++ return 0;
++}
++
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
++{
++ struct mptcp_cb *mpcb;
++
++ rcu_read_lock();
++ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
++
++ /* If we are still holding a ref to the mpcb, we have to remove ourselves
++ * from the list and drop the ref properly.
++ */
++ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
++ spin_lock(&mpcb->tw_lock);
++ if (tw->mptcp_tw->in_list) {
++ list_del_rcu(&tw->mptcp_tw->list);
++ tw->mptcp_tw->in_list = 0;
++ }
++ spin_unlock(&mpcb->tw_lock);
++
++ /* Twice, because we increased it above */
++ mptcp_mpcb_put(mpcb);
++ mptcp_mpcb_put(mpcb);
++ }
++
++ rcu_read_unlock();
++
++ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
++}
++
++/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
++ * data-fin.
++ */
++void mptcp_time_wait(struct sock *sk, int state, int timeo)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_tw *mptw;
++
++ /* Used for sockets that go into tw after the meta
++ * (see mptcp_init_tw_sock())
++ */
++ tp->mpcb->in_time_wait = 1;
++ tp->mpcb->mptw_state = state;
++
++ /* Update the time-wait-sock's information */
++ rcu_read_lock_bh();
++ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
++ mptw->meta_tw = 1;
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
++
++ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
++ * pretend as if the DATA_FIN has already reached us, so that
++ * the checks in tcp_timewait_state_process succeed when the
++ * DATA_FIN comes in.
++ */
++ if (state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_read_unlock_bh();
++
++ tcp_done(sk);
++}
++
++void mptcp_tsq_flags(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* It will be handled as a regular deferred-call */
++ if (is_meta_sk(sk))
++ return;
++
++ if (hlist_unhashed(&tp->mptcp->cb_list)) {
++ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
++ /* We need to hold the socket here, as the sock_hold is not
++ * guaranteed by release_sock as it is in regular TCP.
++ *
++ * The subsocket may get inet_csk_destroy'd while it is inside
++ * the callback_list.
++ */
++ sock_hold(sk);
++ }
++
++ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
++ sock_hold(meta_sk);
++}
++
++void mptcp_tsq_sub_deferred(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_tcp_sock *mptcp;
++ struct hlist_node *tmp;
++
++ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
++
++ __sock_put(meta_sk);
++ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
++ struct tcp_sock *tp = mptcp->tp;
++ struct sock *sk = (struct sock *)tp;
++
++ hlist_del_init(&mptcp->cb_list);
++ sk->sk_prot->release_cb(sk);
++ /* Final sock_put (cfr. mptcp_tsq_flags) */
++ sock_put(sk);
++ }
++}
++
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_options_received mopt;
++ u8 mptcp_hash_mac[20];
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mtreq = mptcp_rsk(req);
++ mtreq->mptcp_mpcb = mpcb;
++ mtreq->is_sub = 1;
++ inet_rsk(req)->mptcp_rqsk = 1;
++
++ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
++ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
++
++ mtreq->rem_id = mopt.rem_id;
++ mtreq->rcv_low_prio = mopt.low_prio;
++ inet_rsk(req)->saw_mpc = 1;
++}
++
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ struct mptcp_request_sock *mreq = mptcp_rsk(req);
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mreq->is_sub = 0;
++ inet_rsk(req)->mptcp_rqsk = 1;
++ mreq->dss_csum = mopt.dss_csum;
++ mreq->hash_entry.pprev = NULL;
++
++ mptcp_reqsk_new_mptcp(req, &mopt, skb);
++}
++
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false;
++
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb,
++ mptcp_request_sock_ops.slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ if (mopt.is_mp_join)
++ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
++ if (mopt.drop_me)
++ goto drop;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
++ mopt.saw_mpc = 0;
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (skb_rtable(skb)->rt_flags &
++ (RTCF_BROADCAST | RTCF_MULTICAST))
++ goto drop;
++
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_request_sock_ipv4_ops,
++ sk, skb);
++ }
++
++ return tcp_v4_conn_request(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (!ipv6_unicast_destination(skb))
++ goto drop;
++
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_request_sock_ipv6_ops,
++ sk, skb);
++ }
++
++ return tcp_v6_conn_request(sk, skb);
++#endif
++ }
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++
++struct workqueue_struct *mptcp_wq;
++EXPORT_SYMBOL(mptcp_wq);
++
++/* Output /proc/net/mptcp */
++static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
++{
++ struct tcp_sock *meta_tp;
++ const struct net *net = seq->private;
++ int i, n = 0;
++
++ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
++ seq_putc(seq, '\n');
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ struct hlist_nulls_node *node;
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node,
++ &tk_hashtable[i], tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp;
++ struct inet_sock *isk = inet_sk(meta_sk);
++
++ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
++ continue;
++
++ if (capable(CAP_NET_ADMIN)) {
++ seq_printf(seq, "%4d: %04X %04X ", n++,
++ mpcb->mptcp_loc_token,
++ mpcb->mptcp_rem_token);
++ } else {
++ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
++ }
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
++ isk->inet_rcv_saddr,
++ ntohs(isk->inet_sport),
++ isk->inet_daddr,
++ ntohs(isk->inet_dport));
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
++ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
++ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
++ src->s6_addr32[0], src->s6_addr32[1],
++ src->s6_addr32[2], src->s6_addr32[3],
++ ntohs(isk->inet_sport),
++ dst->s6_addr32[0], dst->s6_addr32[1],
++ dst->s6_addr32[2], dst->s6_addr32[3],
++ ntohs(isk->inet_dport));
++#endif
++ }
++ seq_printf(seq, " %02X %02X %08X:%08X %lu",
++ meta_sk->sk_state, mpcb->cnt_subflows,
++ meta_tp->write_seq - meta_tp->snd_una,
++ max_t(int, meta_tp->rcv_nxt -
++ meta_tp->copied_seq, 0),
++ sock_i_ino(meta_sk));
++ seq_putc(seq, '\n');
++ }
++
++ rcu_read_unlock_bh();
++ }
++
++ return 0;
++}
++
++static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_pm_seq_show);
++}
++
++static const struct file_operations mptcp_pm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_pm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_pm_init_net(struct net *net)
++{
++ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
++ return -ENOMEM;
++
++ return 0;
++}
++
++static void mptcp_pm_exit_net(struct net *net)
++{
++ remove_proc_entry("mptcp", net->proc_net);
++}
++
++static struct pernet_operations mptcp_pm_proc_ops = {
++ .init = mptcp_pm_init_net,
++ .exit = mptcp_pm_exit_net,
++};
++
++/* General initialization of mptcp */
++void __init mptcp_init(void)
++{
++ int i;
++ struct ctl_table_header *mptcp_sysctl;
++
++ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
++ sizeof(struct mptcp_tcp_sock),
++ 0, SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_sock_cache)
++ goto mptcp_sock_cache_failed;
++
++ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_cb_cache)
++ goto mptcp_cb_cache_failed;
++
++ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_tw_cache)
++ goto mptcp_tw_cache_failed;
++
++ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
++
++ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
++ if (!mptcp_wq)
++ goto alloc_workqueue_failed;
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
++ i + MPTCP_REQSK_NULLS_BASE);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
++ }
++
++ spin_lock_init(&mptcp_reqsk_hlock);
++ spin_lock_init(&mptcp_tk_hashlock);
++
++ if (register_pernet_subsys(&mptcp_pm_proc_ops))
++ goto pernet_failed;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (mptcp_pm_v6_init())
++ goto mptcp_pm_v6_failed;
++#endif
++ if (mptcp_pm_v4_init())
++ goto mptcp_pm_v4_failed;
++
++ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
++ if (!mptcp_sysctl)
++ goto register_sysctl_failed;
++
++ if (mptcp_register_path_manager(&mptcp_pm_default))
++ goto register_pm_failed;
++
++ if (mptcp_register_scheduler(&mptcp_sched_default))
++ goto register_sched_failed;
++
++ pr_info("MPTCP: Stable release v0.89.0-rc\n");
++
++ mptcp_init_failed = false;
++
++ return;
++
++register_sched_failed:
++ mptcp_unregister_path_manager(&mptcp_pm_default);
++register_pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl);
++register_sysctl_failed:
++ mptcp_pm_v4_undo();
++mptcp_pm_v4_failed:
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_pm_v6_undo();
++mptcp_pm_v6_failed:
++#endif
++ unregister_pernet_subsys(&mptcp_pm_proc_ops);
++pernet_failed:
++ destroy_workqueue(mptcp_wq);
++alloc_workqueue_failed:
++ kmem_cache_destroy(mptcp_tw_cache);
++mptcp_tw_cache_failed:
++ kmem_cache_destroy(mptcp_cb_cache);
++mptcp_cb_cache_failed:
++ kmem_cache_destroy(mptcp_sock_cache);
++mptcp_sock_cache_failed:
++ mptcp_init_failed = true;
++}
+diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
+new file mode 100644
+index 000000000000..3a54413ce25b
+--- /dev/null
++++ b/net/mptcp/mptcp_fullmesh.c
+@@ -0,0 +1,1722 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#include <net/addrconf.h>
++#endif
++
++enum {
++ MPTCP_EVENT_ADD = 1,
++ MPTCP_EVENT_DEL,
++ MPTCP_EVENT_MOD,
++};
++
++#define MPTCP_SUBFLOW_RETRY_DELAY 1000
++
++/* Max number of local or remote addresses we can store.
++ * When changing, see the bitfield below in fullmesh_rem4/6.
++ */
++#define MPTCP_MAX_ADDR 8
++
++struct fullmesh_rem4 {
++ u8 rem4_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct fullmesh_rem6 {
++ u8 rem6_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_loc_addr {
++ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
++ u8 loc4_bits;
++ u8 next_v4_index;
++
++ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
++ u8 loc6_bits;
++ u8 next_v6_index;
++};
++
++struct mptcp_addr_event {
++ struct list_head list;
++ unsigned short family;
++ u8 code:7,
++ low_prio:1;
++ union inet_addr addr;
++};
++
++struct fullmesh_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++ /* Delayed worker, when the routing-tables are not yet ready. */
++ struct delayed_work subflow_retry_work;
++
++ /* Remote addresses */
++ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
++ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
++
++ struct mptcp_cb *mpcb;
++
++ u16 remove_addrs; /* Addresses to remove */
++ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
++ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
++
++ u8 add_addr; /* Are we sending an add_addr? */
++
++ u8 rem4_bits;
++ u8 rem6_bits;
++};
++
++struct mptcp_fm_ns {
++ struct mptcp_loc_addr __rcu *local;
++ spinlock_t local_lock; /* Protecting the above pointer */
++ struct list_head events;
++ struct delayed_work address_worker;
++
++ struct net *net;
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly;
++
++static void full_mesh_create_subflows(struct sock *meta_sk);
++
++static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
++{
++ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
++}
++
++static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
++{
++ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
++}
++
++/* Find the first free index in the bitfield */
++static int __mptcp_find_free_index(u8 bitfield, u8 base)
++{
++ int i;
++
++ /* There are no free bits anyway... */
++ if (bitfield == 0xff)
++ goto exit;
++
++ i = ffs(~(bitfield >> base)) - 1;
++ if (i < 0)
++ goto exit;
++
++ /* No free bits when starting at base, try from 0 on */
++ if (i + base >= sizeof(bitfield) * 8)
++ return __mptcp_find_free_index(bitfield, 0);
++
++ return i + base;
++exit:
++ return -1;
++}
++
++static int mptcp_find_free_index(u8 bitfield)
++{
++ return __mptcp_find_free_index(bitfield, 0);
++}
++
++static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
++ const struct in_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem4 *rem4;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is already in the list --- continue */
++ if (rem4->rem4_id == id &&
++ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
++ return;
++
++ /* This can happen when the peer is behind a NAT: it is
++ * trying to JOIN, thus sending the JOIN with a certain ID,
++ * but the src_addr of the IP packet has been rewritten. We
++ * update the addr in the list, because this is the address
++ * as our host sees it.
++ */
++ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
++ __func__, &rem4->addr.s_addr,
++ &addr->s_addr, id);
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem4_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
++ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
++ return;
++ }
++
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is not known yet, store it */
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ rem4->bitfield = 0;
++ rem4->retry_bitfield = 0;
++ rem4->rem4_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem4_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem6 *rem6;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is already in the list --- continue */
++ if (rem6->rem6_id == id &&
++ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
++ return;
++
++ /* This can happen when the peer is behind a NAT: it is
++ * trying to JOIN, thus sending the JOIN with a certain ID,
++ * but the src_addr of the IP packet has been rewritten. We
++ * update the addr in the list, because this is the address
++ * as our host sees it.
++ */
++ if (rem6->rem6_id == id) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
++ __func__, &rem6->addr, addr, id);
++ rem6->addr = *addr;
++ rem6->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem6_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
++ __func__, MPTCP_MAX_ADDR, addr);
++ return;
++ }
++
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is not known yet, store it */
++ rem6->addr = *addr;
++ rem6->port = port;
++ rem6->bitfield = 0;
++ rem6->retry_bitfield = 0;
++ rem6->rem6_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem6_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].rem4_id == id) {
++ /* remove address from bitfield */
++ fmp->rem4_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (fmp->remaddr6[i].rem6_id == id) {
++ /* remove address from bitfield */
++ fmp->rem6_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
++ const struct in_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
++ fmp->remaddr4[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
++ fmp->remaddr6[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
++ else
++ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
++}
++
++static void retry_subflow_worker(struct work_struct *work)
++{
++ struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct fullmesh_priv *fmp = container_of(delayed_work,
++ struct fullmesh_priv,
++ subflow_retry_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, i;
++
++ /* We need a local (stable) copy of the address-list. It is not a
++ * big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
++
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
++ goto next_subflow;
++ }
++ }
++#endif
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets.
++ *
++ * This function uses a goto next_subflow to release the lock between
++ * new subflows, giving other processes a chance to do some work on the
++ * socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, retry = 0;
++ int i;
++
++ /* We need a local (stable) copy of the address-list. It is not a
++ * big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr4[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
++ &rem4) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr6[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
++ &rem6) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++#endif
++
++ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
++ sock_hold(meta_sk);
++ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
++ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
++ }
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct sock *sk = mptcp_select_ack_sock(meta_sk);
++
++ fmp->remove_addrs |= (1 << addr_id);
++ mpcb->addr_signal = 1;
++
++ if (sk)
++ tcp_send_ack(sk);
++}
++
++static void update_addr_bitfields(struct sock *meta_sk,
++ const struct mptcp_loc_addr *mptcp_local)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ int i;
++
++ /* The bits in announced_addrs_* always match with loc*_bits, so a
++ * simple & operation unsets the correct bits, because these go from
++ * announced to non-announced.
++ */
++ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
++ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
++ }
++
++ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
++ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
++ }
++}
++
++static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
++ sa_family_t family, const union inet_addr *addr)
++{
++ int i;
++ u8 loc_bits;
++ bool found = false;
++
++ if (family == AF_INET)
++ loc_bits = mptcp_local->loc4_bits;
++ else
++ loc_bits = mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(loc_bits, i) {
++ if (family == AF_INET &&
++ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
++ found = true;
++ break;
++ }
++ if (family == AF_INET6 &&
++ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
++ &addr->in6)) {
++ found = true;
++ break;
++ }
++ }
++
++ if (!found)
++ return -1;
++
++ return i;
++}
++
++static void mptcp_address_worker(struct work_struct *work)
++{
++ const struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
++ struct mptcp_fm_ns,
++ address_worker);
++ struct net *net = fm_ns->net;
++ struct mptcp_addr_event *event = NULL;
++ struct mptcp_loc_addr *mptcp_local, *old;
++ int i, id = -1; /* id is used in the socket-code on a delete-event */
++ bool success; /* Used to indicate if we succeeded handling the event */
++
++next_event:
++ success = false;
++ kfree(event);
++
++ /* First, let's dequeue an event from our event-list */
++ rcu_read_lock_bh();
++ spin_lock(&fm_ns->local_lock);
++
++ event = list_first_entry_or_null(&fm_ns->events,
++ struct mptcp_addr_event, list);
++ if (!event) {
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++ return;
++ }
++
++ list_del(&event->list);
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
++
++ /* Not in the list - so we don't care */
++ if (id < 0) {
++ mptcp_debug("%s could not find id\n", __func__);
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET)
++ mptcp_local->loc4_bits &= ~(1 << id);
++ else
++ mptcp_local->loc6_bits &= ~(1 << id);
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ } else {
++ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
++ int j = i;
++
++ if (j < 0) {
++ /* Not in the list, so we have to find an empty slot */
++ if (event->family == AF_INET)
++ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
++ mptcp_local->next_v4_index);
++ if (event->family == AF_INET6)
++ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
++ mptcp_local->next_v6_index);
++
++ if (i < 0) {
++ mptcp_debug("%s no more space\n", __func__);
++ goto duno;
++ }
++
++ /* It might have been a MOD-event. */
++ event->code = MPTCP_EVENT_ADD;
++ } else {
++ /* Let's check if anything changes */
++ if (event->family == AF_INET &&
++ event->low_prio == mptcp_local->locaddr4[i].low_prio)
++ goto duno;
++
++ if (event->family == AF_INET6 &&
++ event->low_prio == mptcp_local->locaddr6[i].low_prio)
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET) {
++ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
++ mptcp_local->locaddr4[i].loc4_id = i + 1;
++ mptcp_local->locaddr4[i].low_prio = event->low_prio;
++ } else {
++ mptcp_local->locaddr6[i].addr = event->addr.in6;
++ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
++ mptcp_local->locaddr6[i].low_prio = event->low_prio;
++ }
++
++ if (j < 0) {
++ if (event->family == AF_INET) {
++ mptcp_local->loc4_bits |= (1 << i);
++ mptcp_local->next_v4_index = i + 1;
++ } else {
++ mptcp_local->loc6_bits |= (1 << i);
++ mptcp_local->next_v6_index = i + 1;
++ }
++ }
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ }
++ success = true;
++
++duno:
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++
++ if (!success)
++ goto next_event;
++
++ /* Now we iterate over the MPTCP-sockets and apply the event. */
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ const struct hlist_nulls_node *node;
++ struct tcp_sock *meta_tp;
++
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
++ tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ if (sock_net(meta_sk) != net)
++ continue;
++
++ if (meta_v4) {
++ /* skip IPv6 events if meta is IPv4 */
++ if (event->family == AF_INET6)
++ continue;
++ }
++ /* skip IPv4 events if IPV6_V6ONLY is set */
++ else if (event->family == AF_INET &&
++ inet6_sk(meta_sk)->ipv6only)
++ continue;
++
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ continue;
++
++ bh_lock_sock(meta_sk);
++
++ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
++ mpcb->infinite_mapping_snd ||
++ mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping)
++ goto next;
++
++ /* The path-manager may have changed in the meantime */
++ if (mpcb->pm_ops != &full_mesh)
++ goto next;
++
++ if (sock_owned_by_user(meta_sk)) {
++ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
++ &meta_tp->tsq_flags))
++ sock_hold(meta_sk);
++
++ goto next;
++ }
++
++ if (event->code == MPTCP_EVENT_ADD) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++
++ full_mesh_create_subflows(meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ struct sock *sk, *tmpsk;
++ struct mptcp_loc_addr *mptcp_local;
++ bool found = false;
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ /* In any case, we need to update our bitfields */
++ if (id >= 0)
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ /* Look for the socket and remove it */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ if ((event->family == AF_INET6 &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))) ||
++ (event->family == AF_INET &&
++ (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))))
++ continue;
++
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
++ continue;
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
++ continue;
++
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ /* We announce the removal of this id */
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
++
++ mptcp_sub_force_close(sk);
++ found = true;
++ }
++
++ if (found)
++ goto next;
++
++ /* The id may have been given by the event,
++ * matching on a local address, and may not
++ * have matched any of the above sockets
++ * because the client never created a subflow.
++ * So, we still have to remove it here.
++ */
++ if (id > 0)
++ announce_remove_addr(id, meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_MOD) {
++ struct sock *sk;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++ }
++ }
++next:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++ }
++ rcu_read_unlock_bh();
++ }
++ goto next_event;
++}
++
++static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
++ const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ list_for_each_entry(eventq, &fm_ns->events, list) {
++ if (eventq->family != event->family)
++ continue;
++ if (event->family == AF_INET) {
++ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
++ return eventq;
++ } else {
++ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
++ return eventq;
++ }
++ }
++ return NULL;
++}
++
++/* We already hold the net-namespace MPTCP-lock */
++static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ if (eventq) {
++ switch (event->code) {
++ case MPTCP_EVENT_DEL:
++ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
++ list_del(&eventq->list);
++ kfree(eventq);
++ break;
++ case MPTCP_EVENT_ADD:
++ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_ADD;
++ return;
++ case MPTCP_EVENT_MOD:
++ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_MOD;
++ return;
++ }
++ }
++
++ /* OK, we have to add the new address to the wait queue */
++ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
++ if (!eventq)
++ return;
++
++ list_add_tail(&eventq->list, &fm_ns->events);
++
++ /* Queue the address-worker, unless it is already pending */
++ if (!delayed_work_pending(&fm_ns->address_worker))
++ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
++ msecs_to_jiffies(500));
++}
++
++static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->ifa_dev->dev;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->ifa_scope > RT_SCOPE_LINK ||
++ ipv4_is_loopback(ifa->ifa_local))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET;
++ mpevent.addr.in.s_addr = ifa->ifa_local;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
++ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv4-addr add/rem-events */
++static int mptcp_pm_inetaddr_event(struct notifier_block *this,
++ unsigned long event, void *ptr)
++{
++ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
++ struct net *net = dev_net(ifa->ifa_dev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ addr4_event_handler(ifa, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_inetaddr_notifier = {
++ .notifier_call = mptcp_pm_inetaddr_event,
++};
++
++#if IS_ENABLED(CONFIG_IPV6)
++
++/* IPV6-related address/interface watchers */
++struct mptcp_dad_data {
++ struct timer_list timer;
++ struct inet6_ifaddr *ifa;
++};
++
++static void dad_callback(unsigned long arg);
++static int inet6_addr_event(struct notifier_block *this,
++ unsigned long event, void *ptr);
++
++static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++{
++ return (ifa->flags & IFA_F_TENTATIVE) &&
++ ifa->state == INET6_IFADDR_STATE_DAD;
++}
++
++static void dad_init_timer(struct mptcp_dad_data *data,
++ struct inet6_ifaddr *ifa)
++{
++ data->ifa = ifa;
++ data->timer.data = (unsigned long)data;
++ data->timer.function = dad_callback;
++ if (ifa->idev->cnf.rtr_solicit_delay)
++ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
++ else
++ data->timer.expires = jiffies + (HZ/10);
++}
++
++static void dad_callback(unsigned long arg)
++{
++ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
++
++ if (ipv6_is_in_dad_state(data->ifa)) {
++ dad_init_timer(data, data->ifa);
++ add_timer(&data->timer);
++ } else {
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++ in6_ifa_put(data->ifa);
++ kfree(data);
++ }
++}
++
++static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
++{
++ struct mptcp_dad_data *data;
++
++ data = kmalloc(sizeof(*data), GFP_ATOMIC);
++
++ if (!data)
++ return;
++
++ init_timer(&data->timer);
++ dad_init_timer(data, ifa);
++ add_timer(&data->timer);
++ in6_ifa_hold(ifa);
++}
++
++static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->idev->dev;
++ int addr_type = ipv6_addr_type(&ifa->addr);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->scope > RT_SCOPE_LINK ||
++ addr_type == IPV6_ADDR_ANY ||
++ (addr_type & IPV6_ADDR_LOOPBACK) ||
++ (addr_type & IPV6_ADDR_LINKLOCAL))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET6;
++ mpevent.addr.in6 = ifa->addr;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
++ &ifa->addr, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv6-addr add/rem-events */
++static int inet6_addr_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
++ struct net *net = dev_net(ifa6->idev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ if (ipv6_is_in_dad_state(ifa6))
++ dad_setup_timer(ifa6);
++ else
++ addr6_event_handler(ifa6, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block inet6_addr_notifier = {
++ .notifier_call = inet6_addr_event,
++};
++
++#endif
++
++/* React on ifup/down-events */
++static int netdev_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
++ struct in_device *in_dev;
++#if IS_ENABLED(CONFIG_IPV6)
++ struct inet6_dev *in6_dev;
++#endif
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ rcu_read_lock();
++ in_dev = __in_dev_get_rtnl(dev);
++
++ if (in_dev) {
++ for_ifa(in_dev) {
++ mptcp_pm_inetaddr_event(NULL, event, ifa);
++ } endfor_ifa(in_dev);
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ in6_dev = __in6_dev_get(dev);
++
++ if (in6_dev) {
++ struct inet6_ifaddr *ifa6;
++ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
++ inet6_addr_event(NULL, event, ifa6);
++ }
++#endif
++
++ rcu_read_unlock();
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_netdev_notifier = {
++ .notifier_call = netdev_event,
++};
++
++static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
++ else
++ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
++}
++
++static void full_mesh_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int i, index;
++ union inet_addr saddr, daddr;
++ sa_family_t family;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ /* Init local variables necessary for the rest */
++ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
++ saddr.ip = inet_sk(meta_sk)->inet_saddr;
++ daddr.ip = inet_sk(meta_sk)->inet_daddr;
++ family = AF_INET;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ saddr.in6 = inet6_sk(meta_sk)->saddr;
++ daddr.in6 = meta_sk->sk_v6_daddr;
++ family = AF_INET6;
++#endif
++ }
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, &saddr);
++ if (index < 0)
++ goto fallback;
++
++ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
++ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* Look for the address among the local addresses */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET && saddr.ip == ifa_address)
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto skip_ipv6;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv6:
++#endif
++
++ rcu_read_unlock();
++
++ if (family == AF_INET)
++ fmp->announced_addrs_v4 |= (1 << index);
++ else
++ fmp->announced_addrs_v6 |= (1 << index);
++
++ for (i = fmp->add_addr; i && fmp->add_addr; i--)
++ tcp_send_ack(mpcb->master_sk);
++
++ return;
++
++fallback:
++ rcu_read_unlock();
++ mptcp_fallback_default(mpcb);
++ return;
++}
++
++static void full_mesh_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ return;
++
++ if (!work_pending(&fmp->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &fmp->subflow_work);
++ }
++}
++
++/* Called upon release_sock, if the socket was owned by the user during
++ * a path-management event.
++ */
++static void full_mesh_release_sock(struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ struct sock *sk, *tmpsk;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++ int i;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* First, detect modifications or additions */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto removal;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++removal:
++#endif
++
++ /* Now, detect address-removals */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ bool shall_remove = true;
++
++ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
++ shall_remove = false;
++ break;
++ }
++ }
++ } else {
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
++ shall_remove = false;
++ break;
++ }
++ }
++ }
++
++ if (shall_remove) {
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
++ meta_sk);
++
++ mptcp_sub_force_close(sk);
++ }
++ }
++
++ /* Just call it optimistically. It actually cannot do any harm */
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ rcu_read_unlock();
++}
++
++static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int index, id = -1;
++
++ /* Handle the backup-flows */
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, addr);
++
++ if (index != -1) {
++ if (family == AF_INET) {
++ id = mptcp_local->locaddr4[index].loc4_id;
++ *low_prio = mptcp_local->locaddr4[index].low_prio;
++ } else {
++ id = mptcp_local->locaddr6[index].loc6_id;
++ *low_prio = mptcp_local->locaddr6[index].low_prio;
++ }
++ }
++
++
++ rcu_read_unlock();
++
++ return id;
++}
++
++static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
++ int remove_addr_len;
++ u8 unannouncedv4 = 0, unannouncedv6 = 0;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ mpcb->addr_signal = 0;
++
++ if (likely(!fmp->add_addr))
++ goto remove_addr;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* IPv4 */
++ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
++ if (unannouncedv4 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv4);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
++ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
++ opts->add_addr_v4 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v4 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
++ }
++
++ if (meta_v4)
++ goto skip_ipv6;
++
++skip_ipv4:
++ /* IPv6 */
++ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
++ if (unannouncedv6 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv6);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
++ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
++ opts->add_addr_v6 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v6 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
++ }
++
++skip_ipv6:
++ rcu_read_unlock();
++
++ if (!unannouncedv4 && !unannouncedv6 && skb)
++ fmp->add_addr--;
++
++remove_addr:
++ if (likely(!fmp->remove_addrs))
++ goto exit;
++
++ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
++ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
++ goto exit;
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_REMOVE_ADDR;
++ opts->remove_addrs = fmp->remove_addrs;
++ *size += remove_addr_len;
++ if (skb)
++ fmp->remove_addrs = 0;
++
++exit:
++ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
++}
++
++static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
++{
++ mptcp_v4_rem_raddress(mpcb, rem_id);
++ mptcp_v6_rem_raddress(mpcb, rem_id);
++}
++
++/* Output /proc/net/mptcp_fullmesh */
++static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
++{
++ const struct net *net = seq->private;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int i;
++
++ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
++
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
++ loc4->low_prio, &loc4->addr);
++ }
++
++ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
++ loc6->low_prio, &loc6->addr);
++ }
++ rcu_read_unlock_bh();
++
++ return 0;
++}
++
++static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_fm_seq_show);
++}
++
++static const struct file_operations mptcp_fm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_fm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_fm_init_net(struct net *net)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns;
++ int err = 0;
++
++ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
++ if (!fm_ns)
++ return -ENOBUFS;
++
++ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
++ if (!mptcp_local) {
++ err = -ENOBUFS;
++ goto err_mptcp_local;
++ }
++
++ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
++ &mptcp_fm_seq_fops)) {
++ err = -ENOMEM;
++ goto err_seq_fops;
++ }
++
++ mptcp_local->next_v4_index = 1;
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
++ INIT_LIST_HEAD(&fm_ns->events);
++ spin_lock_init(&fm_ns->local_lock);
++ fm_ns->net = net;
++ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
++
++ return 0;
++err_seq_fops:
++ kfree(mptcp_local);
++err_mptcp_local:
++ kfree(fm_ns);
++ return err;
++}
++
++static void mptcp_fm_exit_net(struct net *net)
++{
++ struct mptcp_addr_event *eventq, *tmp;
++ struct mptcp_fm_ns *fm_ns;
++ struct mptcp_loc_addr *mptcp_local;
++
++ fm_ns = fm_get_ns(net);
++ cancel_delayed_work_sync(&fm_ns->address_worker);
++
++ rcu_read_lock_bh();
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ kfree(mptcp_local);
++
++ spin_lock(&fm_ns->local_lock);
++ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
++ list_del(&eventq->list);
++ kfree(eventq);
++ }
++ spin_unlock(&fm_ns->local_lock);
++
++ rcu_read_unlock_bh();
++
++ remove_proc_entry("mptcp_fullmesh", net->proc_net);
++
++ kfree(fm_ns);
++}
++
++static struct pernet_operations full_mesh_net_ops = {
++ .init = mptcp_fm_init_net,
++ .exit = mptcp_fm_exit_net,
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly = {
++ .new_session = full_mesh_new_session,
++ .release_sock = full_mesh_release_sock,
++ .fully_established = full_mesh_create_subflows,
++ .new_remote_address = full_mesh_create_subflows,
++ .get_local_id = full_mesh_get_local_id,
++ .addr_signal = full_mesh_addr_signal,
++ .add_raddr = full_mesh_add_raddr,
++ .rem_raddr = full_mesh_rem_raddr,
++ .name = "fullmesh",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init full_mesh_register(void)
++{
++ int ret;
++
++ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
++
++ ret = register_pernet_subsys(&full_mesh_net_ops);
++ if (ret)
++ goto out;
++
++ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ if (ret)
++ goto err_reg_inetaddr;
++ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ if (ret)
++ goto err_reg_netdev;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ ret = register_inet6addr_notifier(&inet6_addr_notifier);
++ if (ret)
++ goto err_reg_inet6addr;
++#endif
++
++ ret = mptcp_register_path_manager(&full_mesh);
++ if (ret)
++ goto err_reg_pm;
++
++out:
++ return ret;
++
++
++err_reg_pm:
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++err_reg_inet6addr:
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++err_reg_netdev:
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++err_reg_inetaddr:
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ goto out;
++}
++
++static void full_mesh_unregister(void)
++{
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ mptcp_unregister_path_manager(&full_mesh);
++}
++
++module_init(full_mesh_register);
++module_exit(full_mesh_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("Full-Mesh MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
+new file mode 100644
+index 000000000000..43704ccb639e
+--- /dev/null
++++ b/net/mptcp/mptcp_input.c
+@@ -0,0 +1,2405 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <asm/unaligned.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++
++#include <linux/kconfig.h>
++
++/* is seq1 < seq2 ? */
++static inline bool before64(const u64 seq1, const u64 seq2)
++{
++ return (s64)(seq1 - seq2) < 0;
++}
++
++/* is seq1 > seq2 ? */
++#define after64(seq1, seq2) before64(seq2, seq1)
++
++static inline void mptcp_become_fully_estab(struct sock *sk)
++{
++ tcp_sk(sk)->mptcp->fully_established = 1;
++
++ if (is_master_tp(tcp_sk(sk)) &&
++ tcp_sk(sk)->mpcb->pm_ops->fully_established)
++ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
++}
++
++/* Similar to tcp_tso_acked without any memory accounting */
++static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 packets_acked, len;
++
++ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
++
++ packets_acked = tcp_skb_pcount(skb);
++
++ if (skb_unclone(skb, GFP_ATOMIC))
++ return 0;
++
++ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++ skb->truesize -= len;
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
++ packets_acked -= tcp_skb_pcount(skb);
++
++ if (packets_acked) {
++ BUG_ON(tcp_skb_pcount(skb) == 0);
++ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
++ }
++
++ return packets_acked;
++}
++
++/**
++ * Cleans the meta-socket retransmission queue and the reinject-queue.
++ * @meta_sk must be the meta-socket.
++ */
++static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
++{
++ struct sk_buff *skb, *tmp;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ bool acked = false;
++ u32 acked_pcount;
++
++ while ((skb = tcp_write_queue_head(meta_sk)) &&
++ skb != tcp_send_head(meta_sk)) {
++ bool fully_acked = true;
++
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ acked_pcount = tcp_tso_acked(meta_sk, skb);
++ if (!acked_pcount)
++ break;
++
++ fully_acked = false;
++ } else {
++ acked_pcount = tcp_skb_pcount(skb);
++ }
++
++ acked = true;
++ meta_tp->packets_out -= acked_pcount;
++ meta_tp->retrans_stamp = 0;
++
++ if (!fully_acked)
++ break;
++
++ tcp_unlink_write_queue(skb, meta_sk);
++
++ if (mptcp_is_data_fin(skb)) {
++ struct sock *sk_it;
++
++ /* DATA_FIN has been acknowledged - now we can close
++ * the subflows
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ unsigned long delay = 0;
++
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++ sk_wmem_free_skb(meta_sk, skb);
++ }
++ /* Remove acknowledged data from the reinject queue */
++ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ mptcp_tso_acked_reinject(meta_sk, skb);
++ break;
++ }
++
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ }
++
++ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
++ meta_tp->snd_up = meta_tp->snd_una;
++
++ if (acked) {
++ tcp_rearm_rto(meta_sk);
++ /* Normally this is done in tcp_try_undo_loss - but MPTCP
++ * does not call this function.
++ */
++ inet_csk(meta_sk)->icsk_retransmits = 0;
++ }
++}
++
++/* Inspired by tcp_rcv_state_process */
++static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
++ const struct sk_buff *skb, u32 data_seq,
++ u16 data_len)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ const struct tcphdr *th = tcp_hdr(skb);
++
++	/* State-machine handling if FIN has been enqueued and it has
++	 * been acked (snd_una == write_seq) - it's important that this
++	 * happens after sk_wmem_free_skb because otherwise
++	 * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
++	 */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1: {
++ struct dst_entry *dst;
++ int tmo;
++
++ if (meta_tp->snd_una != meta_tp->write_seq)
++ break;
++
++ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
++ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
++
++ dst = __sk_dst_get(sk);
++ if (dst)
++ dst_confirm(dst);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ /* Wake up lingering close() */
++ meta_sk->sk_state_change(meta_sk);
++ break;
++ }
++
++ if (meta_tp->linger2 < 0 ||
++ (data_len &&
++ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
++ meta_tp->rcv_nxt))) {
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_done(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ return 1;
++ }
++
++ tmo = tcp_fin_time(meta_sk);
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
++ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
++			/* Bad case. We could lose such a FIN otherwise.
++			 * It is not a big problem, but it looks confusing
++			 * and is not so rare an event. We still can lose it now,
++			 * if it spins in bh_lock_sock(), but it is really a
++			 * marginal case.
++			 */
++ inet_csk_reset_keepalive_timer(meta_sk, tmo);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
++ }
++ break;
++ }
++ case TCP_CLOSING:
++ case TCP_LAST_ACK:
++ if (meta_tp->snd_una == meta_tp->write_seq) {
++ tcp_done(meta_sk);
++ return 1;
++ }
++ break;
++ }
++
++ /* step 7: process the segment text */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1:
++ case TCP_FIN_WAIT2:
++ /* RFC 793 says to queue data in these states,
++ * RFC 1122 says we MUST send a reset.
++ * BSD 4.4 also does reset.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp_is_data_fin2(skb, tp)) {
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_reset(meta_sk);
++ return 1;
++ }
++ }
++ break;
++ }
++
++ return 0;
++}
++
++/**
++ * @return:
++ * i) 1: Everything's fine.
++ * ii) -1: A reset has been sent on the subflow - csum-failure
++ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
++ * Last packet should not be destroyed by the caller because it has
++ * been done here.
++ */
++static int mptcp_verif_dss_csum(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1, *last = NULL;
++ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
++ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
++ int iter = 0;
++
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
++ unsigned int csum_len;
++
++ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
++ /* Mapping ends in the middle of the packet -
++ * csum only these bytes
++ */
++ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
++ else
++ csum_len = tmp->len;
++
++ offset = 0;
++ if (overflowed) {
++ char first_word[4];
++ first_word[0] = 0;
++ first_word[1] = 0;
++ first_word[2] = 0;
++ first_word[3] = *(tmp->data);
++ csum_tcp = csum_partial(first_word, 4, csum_tcp);
++ offset = 1;
++ csum_len--;
++ overflowed = 0;
++ }
++
++ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
++
++		/* Was it an odd length? Then we have to merge the next byte
++		 * correctly (see above)
++		 */
++ if (csum_len != (csum_len & (~1)))
++ overflowed = 1;
++
++ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
++ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
++
++ /* If a 64-bit dss is present, we increase the offset
++ * by 4 bytes, as the high-order 64-bits will be added
++ * in the final csum_partial-call.
++ */
++ u32 offset = skb_transport_offset(tmp) +
++ TCP_SKB_CB(tmp)->dss_off;
++ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
++ offset += 4;
++
++ csum_tcp = skb_checksum(tmp, offset,
++ MPTCP_SUB_LEN_SEQ_CSUM,
++ csum_tcp);
++
++ csum_tcp = csum_partial(&data_seq,
++ sizeof(data_seq), csum_tcp);
++
++ dss_csum_added = 1; /* Just do it once */
++ }
++ last = tmp;
++ iter++;
++
++ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
++ !before(TCP_SKB_CB(tmp1)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ /* Now, checksum must be 0 */
++ if (unlikely(csum_fold(csum_tcp))) {
++ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
++ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
++ dss_csum_added, overflowed, iter);
++
++ tp->mptcp->send_mp_fail = 1;
++
++ /* map_data_seq is the data-seq number of the
++ * mapping we are currently checking
++ */
++ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
++
++ if (tp->mpcb->cnt_subflows > 1) {
++ mptcp_send_reset(sk);
++ ans = -1;
++ } else {
++ tp->mpcb->send_infinite_mapping = 1;
++
++ /* Need to purge the rcv-queue as it's no more valid */
++ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
++ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
++ kfree_skb(tmp);
++ }
++
++ ans = 0;
++ }
++ }
++
++ return ans;
++}
++
++static inline void mptcp_prepare_skb(struct sk_buff *skb,
++ const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 inc = 0;
++
++ /* If skb is the end of this mapping (end is always at mapping-boundary
++ * thanks to the splitting/trimming), then we need to increase
++ * data-end-seq by 1 if this here is a data-fin.
++ *
++ * We need to do -1 because end_seq includes the subflow-FIN.
++ */
++ if (tp->mptcp->map_data_fin &&
++ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
++ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ inc = 1;
++
++ /* We manually set the fin-flag if it is a data-fin. For easy
++ * processing in tcp_recvmsg.
++ */
++ tcp_hdr(skb)->fin = 1;
++ } else {
++ /* We may have a subflow-fin with data but without data-fin */
++ tcp_hdr(skb)->fin = 0;
++ }
++
++ /* Adapt data-seq's to the packet itself. We kinda transform the
++ * dss-mapping to a per-packet granularity. This is necessary to
++ * correctly handle overlapping mappings coming from different
++ * subflows. Otherwise it would be a complete mess.
++ */
++ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
++ tcb->end_seq = tcb->seq + skb->len + inc;
++}
++
++/**
++ * @return: 1 if the segment has been eaten and can be suppressed,
++ * otherwise 0.
++ */
++static inline int mptcp_direct_copy(const struct sk_buff *skb,
++ struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
++ int eaten = 0;
++
++ __set_current_state(TASK_RUNNING);
++
++ local_bh_enable();
++ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
++ meta_tp->ucopy.len -= chunk;
++ meta_tp->copied_seq += chunk;
++ eaten = (chunk == skb->len);
++ tcp_rcv_space_adjust(meta_sk);
++ }
++ local_bh_disable();
++ return eaten;
++}
++
++static inline void mptcp_reset_mapping(struct tcp_sock *tp)
++{
++ tp->mptcp->map_data_len = 0;
++ tp->mptcp->map_data_seq = 0;
++ tp->mptcp->map_subseq = 0;
++ tp->mptcp->map_data_fin = 0;
++ tp->mptcp->mapping_present = 0;
++}
++
++/* The DSS-mapping received on the sk only covers the second half of the skb
++ * (cut at seq). We trim the head from the skb.
++ * Data will be freed upon kfree().
++ *
++ * Inspired by tcp_trim_head().
++ */
++static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ int len = seq - TCP_SKB_CB(skb)->seq;
++ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
++
++ if (len < skb_headlen(skb))
++ __skb_pull(skb, len);
++ else
++ __pskb_trim_head(skb, len - skb_headlen(skb));
++
++ TCP_SKB_CB(skb)->seq = new_seq;
++
++ skb->truesize -= len;
++ atomic_sub(len, &sk->sk_rmem_alloc);
++ sk_mem_uncharge(sk, len);
++}
++
++/* The DSS-mapping received on the sk only covers the first half of the skb
++ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
++ * as further packets may resolve the mapping of the second half of data.
++ *
++ * Inspired by tcp_fragment().
++ */
++static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ struct sk_buff *buff;
++ int nsize;
++ int nlen, len;
++
++ len = seq - TCP_SKB_CB(skb)->seq;
++ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
++ if (nsize < 0)
++ nsize = 0;
++
++ /* Get a new skb... force flag on. */
++ buff = alloc_skb(nsize, GFP_ATOMIC);
++ if (buff == NULL)
++ return -ENOMEM;
++
++ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
++ skb_reset_transport_header(buff);
++
++ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
++ tcp_hdr(skb)->fin = 0;
++
++	/* We absolutely need to call skb_set_owner_r before refreshing the
++	 * truesize of buff, otherwise the moved data will account twice.
++	 */
++ skb_set_owner_r(buff, sk);
++ nlen = skb->len - len - nsize;
++ buff->truesize += nlen;
++ skb->truesize -= nlen;
++
++ /* Correct the sequence numbers. */
++ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
++ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
++ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
++
++ skb_split(skb, buff, len);
++
++ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
++ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
++ !tp->mpcb->infinite_mapping_rcv) {
++ /* Remove a pure subflow-fin from the queue and increase
++ * copied_seq.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* If we are not yet fully established and do not know the mapping for
++ * this segment, this path has to fallback to infinite or be torn down.
++ */
++ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
++ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
++ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
++ __func__, tp->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, __builtin_return_address(0),
++ TCP_SKB_CB(skb)->seq);
++
++ if (!is_master_tp(tp)) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++		/* We do a seamless fallback and should not send an infinite mapping. */
++ tp->mpcb->send_infinite_mapping = 0;
++ tp->mptcp->fully_established = 1;
++ }
++
++ /* Receiver-side becomes fully established when a whole rcv-window has
++ * been received without the need to fallback due to the previous
++ * condition.
++ */
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->init_rcv_wnd -= skb->len;
++ if (tp->mptcp->init_rcv_wnd < 0)
++ mptcp_become_fully_estab(sk);
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 *ptr;
++ u32 data_seq, sub_seq, data_len, tcp_end_seq;
++
++ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
++ * in-order at the data-level. Thus data-seq-numbers can be inferred
++ * from what is expected at the data-level.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
++ tp->mptcp->map_subseq = tcb->seq;
++ tp->mptcp->map_data_len = skb->len;
++ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
++ tp->mptcp->mapping_present = 1;
++ return 0;
++ }
++
++ /* No mapping here? Exit - it is either already set or still on its way */
++ if (!mptcp_is_data_seq(skb)) {
++ /* Too many packets without a mapping - this subflow is broken */
++ if (!tp->mptcp->mapping_present &&
++ tp->rcv_nxt - tp->copied_seq > 65536) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ return 0;
++ }
++
++ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
++ ptr++;
++ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
++ ptr++;
++ data_len = get_unaligned_be16(ptr);
++
++ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
++ * The draft sets it to 0, but we really would like to have the
++ * real value, to have an easy handling afterwards here in this
++ * function.
++ */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ sub_seq = TCP_SKB_CB(skb)->seq;
++
++ /* If there is already a mapping - we check if it maps with the current
++ * one. If not - we reset.
++ */
++ if (tp->mptcp->mapping_present &&
++ (data_seq != (u32)tp->mptcp->map_data_seq ||
++ sub_seq != tp->mptcp->map_subseq ||
++ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
++ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
++ /* Mapping in packet is different from what we want */
++ pr_err("%s Mappings do not match!\n", __func__);
++ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
++ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
++ sub_seq, tp->mptcp->map_subseq, data_len,
++ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
++ tp->mptcp->map_data_fin);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* If the previous check was good, the current mapping is valid and we exit. */
++ if (tp->mptcp->mapping_present)
++ return 0;
++
++ /* Mapping not yet set on this subflow - we set it here! */
++
++ if (!data_len) {
++ mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++ /* We need to repeat mp_fail's until the sender fell
++ * back to infinite-mapping - here we stop repeating it.
++ */
++ tp->mptcp->send_mp_fail = 0;
++
++ /* We have to fixup data_len - it must be the same as skb->len */
++ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
++ sub_seq = tcb->seq;
++
++ /* TODO kill all other subflows than this one */
++ /* data_seq and so on are set correctly */
++
++ /* At this point, the meta-ofo-queue has to be emptied,
++ * as the following data is guaranteed to be in-order at
++ * the data and subflow-level
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ }
++
++ /* We are sending mp-fail's and thus are in fallback mode.
++ * Ignore packets which do not announce the fallback and still
++ * want to provide a mapping.
++ */
++ if (tp->mptcp->send_mp_fail) {
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* FIN increased the mapping-length by 1 */
++ if (mptcp_is_data_fin(skb))
++ data_len--;
++
++ /* The subflow-sequences of the packet must be
++ * (at least partially) part of the DSS-mapping's
++ * subflow-sequence-space.
++ *
++ * Basically the mapping is not valid, if either of the
++ * following conditions is true:
++ *
++ * 1. It's not a data_fin and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * The previous two can be merged into:
++ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
++ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
++ *
++ * 3. It's a data_fin and skb->len == 0 and
++ * MPTCP-sub_seq > TCP-end_seq
++ *
++ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
++ *
++ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
++ */
++
++ /* subflow-fin is not part of the mapping - ignore it here ! */
++ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
++ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
++ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
++ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
++ before(sub_seq, tp->copied_seq)) {
++ /* The packet's subflow-sequences differ from what is in the
++ * packet's dss-mapping. The peer is misbehaving - reset
++ */
++ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
++ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
++ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
++ skb->len, data_len, tp->copied_seq);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* Did the DSS have 64-bit seqnums? */
++ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
++ /* Wrapped around? */
++ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
++ } else {
++ /* Else, access the default high-order bits */
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
++ }
++ } else {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
++
++ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
++ /* We make sure that the data_seq is invalid.
++ * It will be dropped later.
++ */
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ }
++ }
++
++ tp->mptcp->map_data_len = data_len;
++ tp->mptcp->map_subseq = sub_seq;
++ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
++ tp->mptcp->mapping_present = 1;
++
++ return 0;
++}
++
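The mapping checks above lean heavily on the kernel's wrap-around-safe sequence comparisons. A standalone sketch of the `before()`/`after()` idiom (the real helpers live in `include/net/tcp.h`; `seq_before`/`seq_after` are illustrative names for this sketch):

```c
#include <stdint.h>

/* Re-implementation, for illustration only, of the kernel's before()/after()
 * helpers: casting the unsigned difference to a signed 32-bit value makes the
 * comparison correct even when the sequence space has wrapped. */
static inline int seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

static inline int seq_after(uint32_t seq2, uint32_t seq1)
{
	return seq_before(seq1, seq2);
}
```

This is why checks like `before(sub_seq, tp->copied_seq)` stay valid across the 2^32 boundary, while a plain `<` would not.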
++/* Similar to tcp_sequence(...) */
++static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
++ u64 data_seq, u64 end_data_seq)
++{
++ const struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u64 rcv_wup64;
++
++ /* Wrap-around? */
++ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
++ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
++ meta_tp->rcv_wup;
++ } else {
++ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_wup);
++ }
++
++ return !before64(end_data_seq, rcv_wup64) &&
++ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
++}
++
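`mptcp_get_data_seq_64()` is not part of this hunk; based on how `rcv_high_order[]` and `rcv_hiseq_index` are used in `mptcp_sequence()` above, it presumably extends a 32-bit DSS sequence number with the tracked high-order word. A minimal sketch under that assumption (`data_seq_64` is a hypothetical name):

```c
#include <stdint.h>

/* Assumed layout: extend a 32-bit DSS sequence number to 64 bits by
 * prepending the high-order word tracked for the given index. */
static inline uint64_t data_seq_64(const uint32_t high_order[2], int index,
				   uint32_t data_seq32)
{
	return ((uint64_t)high_order[index] << 32) | data_seq32;
}
```

Under this reading, the wrap-around branch in `mptcp_sequence()` simply picks `high_order[index] - 1` for `rcv_wup`, because `rcv_wup` still belongs to the previous 32-bit epoch.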
++/* @return: 0 everything is fine. Just continue processing
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1;
++ u32 tcp_end_seq;
++
++ if (!tp->mptcp->mapping_present)
++ return 0;
++
++ /* Either the new skb gave us the mapping and the first segment
++ * in the sub-rcv-queue has to be trimmed ...
++ */
++ tmp = skb_peek(&sk->sk_receive_queue);
++ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
++ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
++ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
++
++ /* ... or the new skb (tail) has to be split at the end. */
++ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
++ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
++ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
++ /* TODO : maybe handle this here better.
++ * We now just force meta-retransmission.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++ }
++
++ /* Now, remove old sk_buff's from the receive-queue.
++ * This may happen if the mapping has been lost for these segments and
++ * the next mapping has already been received.
++ */
++ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
++ break;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++
++ /* It is impossible that we could free the skb here, because its
++ * mapping is known to be valid from the previous checks
++ */
++ __kfree_skb(tmp1);
++ }
++ }
++
++ return 0;
++}
++
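The head-trim/tail-split decisions in `mptcp_validate_mapping()` reduce to two wrap-around-safe interval checks of the segment `[seq, end_seq)` against the mapping `[map_subseq, map_subseq + map_data_len)`. A hypothetical helper showing just the arithmetic:

```c
#include <stdint.h>

/* Illustrative only: report whether a segment's head lies before the
 * mapping start (trim it) or its tail runs past the mapping end (split
 * it off). Signed-difference comparisons handle sequence wrap. */
static void map_boundaries(uint32_t seq, uint32_t end_seq,
			   uint32_t map_subseq, uint32_t map_len,
			   int *trim_head, int *split_tail)
{
	uint32_t map_end = map_subseq + map_len;

	*trim_head = (int32_t)(seq - map_subseq) < 0 &&
		     (int32_t)(end_seq - map_subseq) > 0;
	*split_tail = (int32_t)(end_seq - map_end) > 0;
}
```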
++/* @return: 0 everything is fine. Just continue processing
++ * 1 the subflow is broken - stop everything
++ * -1 this mapping has been put in the meta-receive-queue
++ * -2 this mapping has been eaten by the application
++ */
++static int mptcp_queue_skb(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sk_buff *tmp, *tmp1;
++ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
++ bool data_queued = false;
++
++ /* Have we not yet received the full mapping? */
++ if (!tp->mptcp->mapping_present ||
++ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ return 0;
++
++ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
++ * OR
++ * This mapping is out of window
++ */
++ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
++ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
++ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ mptcp_reset_mapping(tp);
++
++ return -1;
++ }
++
++ /* Record it, because we want to send our data_fin on the same path */
++ if (tp->mptcp->map_data_fin) {
++ mpcb->dfin_path_index = tp->mptcp->path_index;
++ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
++ }
++
++ /* Verify the checksum */
++ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
++ int ret = mptcp_verif_dss_csum(sk);
++
++ if (ret <= 0) {
++ mptcp_reset_mapping(tp);
++ return 1;
++ }
++ }
++
++ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
++ /* Seg's have to go to the meta-ofo-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true later.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
++ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
++ else
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ tcp_enter_quickack_mode(sk);
++ } else {
++ /* Ready for the meta-rcv-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ int eaten = 0;
++ bool copied_early = false;
++ bool fragstolen = false;
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ /* This segment has already been received */
++ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
++ __kfree_skb(tmp1);
++ goto next;
++ }
++
++#ifdef CONFIG_NET_DMA
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ tmp1->len <= meta_tp->ucopy.len &&
++ sock_owned_by_user(meta_sk) &&
++ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
++ copied_early = true;
++ eaten = 1;
++ }
++#endif
++
++ /* Is direct copy possible ? */
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
++ !copied_early)
++ eaten = mptcp_direct_copy(tmp1, meta_sk);
++
++ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
++ eaten = 1;
++
++ if (!eaten)
++ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
++
++ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
++#endif
++
++ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
++ mptcp_fin(meta_sk);
++
++ /* Check if this fills a gap in the ofo queue */
++ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
++ mptcp_ofo_queue(meta_sk);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
++ tmp1);
++ else
++#endif
++ if (eaten)
++ kfree_skb_partial(tmp1, fragstolen);
++
++ data_queued = true;
++next:
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ }
++
++ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
++ mptcp_reset_mapping(tp);
++
++ return data_queued ? -1 : -2;
++}
++
++void mptcp_data_ready(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct sk_buff *skb, *tmp;
++ int queued = 0;
++
++ /* restart before the check, because mptcp_fin might have changed the
++ * state.
++ */
++restart:
++ /* If the meta cannot receive data, there is no point in pushing data.
++ * If we are in time-wait, we may still be waiting for the final FIN.
++ * So, we should proceed with the processing.
++ */
++ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
++ skb_queue_purge(&sk->sk_receive_queue);
++ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
++ goto exit;
++ }
++
++ /* Iterate over all segments, detect their mapping (if we don't have
++ * one yet), validate them and push everything one level higher.
++ */
++ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
++ int ret;
++ /* Pre-validation - e.g., early fallback */
++ ret = mptcp_prevalidate_skb(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Set the current mapping */
++ ret = mptcp_detect_mapping(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Validation */
++ if (mptcp_validate_mapping(sk, skb) < 0)
++ goto restart;
++
++ /* Push a level higher */
++ ret = mptcp_queue_skb(sk);
++ if (ret < 0) {
++ if (ret == -1)
++ queued = ret;
++ goto restart;
++ } else if (ret == 0) {
++ continue;
++ } else { /* ret == 1 */
++ break;
++ }
++ }
++
++exit:
++ if (tcp_sk(sk)->close_it) {
++ tcp_send_ack(sk);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
++ }
++
++ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_data_ready(meta_sk);
++}
++
++
++int mptcp_check_req(struct sk_buff *skb, struct net *net)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct sock *meta_sk = NULL;
++
++ /* MPTCP structures not initialized */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (skb->protocol == htons(ETH_P_IP))
++ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr, net);
++#if IS_ENABLED(CONFIG_IPV6)
++ else /* IPv6 */
++ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, net);
++#endif /* CONFIG_IPV6 */
++
++ if (!meta_sk)
++ return 0;
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_search_req */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
++ return 1;
++}
++
++struct mp_join *mptcp_find_join(const struct sk_buff *skb)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether JOIN is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return NULL;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return NULL;
++ if (opsize > length)
++ return NULL; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
++ return (struct mp_join *)(ptr - 2);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return NULL;
++}
++
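The option walk in `mptcp_find_join()` is the standard RFC 793 TCP option scan. A self-contained version over a bare byte buffer (`find_tcp_option` is an illustrative name; it adds a truncation guard before reading the length byte):

```c
#include <stddef.h>

/* Sketch of the option walk: return a pointer to the start of the first
 * option with the wanted kind, or NULL. Kinds follow RFC 793:
 * 0 = EOL (end of option list), 1 = NOP (single-byte padding). */
static const unsigned char *find_tcp_option(const unsigned char *ptr,
					    int length, int wanted_kind)
{
	while (length > 0) {
		int opcode = *ptr++;
		int opsize;

		switch (opcode) {
		case 0:			/* TCPOPT_EOL */
			return NULL;
		case 1:			/* TCPOPT_NOP */
			length--;
			continue;
		default:
			if (length < 2)
				return NULL;
			opsize = *ptr++;
			if (opsize < 2 || opsize > length)
				return NULL; /* silly or partial option */
			if (opcode == wanted_kind)
				return ptr - 2;
			ptr += opsize - 2;
			length -= opsize;
		}
	}
	return NULL;
}
```

The kernel version additionally dereferences the match as `struct mp_join *` after checking the MPTCP subtype field.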
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
++{
++ const struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++ struct mp_join *join_opt = mptcp_find_join(skb);
++ if (!join_opt)
++ return 0;
++
++ /* MPTCP structures were not initialized, so return error */
++ if (mptcp_init_failed)
++ return -1;
++
++ token = join_opt->u.syn.token;
++ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ mpcb = tcp_sk(meta_sk)->mpcb;
++ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
++ /* We are in fallback-mode on the reception-side -
++ * no new subflows!
++ */
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ /* Coming from time-wait-sock processing in tcp_v4_rcv.
++ * We have to deschedule it before continuing, because otherwise
++ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
++ */
++ if (tw) {
++ inet_twsk_deschedule(tw, &tcp_death_row);
++ inet_twsk_put(tw);
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 1;
++}
++
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net)
++{
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++
++ token = mopt->mptcp_rem_token;
++ meta_sk = mptcp_hash_find(net, token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock(meta_sk);
++
++ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
++ * call tcp_vX_send_reset, because we already hold two socket-locks.
++ * (the listener and the meta from above)
++ *
++ * And the send-reset will try to take yet another one (ip_send_reply).
++ * Thus, we propagate the reset up to tcp_rcv_state_process.
++ */
++ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
++ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
++ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ else
++ /* Must make sure that upper layers won't free the
++ * skb if it is added to the backlog-queue.
++ */
++ skb_get(skb);
++ } else {
++ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
++ * the skb will finally be freed by tcp_v4_do_rcv (where we are
++ * coming from)
++ */
++ skb_get(skb);
++ if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ }
++
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 0;
++}
++
++/**
++ * Equivalent of tcp_fin() for MPTCP
++ * Can be called only when the FIN is validly part
++ * of the data seqnum space - not before, while we still have holes.
++ */
++void mptcp_fin(struct sock *meta_sk)
++{
++ struct sock *sk = NULL, *sk_it;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
++ sk = sk_it;
++ break;
++ }
++ }
++
++ if (!sk || sk->sk_state == TCP_CLOSE)
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ inet_csk_schedule_ack(sk);
++
++ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
++ sock_set_flag(meta_sk, SOCK_DONE);
++
++ switch (meta_sk->sk_state) {
++ case TCP_SYN_RECV:
++ case TCP_ESTABLISHED:
++ /* Move to CLOSE_WAIT */
++ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
++ inet_csk(sk)->icsk_ack.pingpong = 1;
++ break;
++
++ case TCP_CLOSE_WAIT:
++ case TCP_CLOSING:
++ /* Received a retransmission of the FIN, do
++ * nothing.
++ */
++ break;
++ case TCP_LAST_ACK:
++ /* RFC793: Remain in the LAST-ACK state. */
++ break;
++
++ case TCP_FIN_WAIT1:
++ /* This case occurs when a simultaneous close
++ * happens, we must ack the received FIN and
++ * enter the CLOSING state.
++ */
++ tcp_send_ack(sk);
++ tcp_set_state(meta_sk, TCP_CLOSING);
++ break;
++ case TCP_FIN_WAIT2:
++ /* Received a FIN -- send ACK and enter TIME_WAIT. */
++ tcp_send_ack(sk);
++ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
++ break;
++ default:
++ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
++ * cases we should never reach this piece of code.
++ */
++ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
++ meta_sk->sk_state);
++ break;
++ }
++
++ /* It _is_ possible that we have something out-of-order _after_ the FIN.
++ * Probably we should reset in this case. For now, drop them.
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ sk_mem_reclaim(meta_sk);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++
++ /* Do not send POLL_HUP for half duplex close. */
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
++ meta_sk->sk_state == TCP_CLOSE)
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
++ else
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
++ }
++
++ return;
++}
++
++static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ if (!meta_tp->packets_out)
++ return;
++
++ tcp_for_write_queue(skb, meta_sk) {
++ if (skb == tcp_send_head(meta_sk))
++ break;
++
++ if (mptcp_retransmit_skb(meta_sk, skb))
++ return;
++
++ if (skb == tcp_write_queue_head(meta_sk))
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ inet_csk(meta_sk)->icsk_rto,
++ TCP_RTO_MAX);
++ }
++}
++
++/* Handle the DATA_ACK */
++static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 prior_snd_una = meta_tp->snd_una;
++ int prior_packets;
++ u32 nwin, data_ack, data_seq;
++ u16 data_len = 0;
++
++ /* A valid packet came in - subflow is operational again */
++ tp->pf = 0;
++
++ /* Even if there is no data-ack, we stop retransmitting.
++ * Except if this is a SYN/ACK. Then it is just a retransmission
++ */
++ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ }
++
++ /* If we are in infinite mapping mode, rx_opt.data_ack has been
++ * set by mptcp_clean_rtx_infinite.
++ */
++ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
++ goto exit;
++
++ data_ack = tp->mptcp->rx_opt.data_ack;
++
++ if (unlikely(!tp->mptcp->fully_established) &&
++ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
++ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
++ * includes a data-ack, we are fully established
++ */
++ mptcp_become_fully_estab(sk);
++
++ /* Get the data_seq */
++ if (mptcp_is_data_seq(skb)) {
++ data_seq = tp->mptcp->rx_opt.data_seq;
++ data_len = tp->mptcp->rx_opt.data_len;
++ } else {
++ data_seq = meta_tp->snd_wl1;
++ }
++
++ /* If the ack is older than previous acks
++ * then we can probably ignore it.
++ */
++ if (before(data_ack, prior_snd_una))
++ goto exit;
++
++ /* If the ack includes data we haven't sent yet, discard
++ * this segment (RFC793 Section 3.9).
++ */
++ if (after(data_ack, meta_tp->snd_nxt))
++ goto exit;
++
++ /*** Now, update the window - inspired by tcp_ack_update_window ***/
++ nwin = ntohs(tcp_hdr(skb)->window);
++
++ if (likely(!tcp_hdr(skb)->syn))
++ nwin <<= tp->rx_opt.snd_wscale;
++
++ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
++ tcp_update_wl(meta_tp, data_seq);
++
++ /* Draft v09, Section 3.3.5:
++ * [...] It should only update its local receive window values
++ * when the largest sequence number allowed (i.e. DATA_ACK +
++ * receive window) increases. [...]
++ */
++ if (meta_tp->snd_wnd != nwin &&
++ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
++ meta_tp->snd_wnd = nwin;
++
++ if (nwin > meta_tp->max_window)
++ meta_tp->max_window = nwin;
++ }
++ }
++ /*** Done, update the window ***/
++
++ /* We passed data and got it acked, remove any soft error
++ * log. Something worked...
++ */
++ sk->sk_err_soft = 0;
++ inet_csk(meta_sk)->icsk_probes_out = 0;
++ meta_tp->rcv_tstamp = tcp_time_stamp;
++ prior_packets = meta_tp->packets_out;
++ if (!prior_packets)
++ goto no_queue;
++
++ meta_tp->snd_una = data_ack;
++
++ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
++
++ /* We are in loss-state, and something got acked, retransmit the whole
++ * queue now!
++ */
++ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
++ after(data_ack, prior_snd_una)) {
++ mptcp_xmit_retransmit_queue(meta_sk);
++ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
++ }
++
++ /* Simplified version of tcp_new_space, because the snd-buffer
++ * is handled by all the subflows.
++ */
++ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
++ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
++ if (meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ meta_sk->sk_write_space(meta_sk);
++ }
++
++ if (meta_sk->sk_state != TCP_ESTABLISHED &&
++ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
++ return;
++
++exit:
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++
++no_queue:
++ if (tcp_send_head(meta_sk))
++ tcp_ack_probe(meta_sk);
++
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++}
++
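The window update in `mptcp_data_ack()` reuses `tcp_may_update_window()` from net/ipv4/tcp_input.c. A sketch of that acceptance test with the wrap-around-safe comparisons written out (a standalone re-statement, not the kernel code):

```c
#include <stdint.h>

/* Accept the advertised window if the ACK advances snd_una, or carries a
 * newer sequence than snd_wl1, or re-announces a larger window for the
 * same data_seq. */
static int may_update_window(uint32_t snd_una, uint32_t snd_wl1,
			     uint32_t snd_wnd, uint32_t ack,
			     uint32_t ack_seq, uint32_t nwin)
{
	return (int32_t)(ack - snd_una) > 0 ||
	       (int32_t)(ack_seq - snd_wl1) > 0 ||
	       (ack_seq == snd_wl1 && nwin > snd_wnd);
}
```

Note that `nwin` is the already-scaled window: as in the hunk above, the raw 16-bit header field is shifted by `snd_wscale` for everything except SYN segments.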
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
++
++ if (!tp->mpcb->infinite_mapping_snd)
++ return;
++
++ /* The difference between both write_seq's represents the offset between
++ * data-sequence and subflow-sequence. As we are infinite, this must
++ * match.
++ *
++ * Thus, from this difference we can infer the meta snd_una.
++ */
++ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
++ tp->snd_una;
++
++ mptcp_data_ack(sk, skb);
++}
++
++/**** static functions used by mptcp_parse_options */
++
++static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
++{
++ struct sock *sk_it, *tmpsk;
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
++ mptcp_reinject_data(sk_it, 0);
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
++ GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++}
++
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
++
++ /* If the socket is mp-capable we would have a mopt. */
++ if (!mopt)
++ return;
++
++ switch (mp_opt->sub) {
++ case MPTCP_SUB_CAPABLE:
++ {
++ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
++ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
++ mptcp_debug("%s: mp_capable: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (!sysctl_mptcp_enabled)
++ break;
++
++ /* We only support MPTCP version 0 */
++ if (mpcapable->ver != 0)
++ break;
++
++ /* MPTCP-RFC 6824:
++ * "If receiving a message with the 'B' flag set to 1, and this
++ * is not understood, then this SYN MUST be silently ignored;
++ */
++ if (mpcapable->b) {
++ mopt->drop_me = 1;
++ break;
++ }
++
++ /* MPTCP-RFC 6824:
++ * "An implementation that only supports this method MUST set
++ * bit "H" to 1, and bits "C" through "G" to 0."
++ */
++ if (!mpcapable->h)
++ break;
++
++ mopt->saw_mpc = 1;
++ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
++
++ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
++ mopt->mptcp_key = mpcapable->sender_key;
++
++ break;
++ }
++ case MPTCP_SUB_JOIN:
++ {
++ const struct mp_join *mpjoin = (struct mp_join *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
++ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
++ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
++ mptcp_debug("%s: mp_join: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* saw_mpc must be set, because in tcp_check_req we assume that
++ * it is set to support falling back to reg. TCP if a rexmitted
++ * SYN has no MP_CAPABLE or MP_JOIN
++ */
++ switch (opsize) {
++ case MPTCP_SUB_LEN_JOIN_SYN:
++ mopt->is_mp_join = 1;
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_rem_token = mpjoin->u.syn.token;
++ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_SYNACK:
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
++ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_ACK:
++ mopt->saw_mpc = 1;
++ mopt->join_ack = 1;
++ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
++ break;
++ }
++ break;
++ }
++ case MPTCP_SUB_DSS:
++ {
++ const struct mp_dss *mdss = (struct mp_dss *)ptr;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++
++ /* We check opsize for the csum and non-csum case. We do this,
++ * because the draft says that the csum SHOULD be ignored if
++ * it has not been negotiated in the MP_CAPABLE but still is
++ * present in the data.
++ *
++ * It will get ignored later in mptcp_queue_skb.
++ */
++ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
++ opsize != mptcp_sub_len_dss(mdss, 1)) {
++ mptcp_debug("%s: mp_dss: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ ptr += 4;
++
++ if (mdss->A) {
++ tcb->mptcp_flags |= MPTCPHDR_ACK;
++
++ if (mdss->a) {
++ mopt->data_ack = (u32) get_unaligned_be64(ptr);
++ ptr += MPTCP_SUB_LEN_ACK_64;
++ } else {
++ mopt->data_ack = get_unaligned_be32(ptr);
++ ptr += MPTCP_SUB_LEN_ACK;
++ }
++ }
++
++ tcb->dss_off = (ptr - skb_transport_header(skb));
++
++ if (mdss->M) {
++ if (mdss->m) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
++ mopt->data_seq = (u32) data_seq64;
++
++ ptr += 12; /* 64-bit dseq + subseq */
++ } else {
++ mopt->data_seq = get_unaligned_be32(ptr);
++ ptr += 8; /* 32-bit dseq + subseq */
++ }
++ mopt->data_len = get_unaligned_be16(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ /* Is a checksum present? */
++ if (opsize == mptcp_sub_len_dss(mdss, 1))
++ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
++
++ /* DATA_FIN only possible with DSS-mapping */
++ if (mdss->F)
++ tcb->mptcp_flags |= MPTCPHDR_FIN;
++ }
++
++ break;
++ }
++ case MPTCP_SUB_ADD_ADDR:
++ {
++#if IS_ENABLED(CONFIG_IPV6)
++ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
++#endif /* CONFIG_IPV6 */
++ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* We have to manually parse the options if we got two of them. */
++ if (mopt->saw_add_addr) {
++ mopt->more_add_addr = 1;
++ break;
++ }
++ mopt->saw_add_addr = 1;
++ mopt->add_addr_ptr = ptr;
++ break;
++ }
++ case MPTCP_SUB_REMOVE_ADDR:
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
++ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (mopt->saw_rem_addr) {
++ mopt->more_rem_addr = 1;
++ break;
++ }
++ mopt->saw_rem_addr = 1;
++ mopt->rem_addr_ptr = ptr;
++ break;
++ case MPTCP_SUB_PRIO:
++ {
++ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_PRIO &&
++ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
++ mptcp_debug("%s: mp_prio: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->saw_low_prio = 1;
++ mopt->low_prio = mpprio->b;
++
++ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
++ mopt->saw_low_prio = 2;
++ mopt->prio_addr_id = mpprio->addr_id;
++ }
++ break;
++ }
++ case MPTCP_SUB_FAIL:
++ if (opsize != MPTCP_SUB_LEN_FAIL) {
++ mptcp_debug("%s: mp_fail: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++ mopt->mp_fail = 1;
++ break;
++ case MPTCP_SUB_FCLOSE:
++ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
++ mptcp_debug("%s: mp_fclose: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->mp_fclose = 1;
++ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
++
++ break;
++ default:
++ mptcp_debug("%s: Received unknown subtype: %d\n",
++ __func__, mp_opt->sub);
++ break;
++ }
++}
++
++/** Parse only MPTCP options */
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++ const unsigned char *ptr = (const unsigned char *)(th + 1);
++
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP)
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++}
++
++int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *sk;
++ u32 rtt_max = 0;
++
++ /* In MPTCP, we take the max delay across all flows,
++ * in order to take into account meta-reordering buffers.
++ */
++ mptcp_for_each_sk(mpcb, sk) {
++ if (!mptcp_sk_can_recv(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
++ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
++ }
++ if (time < (rtt_max >> 3) || !rtt_max)
++ return 1;
++
++ return 0;
++}
++
++static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ __be16 port = 0;
++ union inet_addr addr;
++ sa_family_t family;
++
++ if (mpadd->ipver == 4) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++ port = mpadd->u.v4.port;
++ family = AF_INET;
++ addr.in = mpadd->u.v4.addr;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (mpadd->ipver == 6) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
++ port = mpadd->u.v6.port;
++ family = AF_INET6;
++ addr.in6 = mpadd->u.v6.addr;
++#endif /* CONFIG_IPV6 */
++ } else {
++ return;
++ }
++
++ if (mpcb->pm_ops->add_raddr)
++ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
++}
++
++static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ int i;
++ u8 rem_id;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
++ rem_id = (&mprem->addrs_id)[i];
++
++ if (mpcb->pm_ops->rem_raddr)
++ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
++ mptcp_send_reset_rem_id(mpcb, rem_id);
++ }
++}
++
++static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether ADD_ADDR is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP:
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2)
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++#endif /* CONFIG_IPV6 */
++ goto cont;
++
++ mptcp_handle_add_addr(ptr, sk);
++ }
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
++ goto cont;
++
++ mptcp_handle_rem_addr(ptr, sk);
++ }
++cont:
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return;
++}
++
++static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
++{
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (unlikely(mptcp->rx_opt.mp_fail)) {
++ mptcp->rx_opt.mp_fail = 0;
++
++ if (!th->rst && !mpcb->infinite_mapping_snd) {
++ struct sock *sk_it;
++
++ mpcb->send_infinite_mapping = 1;
++ /* We resend everything that has not been acknowledged */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++
++ /* We artificially restart the whole send-queue. Thus,
++ * it is as if no packets are in flight
++ */
++ tcp_sk(meta_sk)->packets_out = 0;
++
++ /* If the snd_nxt already wrapped around, we have to
++ * undo the wrapping, as we are restarting from snd_una
++ * on.
++ */
++ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ }
++ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
++
++ /* Trigger a sending on the meta. */
++ mptcp_push_pending_frames(meta_sk);
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (sk != sk_it)
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++
++ return 0;
++ }
++
++ if (unlikely(mptcp->rx_opt.mp_fclose)) {
++ struct sock *sk_it, *tmpsk;
++
++ mptcp->rx_opt.mp_fclose = 0;
++ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
++ return 0;
++
++ if (tcp_need_reset(sk->sk_state))
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
++ mptcp_sub_force_close(sk_it);
++
++ tcp_reset(meta_sk);
++
++ return 1;
++ }
++
++ return 0;
++}
++
++static inline void mptcp_path_array_check(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++
++ if (unlikely(mpcb->list_rcvd)) {
++ mpcb->list_rcvd = 0;
++ if (mpcb->pm_ops->new_remote_address)
++ mpcb->pm_ops->new_remote_address(meta_sk);
++ }
++}
++
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
++ return 0;
++
++ if (mptcp_mp_fail_rcvd(sk, th))
++ return 1;
++
++ /* RFC 6824, Section 3.3:
++ * If a checksum is not present when its use has been negotiated, the
++ * receiver MUST close the subflow with a RST as it is considered broken.
++ */
++ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
++ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
++ if (tcp_need_reset(sk->sk_state))
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* We have to acknowledge retransmissions of the third
++ * ack.
++ */
++ if (mopt->join_ack) {
++ tcp_send_delayed_ack(sk);
++ mopt->join_ack = 0;
++ }
++
++ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
++ if (mopt->more_add_addr || mopt->more_rem_addr) {
++ mptcp_parse_addropt(skb, sk);
++ } else {
++ if (mopt->saw_add_addr)
++ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
++ if (mopt->saw_rem_addr)
++ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
++ }
++
++ mopt->more_add_addr = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->saw_rem_addr = 0;
++ }
++ if (mopt->saw_low_prio) {
++ if (mopt->saw_low_prio == 1) {
++ tp->mptcp->rcv_low_prio = mopt->low_prio;
++ } else {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
++ if (mptcp->rem_id == mopt->prio_addr_id)
++ mptcp->rcv_low_prio = mopt->low_prio;
++ }
++ }
++ mopt->saw_low_prio = 0;
++ }
++
++ mptcp_data_ack(sk, skb);
++
++ mptcp_path_array_check(mptcp_meta_sk(sk));
++ /* Socket may have been mp_killed by a REMOVE_ADDR */
++ if (tp->mp_killed)
++ return 1;
++
++ return 0;
++}
++
++/* In case of fastopen, some data can already be in the write queue.
++ * We need to update the sequence number of the segments as they
++ * were initially TCP sequence numbers.
++ */
++static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
++ struct sk_buff *skb;
++ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
++
++ /* There should only be one skb in write queue: the data not
++ * acknowledged in the SYN+ACK. In this case, we need to map
++ * this data to data sequence numbers.
++ */
++ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
++ /* If the server only partially acknowledged the data sent in
++ * the SYN, we need to trim the acknowledged part, because
++ * we don't want to retransmit already-received data.
++ * When we reach this point, tcp_ack() has already cleaned up
++ * fully acked segments. However, tcp trims partially acked
++ * segments only when retransmitting. Since MPTCP comes into
++ * play only now, we will fake an initial transmit, and
++ * retransmit_skb() will not be called. The following fragment
++ * comes from __tcp_retransmit_skb().
++ */
++ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
++ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
++ master_tp->snd_una));
++ /* tcp_trim_head can only return ENOMEM if skb is
++ * cloned. It is not the case here (see
++ * tcp_send_syn_data).
++ */
++ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
++ TCP_SKB_CB(skb)->seq));
++ }
++
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* We can advance write_seq by the number of bytes unacknowledged
++ * and that were mapped in the previous loop.
++ */
++ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
++
++ /* The packets from the master_sk will be transferred to it later.
++ * Until that time, its write queue is empty, and
++ * write_seq must align with snd_una.
++ */
++ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
++ master_tp->packets_out = 0;
++
++ /* Although this data has already been sent over the subsk,
++ * it has never been sent over the meta_sk, so we rewind
++ * the send_head so that tcp considers it an initial send
++ * (instead of a retransmit).
++ */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++}
++
++/* The skptr is needed, because if we become MPTCP-capable, we have to switch
++ * from meta-socket to master-socket.
++ *
++ * @return: 1 - we want to reset this connection
++ * 2 - we want to discard the received syn/ack
++ * 0 - everything is fine - continue
++ */
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (mptcp(tp)) {
++ u8 hash_mac_check[20];
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++ if (memcmp(hash_mac_check,
++ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* Set this flag in order to postpone data sending
++ * until the 4th ack arrives.
++ */
++ tp->mptcp->pre_established = 1;
++ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u32 *)&tp->mptcp->sender_mac[0]);
++
++ } else if (mopt->saw_mpc) {
++ struct sock *meta_sk = sk;
++
++ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
++ ntohs(tcp_hdr(skb)->window)))
++ return 2;
++
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ *skptr = sk;
++ tp = tcp_sk(sk);
++
++ /* If fastopen was used data might be in the send queue. We
++ * need to update their sequence number to MPTCP-level seqno.
++ * Note that it can happen in rare cases that fastopen_req is
++ * NULL and syn_data is 0 but fastopen indeed occurred and
++ * data has been queued in the write queue (but not sent).
++ * Example of such rare cases: connect is non-blocking and
++ * TFO is configured to work without cookies.
++ */
++ if (!skb_queue_empty(&meta_sk->sk_write_queue))
++ mptcp_rcv_synsent_fastopen(meta_sk);
++
++ /* -1, because the SYN consumed 1 byte. In case of TFO, we
++ * start the subflow-sequence number as if the data of the SYN
++ * is not part of any mapping.
++ */
++ tp->mptcp->snt_isn = tp->snd_una - 1;
++ tp->mpcb->dss_csum = mopt->dss_csum;
++ tp->mptcp->include_mpc = 1;
++
++ /* Ensure that fastopen is handled at the meta-level. */
++ tp->fastopen_req = NULL;
++
++ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
++ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
++
++ /* hold in sk_clone_lock due to initialization to 2 */
++ sock_put(sk);
++ } else {
++ tp->request_mptcp = 0;
++
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++ }
++
++ if (mptcp(tp))
++ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++bool mptcp_should_expand_sndbuf(const struct sock *sk)
++{
++ const struct sock *sk_it;
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int cnt_backups = 0;
++ int backup_available = 0;
++
++ /* We circumvent this check in tcp_check_space, because we want to
++ * always call sk_write_space. So, we reproduce the check here.
++ */
++ if (!meta_sk->sk_socket ||
++ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ return false;
++
++ /* If the user specified a specific send buffer setting, do
++ * not modify it.
++ */
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return false;
++
++ /* If we are under global TCP memory pressure, do not expand. */
++ if (sk_under_memory_pressure(meta_sk))
++ return false;
++
++ /* If we are under soft global TCP memory pressure, do not expand. */
++ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
++ return false;
++
++
++ /* For MPTCP we look for a subsocket that could send data.
++ * If we found one, then we update the send-buffer.
++ */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ /* Backup-flows have to be counted - if there is no other
++ * subflow we take the backup-flow into account.
++ */
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
++ cnt_backups++;
++
++ if (tp_it->packets_out < tp_it->snd_cwnd) {
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
++ backup_available = 1;
++ continue;
++ }
++ return true;
++ }
++ }
++
++ /* Backup-flow is available for sending - update send-buffer */
++ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
++ return true;
++ return false;
++}
++
++void mptcp_init_buffer_space(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int space;
++
++ tcp_init_buffer_space(sk);
++
++ if (is_master_tp(tp)) {
++ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
++ meta_tp->rcvq_space.time = tcp_time_stamp;
++ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
++
++ /* If there is only one subflow, we just use regular TCP
++ * autotuning. User-locks are handled already by
++ * tcp_init_buffer_space
++ */
++ meta_tp->window_clamp = tp->window_clamp;
++ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
++ meta_sk->sk_sndbuf = sk->sk_sndbuf;
++
++ return;
++ }
++
++ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
++ goto snd_buf;
++
++ /* Adding a new subflow to the rcv-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
++ if (space > meta_sk->sk_rcvbuf) {
++ meta_tp->window_clamp += tp->window_clamp;
++ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = space;
++ }
++
++snd_buf:
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return;
++
++ /* Adding a new subflow to the send-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
++ if (space > meta_sk->sk_sndbuf) {
++ meta_sk->sk_sndbuf = space;
++ meta_sk->sk_write_space(meta_sk);
++ }
++}
++
++void mptcp_tcp_set_rto(struct sock *sk)
++{
++ tcp_set_rto(sk);
++ mptcp_set_rto(sk);
++}
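The option-parsing loops above (tcp_parse_mptcp_options() and mptcp_parse_addropt()) both follow the standard TCP option TLV walk: EOL ends parsing, NOP occupies a single byte, and every other option carries a kind byte followed by a length byte. A rough userspace sketch of that walk, outside the kernel (the function name count_mptcp_options is hypothetical; only the EOL/NOP/length handling mirrors the patch):

```c
/* Sketch of the TCP option walk used by tcp_parse_mptcp_options().
 * This is a userspace illustration, not kernel code. */
#define TCPOPT_EOL   0
#define TCPOPT_NOP   1
#define TCPOPT_MPTCP 30

/* Count well-formed MPTCP options in an option block of `length` bytes. */
int count_mptcp_options(const unsigned char *ptr, int length)
{
	int found = 0;

	while (length > 0) {
		int opcode = *ptr++;
		int opsize;

		switch (opcode) {
		case TCPOPT_EOL:		/* end of option list */
			return found;
		case TCPOPT_NOP:		/* single-byte padding */
			length--;
			continue;
		default:
			opsize = *ptr++;
			if (opsize < 2)		/* "silly options" */
				return found;
			if (opsize > length)	/* don't parse partial options */
				return found;
			if (opcode == TCPOPT_MPTCP)
				found++;
		}
		ptr += opsize - 2;		/* skip the option payload */
		length -= opsize;
	}
	return found;
}
```

The two guards (opsize < 2 and opsize > length) are what keep a malformed or truncated option from running the parser off the end of the header, which is why both kernel loops bail out rather than skipping such an option.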
+diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
+new file mode 100644
+index 000000000000..1183d1305d35
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv4.c
+@@ -0,0 +1,483 @@
++/*
++ * MPTCP implementation - IPv4-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/ip.h>
++#include <linux/list.h>
++#include <linux/skbuff.h>
++#include <linux/spinlock.h>
++#include <linux/tcp.h>
++
++#include <net/inet_common.h>
++#include <net/inet_connection_sock.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/request_sock.h>
++#include <net/tcp.h>
++
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return *((u64 *)hash);
++}
++
++
++static void mptcp_v4_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v4_reqsk_destructor(req);
++}
++
++static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.ip = inet_rsk(req)->ir_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp_request_sock_ops */
++struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
++ .family = PF_INET,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_rtx_synack,
++ .send_ack = tcp_v4_reqsk_send_ack,
++ .destructor = mptcp_v4_reqsk_destructor,
++ .send_reset = tcp_v4_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++/* Similar to tcp_v4_conn_request */
++static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_join_request_sock_ipv4_ops,
++ meta_sk, skb);
++}
++
++/* We only process join requests here. (either the SYN or the final ACK) */
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct sock *sk;
++
++ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
++ iph->saddr, th->source, iph->daddr,
++ th->dest, inet_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v4_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v4_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we hold
++ * already the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v4_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet_csk_search_req(meta_sk, &prev, th->source,
++ iph->saddr, iph->daddr);
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v4_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (ireq->ir_rmt_port == rport &&
++ ireq->ir_rmt_addr == raddr &&
++ ireq->ir_loc_addr == laddr &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv4 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin_family = AF_INET;
++ rem_in.sin_family = AF_INET;
++ loc_in.sin_port = 0;
++ if (rem->port)
++ rem_in.sin_port = rem->port;
++ else
++ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin_addr = loc->addr;
++ rem_in.sin_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin_addr,
++ ntohs(loc_in.sin_port), &rem_in.sin_addr,
++ ntohs(rem_in.sin_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init4_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v4_specific = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v4_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ip_setsockopt,
++ .getsockopt = ip_getsockopt,
++ .addr2sockaddr = inet_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in),
++ .bind_conflict = inet_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ip_setsockopt,
++ .compat_getsockopt = compat_ip_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++/* General initialization of IPv4 for MPTCP */
++int mptcp_pm_v4_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp_request_sock_ops;
++
++ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
++
++ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
++ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v4_undo(void)
++{
++ kmem_cache_destroy(mptcp_request_sock_ops.slab);
++ kfree(mptcp_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
+new file mode 100644
+index 000000000000..1036973aa855
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv6.c
+@@ -0,0 +1,518 @@
++/*
++ * MPTCP implementation - IPv6-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/in6.h>
++#include <linux/kernel.h>
++
++#include <net/addrconf.h>
++#include <net/flow.h>
++#include <net/inet6_connection_sock.h>
++#include <net/inet6_hashtables.h>
++#include <net/inet_common.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/ip6_route.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
++#include <net/tcp.h>
++#include <net/transp_v6.h>
++
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v6_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v6_reqsk_destructor(req);
++}
++
++static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible: if we fail later
++ * (e.g., in get_local_id), reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove,
++ * since pprev may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp6_request_sock_ops */
++struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
++ .family = AF_INET6,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_v6_rtx_synack,
++ .send_ack = tcp_v6_reqsk_send_ack,
++ .destructor = mptcp_v6_reqsk_destructor,
++ .send_reset = tcp_v6_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyway. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_join_request_sock_ipv6_ops,
++ meta_sk, skb);
++}
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
++ struct sock *sk;
++
++ sk = __inet6_lookup_established(sock_net(meta_sk),
++ &tcp_hashinfo,
++ &ip6h->saddr, th->source,
++ &ip6h->daddr, ntohs(th->dest),
++ inet6_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v6_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v6_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v6_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet6_csk_search_req(meta_sk, &prev, th->source,
++ &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v6_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
++ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
++ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU, so it might have been
++ * recycled and put into another hash-table list. In that case the
++ * lookup may end up in a different list and we need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv6 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in6 loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin6_family = AF_INET6;
++ rem_in.sin6_family = AF_INET6;
++ loc_in.sin6_port = 0;
++ if (rem->port)
++ rem_in.sin6_port = rem->port;
++ else
++ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin6_addr = loc->addr;
++ rem_in.sin6_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin6_addr,
++ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
++ ntohs(rem_in.sin6_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in6), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init6_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v6_specific = {
++ .queue_xmit = inet6_csk_xmit,
++ .send_check = tcp_v6_send_check,
++ .rebuild_header = inet6_sk_rebuild_header,
++ .sk_rx_dst_set = inet6_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct ipv6hdr),
++ .net_frag_header_len = sizeof(struct frag_hdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_pm_v6_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
++
++ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
++
++ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
++ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v6_undo(void)
++{
++ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
++ kfree(mptcp6_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
+new file mode 100644
+index 000000000000..6f5087983175
+--- /dev/null
++++ b/net/mptcp/mptcp_ndiffports.c
+@@ -0,0 +1,161 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++
++struct ndiffports_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++};
++
++static int num_subflows __read_mostly = 2;
++module_param(num_subflows, int, 0644);
++MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets().
++ *
++ * This function uses a goto next_subflow to release the lock between
++ * new subflows, giving other processes a chance to do some work on the
++ * socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct ndiffports_priv *pm_priv = container_of(work,
++ struct ndiffports_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++ } else {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mptcp_loc6 loc;
++ struct mptcp_rem6 rem;
++
++ loc.addr = inet6_sk(meta_sk)->saddr;
++ loc.loc6_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr = meta_sk->sk_v6_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem6_id = 0; /* Default 0 */
++
++ mptcp_init6_subsockets(meta_sk, &loc, &rem);
++#endif
++ }
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void ndiffports_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++}
++
++static void ndiffports_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++static struct mptcp_pm_ops ndiffports __read_mostly = {
++ .new_session = ndiffports_new_session,
++ .fully_established = ndiffports_create_subflows,
++ .get_local_id = ndiffports_get_local_id,
++ .name = "ndiffports",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init ndiffports_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
++
++ if (mptcp_register_path_manager(&ndiffports))
++ goto exit;
++
++ return 0;
++
++exit:
++ return -1;
++}
++
++static void ndiffports_unregister(void)
++{
++ mptcp_unregister_path_manager(&ndiffports);
++}
++
++module_init(ndiffports_register);
++module_exit(ndiffports_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
+new file mode 100644
+index 000000000000..ec4e98622637
+--- /dev/null
++++ b/net/mptcp/mptcp_ofo_queue.c
+@@ -0,0 +1,295 @@
++/*
++ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <linux/slab.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp;
++
++ mptcp_for_each_tp(mpcb, tp) {
++ if (tp->mptcp->shortcut_ofoqueue == skb) {
++ tp->mptcp->shortcut_ofoqueue = NULL;
++ return;
++ }
++ }
++}
++
++/* Does 'skb' fit after 'here' in the queue 'head'?
++ * If yes, we queue it and return 1.
++ */
++static int mptcp_ofo_queue_after(struct sk_buff_head *head,
++ struct sk_buff *skb, struct sk_buff *here,
++ const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We want to queue skb after here, thus seq >= here->end_seq */
++ if (before(seq, TCP_SKB_CB(here)->end_seq))
++ return 0;
++
++ if (seq == TCP_SKB_CB(here)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
++ return 1;
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ return -1;
++ }
++ }
++
++ /* If here is the last one, we can always queue it */
++ if (skb_queue_is_last(head, here)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ } else {
++ struct sk_buff *skb1 = skb_queue_next(head, here);
++ /* It's not the last one, but does it fit between 'here' and
++ * the one after 'here'? That is, does end_seq <= after_here->seq?
++ */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ }
++ }
++
++ return 0;
++}
++
++static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
++ struct sk_buff_head *head, struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb1, *best_shortcut = NULL;
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++ u32 distance = 0xffffffff;
++
++ /* First, check the tp's shortcut */
++ if (!shortcut) {
++ if (skb_queue_empty(head)) {
++ __skb_queue_head(head, skb);
++ goto end;
++ }
++ } else {
++ /* Is the tp's shortcut a hit? If yes, we insert. */
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Check the shortcuts of the other subsockets. */
++ mptcp_for_each_tp(mpcb, tp_it) {
++ shortcut = tp_it->mptcp->shortcut_ofoqueue;
++ /* Can we queue it here? If yes, do so! */
++ if (shortcut) {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Could not queue it, check if we are close.
++ * We are looking for a shortcut, close enough to seq to
++ * set skb1 prematurely and thus improve the subsequent lookup,
++ * which tries to find a skb1 so that skb1->seq <= seq.
++ *
++ * So, here we only take shortcuts whose shortcut->seq > seq,
++ * and minimize the distance between shortcut->seq and seq and
++ * set best_shortcut to this one with the minimal distance.
++ *
++ * That way, the subsequent while-loop is shortest.
++ */
++ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
++ /* Are we closer than the current best shortcut? */
++ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
++ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
++ best_shortcut = shortcut;
++ }
++ }
++ }
++
++ if (best_shortcut)
++ skb1 = best_shortcut;
++ else
++ skb1 = skb_peek_tail(head);
++
++ if (seq == TCP_SKB_CB(skb1)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ skb = NULL;
++ }
++
++ goto end;
++ }
++
++ /* Find the insertion point, starting from best_shortcut if available.
++ *
++ * Inspired from tcp_data_queue_ofo.
++ */
++ while (1) {
++ /* skb1->seq <= seq */
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(head, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(head, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. */
++ __kfree_skb(skb);
++ skb = NULL;
++ goto end;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(head, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(head, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(head, skb);
++ else
++ __skb_queue_after(head, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(head, skb)) {
++ skb1 = skb_queue_next(head, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, head);
++ mptcp_remove_shortcuts(mpcb, skb1);
++ __kfree_skb(skb1);
++ }
++
++end:
++ if (skb) {
++ skb_set_owner_r(skb, meta_sk);
++ tp->mptcp->shortcut_ofoqueue = skb;
++ }
++
++ return;
++}
++
++/**
++ * @sk: the subflow that received this skb.
++ */
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
++ &tcp_sk(meta_sk)->out_of_order_queue, tp);
++}
++
++bool mptcp_prune_ofo_queue(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ bool res = false;
++
++ if (!skb_queue_empty(&tp->out_of_order_queue)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
++ mptcp_purge_ofo_queue(tp);
++
++ /* No sack at the mptcp-level */
++ sk_mem_reclaim(sk);
++ res = true;
++ }
++
++ return res;
++}
++
++void mptcp_ofo_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
++ break;
++
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ __kfree_skb(skb);
++ continue;
++ }
++
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++
++ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
++ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++ if (tcp_hdr(skb)->fin)
++ mptcp_fin(meta_sk);
++ }
++}
++
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
++{
++ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
++ struct sk_buff *skb, *tmp;
++
++ skb_queue_walk_safe(head, skb, tmp) {
++ __skb_unlink(skb, head);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ kfree_skb(skb);
++ }
++}
+diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
+new file mode 100644
+index 000000000000..53f5c43bb488
+--- /dev/null
++++ b/net/mptcp/mptcp_olia.c
+@@ -0,0 +1,311 @@
++/*
++ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
++ *
++ * Algorithm design:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ * Nicolas Gast <nicolas.gast@epfl.ch>
++ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
++ *
++ * Implementation:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++static int scale = 10;
++
++struct mptcp_olia {
++ u32 mptcp_loss1;
++ u32 mptcp_loss2;
++ u32 mptcp_loss3;
++ int epsilon_num;
++ u32 epsilon_den;
++ int mptcp_snd_cwnd_cnt;
++};
++
++static inline int mptcp_olia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_olia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++/* account for the artificial inflation of cwnd (see RFC5681)
++ * during the fast-retransmit phase
++ */
++static u32 mptcp_get_crt_cwnd(struct sock *sk)
++{
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (icsk->icsk_ca_state == TCP_CA_Recovery)
++ return tcp_sk(sk)->snd_ssthresh;
++ else
++ return tcp_sk(sk)->snd_cwnd;
++}
++
++/* return the denominator of the first term of the increase term */
++static u64 mptcp_get_rate(const struct mptcp_cb *mpcb , u32 path_rtt)
++{
++ struct sock *sk;
++ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u64 scaled_num;
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
++ rate += div_u64(scaled_num , tp->srtt_us);
++ }
++ rate *= rate;
++ return rate;
++}
++
++/* find the maximum cwnd, used to find set M */
++static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
++{
++ struct sock *sk;
++ u32 best_cwnd = 0;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd > best_cwnd)
++ best_cwnd = tmp_cwnd;
++ }
++ return best_cwnd;
++}
++
++static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
++{
++ struct mptcp_olia *ca;
++ struct tcp_sock *tp;
++ struct sock *sk;
++ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
++ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
++ u8 M = 0, B_not_M = 0;
++
++ /* TODO - integrate this in the following loop - we just want to iterate once */
++
++ max_cwnd = mptcp_get_max_cwnd(mpcb);
++
++ /* find the best path */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ /* TODO - check here and rename variables */
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
++ best_rtt = tmp_rtt;
++ best_int = tmp_int;
++ best_cwnd = tmp_cwnd;
++ }
++ }
++
++ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
++ /* find the size of M and B_not_M */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd == max_cwnd) {
++ M++;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
++ B_not_M++;
++ }
++ }
++
++ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ if (B_not_M == 0) {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++
++ if (tmp_cwnd < max_cwnd &&
++ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
++ ca->epsilon_num = 1;
++ ca->epsilon_den = mpcb->cnt_established * B_not_M;
++ } else if (tmp_cwnd == max_cwnd) {
++ ca->epsilon_num = -1;
++ ca->epsilon_den = mpcb->cnt_established * M;
++ } else {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++ }
++ }
++}
++
++/* setting the initial values */
++static void mptcp_olia_init(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (mptcp(tp)) {
++ ca->mptcp_loss1 = tp->snd_una;
++ ca->mptcp_loss2 = tp->snd_una;
++ ca->mptcp_loss3 = tp->snd_una;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++}
++
++/* updating inter-loss distance and ssthresh */
++static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ if (new_state == TCP_CA_Loss ||
++ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
++ !inet_csk(sk)->icsk_retransmits) {
++ ca->mptcp_loss1 = ca->mptcp_loss2;
++ ca->mptcp_loss2 = ca->mptcp_loss3;
++ }
++ }
++}
++
++/* main algorithm */
++static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ u64 inc_num, inc_den, rate, cwnd_scaled;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ ca->mptcp_loss3 = tp->snd_una;
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ /* slow start if it is in the safe area */
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ return;
++ }
++
++ mptcp_get_epsilon(mpcb);
++ rate = mptcp_get_rate(mpcb, tp->srtt_us);
++ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
++ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
++
++ /* calculate the increase term; scaling is used to reduce rounding error */
++ if (ca->epsilon_num == -1) {
++ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
++ inc_num = rate - ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt -= div64_u64(
++ mptcp_olia_scale(inc_num , scale) , inc_den);
++ } else {
++ inc_num = ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled - rate;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num , scale) , inc_den);
++ }
++ } else {
++ inc_num = ca->epsilon_num * rate +
++ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num , scale) , inc_den);
++ }
++
++
++ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
++ tp->snd_cwnd++;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
++ tp->snd_cwnd = max((int) 1 , (int) tp->snd_cwnd - 1);
++ ca->mptcp_snd_cwnd_cnt = 0;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_olia = {
++ .init = mptcp_olia_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_olia_cong_avoid,
++ .set_state = mptcp_olia_set_state,
++ .owner = THIS_MODULE,
++ .name = "olia",
++};
++
++static int __init mptcp_olia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_olia);
++}
++
++static void __exit mptcp_olia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_olia);
++}
++
++module_init(mptcp_olia_register);
++module_exit(mptcp_olia_unregister);
++
++MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
++MODULE_VERSION("0.1");
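The core of `mptcp_olia_cong_avoid` above is a fixed-point accumulator: the per-ACK increase term is scaled by `1 << scale`, summed into `mptcp_snd_cwnd_cnt`, and a full unit of credit moves `snd_cwnd` by one segment in either direction. A sketch of that arithmetic, under the assumption that `rate` and the epsilon values have already been computed as in the functions above (names and the integer-only math are illustrative):

```python
SCALE = 10  # fixed-point shift, mirrors `scale` in the module

def olia_cwnd_update(cwnd, cwnd_cnt, eps_num, eps_den, rate, cwnd_clamp):
    """One OLIA congestion-avoidance step: accumulate the scaled
    increase term and move cwnd once a full unit of credit builds up."""
    cwnd_scaled = cwnd << SCALE
    inc_den = (eps_den * cwnd * rate) or 1   # avoid a zero divisor

    if eps_num == -1:
        # path in the "max-cwnd" set M: the increase term can be negative
        if eps_den * cwnd_scaled * cwnd_scaled < rate:
            inc_num = rate - eps_den * cwnd_scaled * cwnd_scaled
            cwnd_cnt -= (inc_num << SCALE) // inc_den
        else:
            inc_num = eps_den * cwnd_scaled * cwnd_scaled - rate
            cwnd_cnt += (inc_num << SCALE) // inc_den
    else:
        inc_num = eps_num * rate + eps_den * cwnd_scaled * cwnd_scaled
        cwnd_cnt += (inc_num << SCALE) // inc_den

    # a full (1 << SCALE) of credit moves cwnd by one segment
    if cwnd_cnt >= (1 << SCALE) - 1:
        if cwnd < cwnd_clamp:
            cwnd += 1
        cwnd_cnt = 0
    elif cwnd_cnt <= -(1 << SCALE) + 1:
        cwnd = max(1, cwnd - 1)
        cwnd_cnt = 0
    return cwnd, cwnd_cnt
```

With `eps_num = 0`, `eps_den = 1` and `rate == cwnd_scaled**2` (a single path), each ACK adds roughly `(1 << SCALE) / cwnd` of credit, i.e. Reno-style 1/cwnd growth.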
+diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
+new file mode 100644
+index 000000000000..400ea254c078
+--- /dev/null
++++ b/net/mptcp/mptcp_output.c
+@@ -0,0 +1,1743 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/kconfig.h>
++#include <linux/skbuff.h>
++#include <linux/tcp.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++#include <net/sock.h>
++
++static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
++ MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++
++static inline int mptcp_sub_len_remove_addr(u16 bitfield)
++{
++ unsigned int c;
++ for (c = 0; bitfield; c++)
++ bitfield &= bitfield - 1;
++ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
++}
++
++int mptcp_sub_len_remove_addr_align(u16 bitfield)
++{
++ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
++}
++EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
++
++/* get the data-seq and end-data-seq and store them again in the
++ * tcp_skb_cb
++ */
++static int mptcp_reconstruct_mapping(struct sk_buff *skb)
++{
++ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
++ u32 *p32;
++ u16 *p16;
++
++ if (!mpdss->M)
++ return 1;
++
++ /* Move the pointer to the data-seq */
++ p32 = (u32 *)mpdss;
++ p32++;
++ if (mpdss->A) {
++ p32++;
++ if (mpdss->a)
++ p32++;
++ }
++
++ TCP_SKB_CB(skb)->seq = ntohl(*p32);
++
++ /* Get the data_len to calculate the end_data_seq */
++ p32++;
++ p32++;
++ p16 = (u16 *)p32;
++ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct sk_buff *skb_it;
++
++ skb_it = tcp_write_queue_head(meta_sk);
++
++ tcp_for_write_queue_from(skb_it, meta_sk) {
++ if (skb_it == tcp_send_head(meta_sk))
++ break;
++
++ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
++ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
++ break;
++ }
++ }
++}
++
++/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
++ * coming from the meta-retransmit-timer
++ */
++static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
++ struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb, *skb1;
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u32 seq, end_seq;
++
++ if (clone_it) {
++ /* pskb_copy is necessary here, because the TCP/IP-headers
++ * will be changed when it's going to be reinjected on another
++ * subflow.
++ */
++ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
++ } else {
++ __skb_unlink(orig_skb, &sk->sk_write_queue);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++ sk->sk_wmem_queued -= orig_skb->truesize;
++ sk_mem_uncharge(sk, orig_skb->truesize);
++ skb = orig_skb;
++ }
++ if (unlikely(!skb))
++ return;
++
++ if (sk && mptcp_reconstruct_mapping(skb)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ skb->sk = meta_sk;
++
++ /* If it already reached the destination, we don't have to reinject it */
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ /* Only reinject segments that are fully covered by the mapping */
++ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
++ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ __kfree_skb(skb);
++
++ /* Ok, now we have to look for the full mapping in the meta
++ * send-queue :S
++ */
++ tcp_for_write_queue(skb, meta_sk) {
++ /* Not yet at the mapping? */
++ if (before(TCP_SKB_CB(skb)->seq, seq))
++ continue;
++ /* We have passed by the mapping */
++ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
++ return;
++
++ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
++ }
++ return;
++ }
++
++ /* Segment goes back to the MPTCP-layer. So, we need to zero the
++ * path_mask/dss.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
++
++ /* We need to find out the path-mask from the meta-write-queue
++ * to properly select a subflow.
++ */
++ mptcp_find_and_set_pathmask(meta_sk, skb);
++
++ /* If it's empty, just add */
++ if (skb_queue_empty(&mpcb->reinject_queue)) {
++ skb_queue_head(&mpcb->reinject_queue, skb);
++ return;
++ }
++
++ /* Find the place to insert skb - or we may even 'drop' it, as the
++ * data is already covered by other skb's in the reinject-queue.
++ *
++ * This is inspired by code from tcp_data_queue.
++ */
++
++ skb1 = skb_peek_tail(&mpcb->reinject_queue);
++ seq = TCP_SKB_CB(skb)->seq;
++ while (1) {
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ end_seq = TCP_SKB_CB(skb)->end_seq;
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. Don't reinject */
++ __kfree_skb(skb);
++ return;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(&mpcb->reinject_queue, skb);
++ else
++ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
++ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, &mpcb->reinject_queue);
++ __kfree_skb(skb1);
++ }
++ return;
++}
++
++/* Inserts data into the reinject queue */
++void mptcp_reinject_data(struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb_it, *tmp;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = tp->meta_sk;
++
++ /* It has already been closed - there is really no point in reinjecting */
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return;
++
++ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
++ /* Subflow SYNs and FINs are not reinjected.
++ *
++ * Neither are empty subflow-FINs with a data-fin;
++ * those are reinjected below (without the subflow-FIN flag).
++ */
++ if (tcb->tcp_flags & TCPHDR_SYN ||
++ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
++ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
++ continue;
++
++ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
++ }
++
++ skb_it = tcp_write_queue_tail(meta_sk);
++ /* If sk has sent the empty data-fin, we have to reinject it too. */
++ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
++ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
++ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
++ }
++
++ mptcp_push_pending_frames(meta_sk);
++
++ tp->pf = 1;
++}
++EXPORT_SYMBOL(mptcp_reinject_data);
++
++static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
++ struct sock *subsk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk_it;
++ int all_empty = 1, all_acked;
++
++ /* In infinite mapping we always try to combine */
++ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ return;
++ }
++
++ /* Don't combine, if they didn't combine - otherwise we end up in
++ * TIME_WAIT, even if our app is smart enough to avoid it
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (!mpcb->dfin_combined)
++ return;
++ }
++
++ /* If no other subflow has data to send, we can combine */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ if (!tcp_write_queue_empty(sk_it))
++ all_empty = 0;
++ }
++
++ /* If all data has been DATA_ACKed, we can combine.
++ * -1, because the data_fin consumed one byte
++ */
++ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
++
++ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ }
++}
++
++static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *start = ptr;
++ __u16 data_len;
++
++ *ptr++ = htonl(tcb->seq); /* data_seq */
++
++ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ *ptr++ = 0; /* subseq */
++ else
++ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
++
++ if (tcb->mptcp_flags & MPTCPHDR_INF)
++ data_len = 0;
++ else
++ data_len = tcb->end_seq - tcb->seq;
++
++ if (tp->mpcb->dss_csum && data_len) {
++ __be16 *p16 = (__be16 *)ptr;
++ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
++ __wsum csum;
++
++ *ptr = htonl(((data_len) << 16) |
++ (TCPOPT_EOL << 8) |
++ (TCPOPT_EOL));
++ csum = csum_partial(ptr - 2, 12, skb->csum);
++ p16++;
++ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
++ } else {
++ *ptr++ = htonl(((data_len) << 16) |
++ (TCPOPT_NOP << 8) |
++ (TCPOPT_NOP));
++ }
++
++ return ptr - start;
++}
++
++static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ struct mp_dss *mdss = (struct mp_dss *)ptr;
++ __be32 *start = ptr;
++
++ mdss->kind = TCPOPT_MPTCP;
++ mdss->sub = MPTCP_SUB_DSS;
++ mdss->rsv1 = 0;
++ mdss->rsv2 = 0;
++ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
++ mdss->m = 0;
++ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
++ mdss->a = 0;
++ mdss->A = 1;
++ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
++ ptr++;
++
++ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ return ptr - start;
++}
++
++/* RFC6824 states that once a particular subflow mapping has been sent
++ * out it must never be changed. However, packets may be split while
++ * they are in the retransmission queue (due to SACK or ACKs) and that
++ * arguably means that we would change the mapping (e.g. it splits it,
++ * or sends out a subset of the initial mapping).
++ *
++ * Furthermore, the skb checksum is not always preserved across splits
++ * (e.g. mptcp_fragment) which would mean that we need to recompute
++ * the DSS checksum in this case.
++ *
++ * To avoid this we save the initial DSS mapping which allows us to
++ * send the same DSS mapping even for fragmented retransmits.
++ */
++static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
++{
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *ptr = (__be32 *)tcb->dss;
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
++}
++
++/* Write the saved DSS mapping to the header */
++static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
++ __be32 *ptr)
++{
++ __be32 *start = ptr;
++
++ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
++
++ /* update the data_ack */
++ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ /* dss is in a union with inet_skb_parm and
++ * the IP layer expects zeroed IPCB fields.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
++
++ return mptcp_dss_len/sizeof(*ptr);
++}
++
++static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb;
++ struct sk_buff *subskb = NULL;
++
++ if (!reinject)
++ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
++ MPTCPHDR_SEQ64_INDEX : 0);
++
++ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
++ if (!subskb)
++ return false;
++
++ /* At the subflow level we need to call tcp_init_tso_segs again. We
++ * force this by setting gso_segs to 0. It has been set to 1 prior to
++ * the call to mptcp_skb_entail.
++ */
++ skb_shinfo(subskb)->gso_segs = 0;
++
++ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
++
++ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
++ skb->ip_summed == CHECKSUM_PARTIAL) {
++ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
++ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
++ }
++
++ tcb = TCP_SKB_CB(subskb);
++
++ if (tp->mpcb->send_infinite_mapping &&
++ !tp->mpcb->infinite_mapping_snd &&
++ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
++ tp->mptcp->fully_established = 1;
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
++ tcb->mptcp_flags |= MPTCPHDR_INF;
++ }
++
++ if (mptcp_is_data_fin(subskb))
++ mptcp_combine_dfin(subskb, meta_sk, sk);
++
++ mptcp_save_dss_data_seq(tp, subskb);
++
++ tcb->seq = tp->write_seq;
++ tcb->sacked = 0; /* reset the sacked field: from the point of view
++ * of this subflow, we are sending a brand new
++ * segment
++ */
++ /* Take into account seg len */
++ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
++ tcb->end_seq = tp->write_seq;
++
++ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
++ * segment is not part of the subflow but on a meta-only-level.
++ */
++ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
++ tcp_add_write_queue_tail(sk, subskb);
++ sk->sk_wmem_queued += subskb->truesize;
++ sk_mem_charge(sk, subskb->truesize);
++ } else {
++ int err;
++
++ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
++ * skb->len = 0 will force tso_segs to 1.
++ */
++ tcp_init_tso_segs(sk, subskb, 1);
++ /* Empty data-fins are sent immediately on the subflow */
++ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
++ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
++
++ /* It has not been queued, we can free it now. */
++ kfree_skb(subskb);
++
++ if (err)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->second_packet = 1;
++ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
++ }
++
++ return true;
++}
++
++/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
++ * might need to undo some operations done by tcp_fragment.
++ */
++static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
++ gfp_t gfp, int reinject)
++{
++ int ret, diff, old_factor;
++ struct sk_buff *buff;
++ u8 flags;
++
++ if (skb_headlen(skb) < len)
++ diff = skb->len - len;
++ else
++ diff = skb->data_len;
++ old_factor = tcp_skb_pcount(skb);
++
++ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
++ * At the MPTCP-level we do not care about the absolute value. All we
++ * care about is that it is set to 1 for accurate packets_out
++ * accounting.
++ */
++ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
++ if (ret)
++ return ret;
++
++ buff = skb->next;
++
++ flags = TCP_SKB_CB(skb)->mptcp_flags;
++ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
++ TCP_SKB_CB(buff)->mptcp_flags = flags;
++ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
++
++ /* If reinject == 1, the buff will be added to the reinject
++ * queue, which is currently not part of memory accounting. So
++ * undo the changes done by tcp_fragment and update the
++ * reinject queue. Also, undo changes to the packet counters.
++ */
++ if (reinject == 1) {
++ int undo = buff->truesize - diff;
++ meta_sk->sk_wmem_queued -= undo;
++ sk_mem_uncharge(meta_sk, undo);
++
++ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
++ meta_sk->sk_write_queue.qlen--;
++
++ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
++ undo = old_factor - tcp_skb_pcount(skb) -
++ tcp_skb_pcount(buff);
++ if (undo)
++ tcp_adjust_pcount(meta_sk, skb, -undo);
++ }
++ }
++
++ return 0;
++}
++
++/* Inspired by tcp_write_wakeup */
++int mptcp_write_wakeup(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++ struct sock *sk_it;
++ int ans = 0;
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return -1;
++
++ skb = tcp_send_head(meta_sk);
++ if (skb &&
++ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
++ unsigned int mss;
++ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
++ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
++ struct tcp_sock *subtp;
++ if (!subsk)
++ goto window_probe;
++ subtp = tcp_sk(subsk);
++ mss = tcp_current_mss(subsk);
++
++ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
++ tcp_wnd_end(subtp) - subtp->write_seq);
++
++ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
++ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We are probing the opening of a window
++ * but the window size is != 0;
++ * this must have been a result of SWS avoidance (sender)
++ */
++ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
++ skb->len > mss) {
++ seg_size = min(seg_size, mss);
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (mptcp_fragment(meta_sk, skb, seg_size,
++ GFP_ATOMIC, 0))
++ return -1;
++ } else if (!tcp_skb_pcount(skb)) {
++ /* see mptcp_write_xmit on why we use UINT_MAX */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++ }
++
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (!mptcp_skb_entail(subsk, skb, 0))
++ return -1;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++
++ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
++
++ return 0;
++ } else {
++window_probe:
++ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
++ meta_tp->snd_una + 0xFFFF)) {
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send_ack(sk_it))
++ tcp_xmit_probe_skb(sk_it, 1);
++ }
++ }
++
++ /* At least one of the tcp_xmit_probe_skb's has to succeed */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ int ret;
++
++ if (!mptcp_sk_can_send_ack(sk_it))
++ continue;
++
++ ret = tcp_xmit_probe_skb(sk_it, 0);
++ if (unlikely(ret > 0))
++ ans = ret;
++ }
++ return ans;
++ }
++}
++
++bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
++ struct sock *subsk = NULL;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ unsigned int sent_pkts;
++ int reinject = 0;
++ unsigned int sublimit;
++
++ sent_pkts = 0;
++
++ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
++ &sublimit))) {
++ unsigned int limit;
++
++ subtp = tcp_sk(subsk);
++ mss_now = tcp_current_mss(subsk);
++
++ if (reinject == 1) {
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ /* Segment already reached the peer, take the next one */
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ continue;
++ }
++ }
++
++ /* If the segment was cloned (e.g. a meta retransmission),
++ * the header must be expanded/copied so that there is no
++ * corruption of TSO information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC))
++ break;
++
++ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
++ break;
++
++ /* Force tso_segs to 1 by using UINT_MAX.
++ * We actually don't care about the exact number of segments
++ * emitted on the subflow. We need just to set tso_segs, because
++ * we still need an accurate packets_out count in
++ * tcp_event_new_data_sent.
++ */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++
++ /* Check for nagle, regardless of tso_segs. If the segment is
++ * actually larger than mss_now (TSO segment), then
++ * tcp_nagle_check will have partial == false and always trigger
++ * the transmission.
++ * tcp_write_xmit has a TSO-level nagle check which is not
++ * subject to the MPTCP-level. It is based on the properties of
++ * the subflow, not the MPTCP-level.
++ */
++ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
++ (tcp_skb_is_last(meta_sk, skb) ?
++ nonagle : TCP_NAGLE_PUSH))))
++ break;
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ /* We limit the size of the skb so that it fits into the
++ * window. Call tcp_mss_split_point to avoid duplicating
++ * code.
++ * We really only care about fitting the skb into the
++ * window. That's why we use UINT_MAX. If the skb does
++ * not fit into the cwnd_quota or the NIC's max-segs
++ * limitation, it will be split by the subflow's
++ * tcp_write_xmit which does the appropriate call to
++ * tcp_mss_split_point.
++ */
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ nonagle);
++
++ if (sublimit)
++ limit = min(limit, sublimit);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
++ break;
++
++ if (!mptcp_skb_entail(subsk, skb, reinject))
++ break;
++ /* Nagle is handled at the MPTCP-layer, so
++ * always push on the subflow
++ */
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ if (!reinject) {
++ mptcp_check_sndseq_wrap(meta_tp,
++ TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++ }
++
++ tcp_minshall_update(meta_tp, mss_now, skb);
++ sent_pkts += tcp_skb_pcount(skb);
++
++ if (reinject > 0) {
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ kfree_skb(skb);
++ }
++
++ if (push_one)
++ break;
++ }
++
++ return !meta_tp->packets_out && tcp_send_head(meta_sk);
++}
++
++void mptcp_write_space(struct sock *sk)
++{
++ mptcp_push_pending_frames(mptcp_meta_sk(sk));
++}
++
++u32 __mptcp_select_window(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ int mss, free_space, full_space, window;
++
++ /* MSS for the peer's data. Previous versions used mss_clamp
++ * here. I don't know if the value based on our guesses
++ * of peer's MSS is better for the performance. It's more correct
++ * but may be worse for the performance because of rcv_mss
++ * fluctuations. --SAW 1998/11/1
++ */
++ mss = icsk->icsk_ack.rcv_mss;
++ free_space = tcp_space(sk);
++ full_space = min_t(int, meta_tp->window_clamp,
++ tcp_full_space(sk));
++
++ if (mss > full_space)
++ mss = full_space;
++
++ if (free_space < (full_space >> 1)) {
++ icsk->icsk_ack.quick = 0;
++
++ if (tcp_memory_pressure)
++ /* TODO this has to be adapted when we support different
++ * MSS's among the subflows.
++ */
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
++ 4U * meta_tp->advmss);
++
++ if (free_space < mss)
++ return 0;
++ }
++
++ if (free_space > meta_tp->rcv_ssthresh)
++ free_space = meta_tp->rcv_ssthresh;
++
++ /* Don't do rounding if we are using window scaling, since the
++ * scaled window will not line up with the MSS boundary anyway.
++ */
++ window = meta_tp->rcv_wnd;
++ if (tp->rx_opt.rcv_wscale) {
++ window = free_space;
++
++ /* Advertise enough space so that it won't get scaled away.
++ * Important case: prevent zero window announcement if
++ * 1<<rcv_wscale > mss.
++ */
++ if (((window >> tp->rx_opt.rcv_wscale) << tp->
++ rx_opt.rcv_wscale) != window)
++ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
++ << tp->rx_opt.rcv_wscale);
++ } else {
++ /* Get the largest window that is a nice multiple of mss.
++ * Window clamp already applied above.
++ * If our current window offering is within 1 mss of the
++ * free space we just keep it. This prevents the divide
++ * and multiply from happening most of the time.
++ * We also don't do any window rounding when the free space
++ * is too small.
++ */
++ if (window <= free_space - mss || window > free_space)
++ window = (free_space / mss) * mss;
++ else if (mss == full_space &&
++ free_space > window + (full_space >> 1))
++ window = free_space;
++ }
++
++ return window;
++}
++
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++
++ opts->options |= OPTION_MPTCP;
++ if (is_master_tp(tp)) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ opts->mp_capable.sender_key = tp->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum;
++ } else {
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
++ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
++ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
++ opts->addr_id = tp->mptcp->loc_id;
++ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
++ }
++}
++
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts, unsigned *remaining)
++{
++ struct mptcp_request_sock *mtreq;
++ mtreq = mptcp_rsk(req);
++
++ opts->options |= OPTION_MPTCP;
++ /* MPCB not yet set - thus it's a new MPTCP-session */
++ if (!mtreq->is_sub) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
++ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ } else {
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
++ opts->mp_join_syns.sender_truncated_mac =
++ mtreq->mptcp_hash_tmac;
++ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
++ opts->mp_join_syns.low_prio = mtreq->low_prio;
++ opts->addr_id = mtreq->loc_id;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
++ }
++}
++
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
++
++ /* We are coming from tcp_current_mss with the meta_sk as an argument.
++ * It does not make sense to check for the options, because when the
++ * segment gets sent, another subflow will be chosen.
++ */
++ if (!skb && is_meta_sk(sk))
++ return;
++
++ /* In fallback mp_fail-mode, we have to repeat it until the fallback
++ * has been done by the sender
++ */
++ if (unlikely(tp->mptcp->send_mp_fail)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FAIL;
++ *size += MPTCP_SUB_LEN_FAIL;
++ return;
++ }
++
++ if (unlikely(tp->send_mp_fclose)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FCLOSE;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
++ return;
++ }
++
++ /* 1. If we are the sender of the infinite-mapping, we need the
++ * MPTCPHDR_INF-flag, because a retransmission of the
++ * infinite-announcement still needs the mptcp-option.
++ *
++ * We need infinite_cutoff_seq, because retransmissions from before
++ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
++ * consistent.
++ *
++ * 2. If we are the receiver of the infinite-mapping, we always skip
++ * mptcp-options, because acknowledgments from before the
++ * infinite-mapping point have already been sent out.
++ *
++ * I know, the whole infinite-mapping stuff is ugly...
++ *
++ * TODO: Handle wrapped data-sequence numbers
++ * (even if it's very unlikely)
++ */
++ if (unlikely(mpcb->infinite_mapping_snd) &&
++ ((mpcb->send_infinite_mapping && tcb &&
++ mptcp_is_data_seq(skb) &&
++ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
++ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
++ !mpcb->send_infinite_mapping))
++ return;
++
++ if (unlikely(tp->mptcp->include_mpc)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_CAPABLE |
++ OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
++ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ opts->dss_csum = mpcb->dss_csum;
++
++ if (skb)
++ tp->mptcp->include_mpc = 0;
++ }
++ if (unlikely(tp->mptcp->pre_established)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
++ }
++
++ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_DATA_ACK;
++ /* If !skb, we come from tcp_current_mss and thus we always
++ * assume that the DSS-option will be set for the data-packet.
++ */
++ if (skb && !mptcp_is_data_seq(skb)) {
++ *size += MPTCP_SUB_LEN_ACK_ALIGN;
++ } else {
++ /* Doesn't matter whether the csum is included or not. It will be
++ * either 10 or 12, and thus aligned = 12
++ */
++ *size += MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++ }
++
++ *size += MPTCP_SUB_LEN_DSS_ALIGN;
++ }
++
++ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
++ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
++
++ if (unlikely(tp->mptcp->send_mp_prio) &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_PRIO;
++ if (skb)
++ tp->mptcp->send_mp_prio = 0;
++ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
++ }
++
++ return;
++}
++
++u16 mptcp_select_window(struct sock *sk)
++{
++ u16 new_win = tcp_select_window(sk);
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
++
++ meta_tp->rcv_wnd = tp->rcv_wnd;
++ meta_tp->rcv_wup = meta_tp->rcv_nxt;
++
++ return new_win;
++}
++
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
++ struct mp_capable *mpc = (struct mp_capable *)ptr;
++
++ mpc->kind = TCPOPT_MPTCP;
++
++ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
++ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
++ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->receiver_key = opts->mp_capable.receiver_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
++ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
++ }
++
++ mpc->sub = MPTCP_SUB_CAPABLE;
++ mpc->ver = 0;
++ mpc->a = opts->dss_csum;
++ mpc->b = 0;
++ mpc->rsv = 0;
++ mpc->h = 1;
++ }
++
++ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
++ struct mp_join *mpj = (struct mp_join *)ptr;
++
++ mpj->kind = TCPOPT_MPTCP;
++ mpj->sub = MPTCP_SUB_JOIN;
++ mpj->rsv = 0;
++
++ if (OPTION_TYPE_SYN & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
++ mpj->u.syn.token = opts->mp_join_syns.token;
++ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
++ mpj->u.synack.mac =
++ opts->mp_join_syns.sender_truncated_mac;
++ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
++ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
++ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
++ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ mpadd->kind = TCPOPT_MPTCP;
++ if (opts->add_addr_v4) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 4;
++ mpadd->addr_id = opts->add_addr4.addr_id;
++ mpadd->u.v4.addr = opts->add_addr4.addr;
++ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
++ } else if (opts->add_addr_v6) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 6;
++ mpadd->addr_id = opts->add_addr6.addr_id;
++ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
++ sizeof(mpadd->u.v6.addr));
++ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ u8 *addrs_id;
++ int id, len, len_align;
++
++ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
++ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
++
++ mprem->kind = TCPOPT_MPTCP;
++ mprem->len = len;
++ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
++ mprem->rsv = 0;
++ addrs_id = &mprem->addrs_id;
++
++ mptcp_for_each_bit_set(opts->remove_addrs, id)
++ *(addrs_id++) = id;
++
++ /* Fill the rest with NOP's */
++ if (len_align > len) {
++ int i;
++ for (i = 0; i < len_align - len; i++)
++ *(addrs_id++) = TCPOPT_NOP;
++ }
++
++ ptr += len_align >> 2;
++ }
++ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
++ struct mp_fail *mpfail = (struct mp_fail *)ptr;
++
++ mpfail->kind = TCPOPT_MPTCP;
++ mpfail->len = MPTCP_SUB_LEN_FAIL;
++ mpfail->sub = MPTCP_SUB_FAIL;
++ mpfail->rsv1 = 0;
++ mpfail->rsv2 = 0;
++ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
++
++ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
++ }
++ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
++ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
++
++ mpfclose->kind = TCPOPT_MPTCP;
++ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
++ mpfclose->sub = MPTCP_SUB_FCLOSE;
++ mpfclose->rsv1 = 0;
++ mpfclose->rsv2 = 0;
++ mpfclose->key = opts->mp_capable.receiver_key;
++
++ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
++ }
++
++ if (OPTION_DATA_ACK & opts->mptcp_options) {
++ if (!mptcp_is_data_seq(skb))
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ else
++ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
++ }
++ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
++ struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ mpprio->kind = TCPOPT_MPTCP;
++ mpprio->len = MPTCP_SUB_LEN_PRIO;
++ mpprio->sub = MPTCP_SUB_PRIO;
++ mpprio->rsv = 0;
++ mpprio->b = tp->mptcp->low_prio;
++ mpprio->addr_id = TCPOPT_NOP;
++
++ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
++ }
++}
++
++/* Sends the DATA_FIN */
++void mptcp_send_fin(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
++ int mss_now;
++
++ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
++ meta_tp->mpcb->passive_close = 1;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = mptcp_current_mss(meta_sk);
++
++ if (tcp_send_head(meta_sk) != NULL) {
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ meta_tp->write_seq++;
++ } else {
++ /* Socket is locked, keep trying until memory is available. */
++ for (;;) {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER,
++ meta_sk->sk_allocation);
++ if (skb)
++ break;
++ yield();
++ }
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++
++ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
++ TCP_SKB_CB(skb)->end_seq++;
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ tcp_queue_skb(meta_sk, skb);
++ }
++ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
++}
++
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
++
++ if (!mpcb->cnt_subflows)
++ return;
++
++ WARN_ON(meta_tp->send_mp_fclose);
++
++ /* First - select a socket */
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ /* May happen if no subflow is in an appropriate state */
++ if (!sk)
++ return;
++
++ /* We are in infinite mode - just send a reset */
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
++ sk->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk->sk_state))
++ tcp_send_active_reset(sk, priority);
++ mptcp_sub_force_close(sk);
++ return;
++ }
++
++
++ tcp_sk(sk)->send_mp_fclose = 1;
++ /* Reset all other subflows */
++
++ /* tcp_done must be handled with bh disabled */
++ if (!in_serving_softirq())
++ local_bh_disable();
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_send_active_reset(sk_it, GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++
++ if (!in_serving_softirq())
++ local_bh_enable();
++
++ tcp_send_ack(sk);
++ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
++
++ meta_tp->send_mp_fclose = 1;
++}
++
++static void mptcp_ack_retransmit_timer(struct sock *sk)
++{
++ struct sk_buff *skb;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
++ goto out; /* Routing failure or similar */
++
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk)) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++ goto out;
++ }
++
++ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (skb == NULL) {
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ /* Reserve space for headers and prepare control bits */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
++
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!icsk->icsk_retransmits)
++ icsk->icsk_retransmits = 1;
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++
++ icsk->icsk_retransmits++;
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
++ __sk_dst_reset(sk);
++
++out:;
++}
++
++void mptcp_ack_handler(unsigned long data)
++{
++ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later */
++ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
++ jiffies + (HZ / 20));
++ goto out_unlock;
++ }
++
++ if (sk->sk_state == TCP_CLOSE)
++ goto out_unlock;
++ if (!tcp_sk(sk)->mptcp->pre_established)
++ goto out_unlock;
++
++ mptcp_ack_retransmit_timer(sk);
++
++ sk_mem_reclaim(sk);
++
++out_unlock:
++ bh_unlock_sock(meta_sk);
++ sock_put(sk);
++}
++
++/* Similar to tcp_retransmit_skb
++ *
++ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
++ * meta-level.
++ */
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *subsk;
++ unsigned int limit, mss_now;
++ int err = -1;
++
++ /* Do not send more than we queued. 1/4 is reserved for possible
++ * copying overhead: fragmentation, tunneling, mangling etc.
++ *
++ * This is a meta-retransmission thus we check on the meta-socket.
++ */
++ if (atomic_read(&meta_sk->sk_wmem_alloc) >
++ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
++ return -EAGAIN;
++ }
++
++ /* We need to make sure that the retransmitted segment can be sent on a
++ * subflow right now. If it is too big, it needs to be fragmented.
++ */
++ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
++ if (!subsk) {
++ /* We want to increase icsk_retransmits, thus return 0, so that
++ * mptcp_retransmit_timer enters the desired branch.
++ */
++ err = 0;
++ goto failed;
++ }
++ mss_now = tcp_current_mss(subsk);
++
++ /* If the segment was cloned (e.g. a meta retransmission), the header
++ * must be expanded/copied so that there is no corruption of TSO
++ * information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC)) {
++ err = -ENOMEM;
++ goto failed;
++ }
++
++ /* Must have been set by mptcp_write_xmit before */
++ BUG_ON(!tcp_skb_pcount(skb));
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ TCP_NAGLE_OFF);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit,
++ GFP_ATOMIC, 0)))
++ goto failed;
++
++ if (!mptcp_skb_entail(subsk, skb, -1))
++ goto failed;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ /* Update global TCP statistics. */
++ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
++
++ /* Diff to tcp_retransmit_skb */
++
++ /* Save stamp of the first retransmit. */
++ if (!meta_tp->retrans_stamp)
++ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
++
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++
++ return 0;
++
++failed:
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
++ return err;
++}
++
++/* Similar to tcp_retransmit_timer
++ *
++ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
++ * and that we don't have an srtt estimation at the meta-level.
++ */
++void mptcp_retransmit_timer(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ int err;
++
++ /* In fallback, retransmission is handled at the subflow-level */
++ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping)
++ return;
++
++ WARN_ON(tcp_write_queue_empty(meta_sk));
++
++ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
++ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
++ /* Receiver dastardly shrinks window. Our retransmits
++ * become zero probes, but we should not timeout this
++ * connection. If the socket is an orphan, time it out,
++ * we cannot allow such beasts to hang infinitely.
++ */
++ struct inet_sock *meta_inet = inet_sk(meta_sk);
++ if (meta_sk->sk_family == AF_INET) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_inet->inet_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (meta_sk->sk_family == AF_INET6) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_sk->sk_v6_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#endif
++ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
++ tcp_write_err(meta_sk);
++ return;
++ }
++
++ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ goto out_reset_timer;
++ }
++
++ if (tcp_write_timeout(meta_sk))
++ return;
++
++ if (meta_icsk->icsk_retransmits == 0)
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
++
++ meta_icsk->icsk_ca_state = TCP_CA_Loss;
++
++ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ if (err > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!meta_icsk->icsk_retransmits)
++ meta_icsk->icsk_retransmits = 1;
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
++ TCP_RTO_MAX);
++ return;
++ }
++
++ /* Increase the timeout each time we retransmit. Note that
++ * we do not increase the rtt estimate. rto is initialized
++ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
++ * that doubling rto each time is the least we can get away with.
++ * In KA9Q, Karn uses this for the first few times, and then
++ * goes to quadratic. netBSD doubles, but only goes up to *64,
++ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
++ * defined in the protocol as the maximum possible RTT. I guess
++ * we'll have to use something other than TCP to talk to the
++ * University of Mars.
++ *
++ * PAWS allows us longer timeouts and large windows, so once
++ * implemented ftp to mars will work nicely. We will have to fix
++ * the 120 second clamps though!
++ */
++ meta_icsk->icsk_backoff++;
++ meta_icsk->icsk_retransmits++;
++
++out_reset_timer:
++ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
++ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
++ * might be increased if the stream oscillates between thin and thick,
++ * thus the old value might already be too high compared to the value
++ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
++ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
++ * exponential backoff behaviour, to avoid continuing to hammer
++ * linear-timeout retransmissions into a black hole
++ */
++ if (meta_sk->sk_state == TCP_ESTABLISHED &&
++ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
++ tcp_stream_is_thin(meta_tp) &&
++ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
++ meta_icsk->icsk_backoff = 0;
++ /* We cannot do the same as in tcp_write_timer because the
++ * srtt is not set here.
++ */
++ mptcp_set_rto(meta_sk);
++ } else {
++ /* Use normal (exponential) backoff */
++ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
++ }
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
++
++ return;
++}
++
++/* Modify values to an mptcp-level for the initial window of new subflows */
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ *window_clamp = mpcb->orig_window_clamp;
++ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
++
++ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
++ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
++}
++
++static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ struct sock *sk;
++ u64 rate = 0;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ /* Do not consider subflows without an RTT estimate yet,
++ * otherwise this_rate >>> rate.
++ */
++ if (unlikely(!tp->srtt_us))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* If this_mss is smaller than mss, it means that a segment will
++ * be split into two (or more) when pushed on this subflow. If
++ * you consider that mss = 1428 and this_mss = 1420 then two
++ * segments will be generated: a 1420-byte and 8-byte segment.
++ * The latter will introduce a large overhead as for a single
++ * data segment 2 slots will be used in the congestion window.
++ * Therefore reducing by ~2 the potential throughput of this
++ * subflow. Indeed, 1428 will be sent while 2840 could have been
++ * sent if mss == 1420 reducing the throughput by 2840 / 1428.
++ *
++ * The following algorithm takes this overhead into account
++ * when computing the potential throughput that MPTCP can
++ * achieve when generating mss-byte segments.
++ *
++ * The formula is the following:
++ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
++ * Where ratio is computed as follows:
++ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
++ *
++ * ratio gives the reduction factor of the theoretical
++ * throughput a subflow can achieve if MPTCP uses a specific
++ * MSS value.
++ */
++ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
++ max(tp->snd_cwnd, tp->packets_out),
++ (u64)tp->srtt_us *
++ DIV_ROUND_UP(mss, this_mss) * this_mss);
++ rate += this_rate;
++ }
++
++ return rate;
++}
++
++static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ unsigned int mss = 0;
++ u64 rate = 0;
++ struct sock *sk;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* Same mss values will produce the same throughput. */
++ if (this_mss == mss)
++ continue;
++
++ /* See whether using this mss value can theoretically improve
++ * the performance.
++ */
++ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
++ if (this_rate >= rate) {
++ mss = this_mss;
++ rate = this_rate;
++ }
++ }
++
++ return mss;
++}
++
++unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
++
++ /* If no subflow is available, we take a default-mss from the
++ * meta-socket.
++ */
++ return !mss ? tcp_current_mss(meta_sk) : mss;
++}
++
++static unsigned int mptcp_select_size_mss(struct sock *sk)
++{
++ return tcp_sk(sk)->mss_cache;
++}
++
++int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
++
++ if (sg) {
++ if (mptcp_sk_can_gso(meta_sk)) {
++ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
++ } else {
++ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
++
++ if (mss >= pgbreak &&
++ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
++ mss = pgbreak;
++ }
++ }
++
++ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
++}
++
++int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ const struct sock *sk;
++ u32 rtt_max = tp->srtt_us;
++ u64 bw_est;
++
++ if (!tp->srtt_us)
++ return tp->reordering + 1;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->srtt_us)
++ rtt_max = tcp_sk(sk)->srtt_us;
++ }
++
++ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
++ (u64)tp->srtt_us);
++
++ return max_t(unsigned int, (u32)(bw_est >> 16),
++ tp->reordering + 1);
++}
++
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed)
++{
++ struct sock *sk;
++ u32 xmit_size_goal = 0;
++
++ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_size_goal;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
++ if (this_size_goal > xmit_size_goal)
++ xmit_size_goal = this_size_goal;
++ }
++ }
++
++ return max(xmit_size_goal, mss_now);
++}
++
++/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ if (skb_cloned(skb)) {
++ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
++ return -ENOMEM;
++ }
++
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++
++ skb->truesize -= len;
++ sk->sk_wmem_queued -= len;
++ sk_mem_uncharge(sk, len);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
++
++ return 0;
++}
+diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
+new file mode 100644
+index 000000000000..9542f950729f
+--- /dev/null
++++ b/net/mptcp/mptcp_pm.c
+@@ -0,0 +1,169 @@
++/*
++ * MPTCP implementation - MPTCP-subflow-management
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_pm_list_lock);
++static LIST_HEAD(mptcp_pm_list);
++
++static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++struct mptcp_pm_ops mptcp_pm_default = {
++ .get_local_id = mptcp_default_id, /* We do not care */
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
++{
++ struct mptcp_pm_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
++{
++ int ret = 0;
++
++ if (!pm->get_local_id)
++ return -EINVAL;
++
++ spin_lock(&mptcp_pm_list_lock);
++ if (mptcp_pm_find(pm->name)) {
++ pr_notice("%s already registered\n", pm->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
++ pr_info("%s registered\n", pm->name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
++
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
++{
++ spin_lock(&mptcp_pm_list_lock);
++ list_del_rcu(&pm->list);
++ spin_unlock(&mptcp_pm_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
++
++void mptcp_get_default_path_manager(char *name)
++{
++ struct mptcp_pm_ops *pm;
++
++ BUG_ON(list_empty(&mptcp_pm_list));
++
++ rcu_read_lock();
++ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
++ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_path_manager(const char *name)
++{
++ struct mptcp_pm_ops *pm;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++#ifdef CONFIG_MODULES
++ if (!pm && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_pm_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++ }
++#endif
++
++ if (pm) {
++ list_move(&pm->list, &mptcp_pm_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_path_manager(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
++ if (try_module_get(pm->owner)) {
++ mpcb->pm_ops = pm;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->pm_ops->owner);
++}
++
++/* Fallback to the default path-manager. */
++void mptcp_fallback_default(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ mptcp_cleanup_path_manager(mpcb);
++ pm = mptcp_pm_find("default");
++
++ /* Cannot fail - it's the default module */
++ try_module_get(pm->owner);
++ mpcb->pm_ops = pm;
++}
++EXPORT_SYMBOL_GPL(mptcp_fallback_default);
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_path_manager_default(void)
++{
++ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
++}
++late_initcall(mptcp_path_manager_default);
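The path-manager registry above keeps all registered entries on one list and treats the list head as the default: `mptcp_set_default_path_manager()` just moves the chosen entry to the front, which is exactly what `mptcp_get_default_path_manager()` reads back. The following is a minimal userspace sketch of that convention, not the kernel code; a plain singly linked list and invented names (`ops_find`, `ops_set_default`) stand in for the kernel's RCU list and locking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ops { const char *name; struct ops *next; };

/* Linear lookup by name, as in mptcp_pm_find(). */
static struct ops *ops_find(struct ops *head, const char *name)
{
	for (; head; head = head->next)
		if (strcmp(head->name, name) == 0)
			return head;
	return NULL;
}

/* Move the named entry to the head of the list, making it the default.
 * Returns 0 on success, -1 if no entry with that name is registered. */
static int ops_set_default(struct ops **head, const char *name)
{
	struct ops **pp, *e;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if (strcmp((*pp)->name, name) == 0) {
			e = *pp;
			*pp = e->next;	/* unlink from current position */
			e->next = *head;	/* push to the front */
			*head = e;
			return 0;
		}
	}
	return -1;
}
```

The same pattern is reused for the scheduler list later in this patch; only the lock, the RCU list primitives, and the `request_module()` fallback differ.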
+diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
+new file mode 100644
+index 000000000000..93278f684069
+--- /dev/null
++++ b/net/mptcp/mptcp_rr.c
+@@ -0,0 +1,301 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static unsigned char num_segments __read_mostly = 1;
++module_param(num_segments, byte, 0644);
++MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
++
++static bool cwnd_limited __read_mostly = 1;
++module_param(cwnd_limited, bool, 0644);
++MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
++
++struct rrsched_priv {
++ unsigned char quota;
++};
++
++static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test, bool cwnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ if (!cwnd_test)
++ goto zero_wnd_test;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++zero_wnd_test:
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* We just look for any subflow that is available */
++static struct sock *rr_get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ continue;
++
++ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ bestsk = sk;
++ }
++
++ if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb)
++ *reinject = 1;
++ else
++ skb = tcp_send_head(meta_sk);
++ return skb;
++}
++
++static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk_it, *choose_sk = NULL;
++ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
++ unsigned char split = num_segments;
++ unsigned char iter = 0, full_subs = 0;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ if (*reinject) {
++ *subsk = rr_get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ return skb;
++ }
++
++retry:
++
++	/* First, we look for a subflow that is currently being used */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ iter++;
++
++ /* Is this subflow currently being used? */
++ if (rsp->quota > 0 && rsp->quota < num_segments) {
++ split = num_segments - rsp->quota;
++ choose_sk = sk_it;
++ goto found;
++ }
++
++ /* Or, it's totally unused */
++ if (!rsp->quota) {
++ split = num_segments;
++ choose_sk = sk_it;
++ }
++
++ /* Or, it must then be fully used */
++ if (rsp->quota == num_segments)
++ full_subs++;
++ }
++
++ /* All considered subflows have a full quota, and we considered at
++ * least one.
++ */
++ if (iter && iter == full_subs) {
++ /* So, we restart this round by setting quota to 0 and retry
++ * to find a subflow.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ rsp->quota = 0;
++ }
++
++ goto retry;
++ }
++
++found:
++ if (choose_sk) {
++ unsigned int mss_now;
++ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
++ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
++
++ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
++ return NULL;
++
++ *subsk = choose_sk;
++ mss_now = tcp_current_mss(*subsk);
++ *limit = split * mss_now;
++
++ if (skb->len > mss_now)
++ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
++ else
++ rsp->quota++;
++
++ return skb;
++ }
++
++ return NULL;
++}
++
++static struct mptcp_sched_ops mptcp_sched_rr = {
++ .get_subflow = rr_get_available_subflow,
++ .next_segment = mptcp_rr_next_segment,
++ .name = "roundrobin",
++ .owner = THIS_MODULE,
++};
++
++static int __init rr_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
++
++ if (mptcp_register_scheduler(&mptcp_sched_rr))
++ return -1;
++
++ return 0;
++}
++
++static void rr_unregister(void)
++{
++ mptcp_unregister_scheduler(&mptcp_sched_rr);
++}
++
++module_init(rr_register);
++module_exit(rr_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
++MODULE_VERSION("0.89");
+diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
+new file mode 100644
+index 000000000000..6c7ff4eceac1
+--- /dev/null
++++ b/net/mptcp/mptcp_sched.c
+@@ -0,0 +1,493 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_sched_list_lock);
++static LIST_HEAD(mptcp_sched_list);
++
++struct defsched_priv {
++ u32 last_rbuf_opti;
++};
++
++static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int mss_now, space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ /* If TSQ is already throttling us, do not send on this subflow. When
++ * TSQ gets cleared the subflow becomes eligible again.
++ */
++ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
++ return false;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ mss_now = tcp_current_mss(sk);
++
++ /* Don't send on this subflow if we bypass the allowed send-window at
++ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
++ * calculated end_seq (because here at this point end_seq is still at
++ * the meta-level).
++ */
++ if (skb && !zero_wnd_test &&
++ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* This is the scheduler. This function decides on which flow to send
++ * a given MSS. If all subflows are found to be busy, NULL is returned.
++ * The flow is selected based on the shortest RTT.
++ * If all paths have full cong windows, we simply return NULL.
++ *
++ * Additionally, this function is aware of the backup-subflows.
++ */
++static struct sock *get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
++ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
++ int cnt_backups = 0;
++
++ /* if there is only one subflow, bypass the scheduling function */
++ if (mpcb->cnt_subflows == 1) {
++ bestsk = (struct sock *)mpcb->connection_list;
++ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
++ bestsk = NULL;
++ return bestsk;
++ }
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_is_available(sk, skb, zero_wnd_test))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
++ cnt_backups++;
++
++ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < lowprio_min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ lowprio_min_time_to_peer = tp->srtt_us;
++ lowpriosk = sk;
++ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ min_time_to_peer = tp->srtt_us;
++ bestsk = sk;
++ }
++ }
++
++ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
++ sk = lowpriosk;
++ } else if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
++{
++ struct sock *meta_sk;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp_it;
++ struct sk_buff *skb_head;
++ struct defsched_priv *dsp = defsched_get_priv(tp);
++
++ if (tp->mpcb->cnt_subflows == 1)
++ return NULL;
++
++ meta_sk = mptcp_meta_sk(sk);
++ skb_head = tcp_write_queue_head(meta_sk);
++
++ if (!skb_head || skb_head == tcp_send_head(meta_sk))
++ return NULL;
++
++	/* If penalization is optional (coming from mptcp_next_segment()) and
++	 * we are not send-buffer-limited, we do not penalize. The retransmission
++ * is just an optimization to fix the idle-time due to the delay before
++ * we wake up the application.
++ */
++ if (!penal && sk_stream_memory_free(meta_sk))
++ goto retrans;
++
++ /* Only penalize again after an RTT has elapsed */
++ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
++ goto retrans;
++
++ /* Half the cwnd of the slow flow */
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
++ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++ }
++ break;
++ }
++ }
++
++retrans:
++
++ /* Segment not yet injected into this path? Take it!!! */
++ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
++ bool do_retrans = false;
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp_it->snd_cwnd <= 4) {
++ do_retrans = true;
++ break;
++ }
++
++ if (4 * tp->srtt_us >= tp_it->srtt_us) {
++ do_retrans = false;
++ break;
++ } else {
++ do_retrans = true;
++ }
++ }
++ }
++
++ if (do_retrans && mptcp_is_available(sk, skb_head, false))
++ return skb_head;
++ }
++ return NULL;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb) {
++ *reinject = 1;
++ } else {
++ skb = tcp_send_head(meta_sk);
++
++ if (!skb && meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
++ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
++ struct sock *subsk = get_available_subflow(meta_sk, NULL,
++ false);
++ if (!subsk)
++ return NULL;
++
++ skb = mptcp_rcv_buf_optimization(subsk, 0);
++ if (skb)
++ *reinject = -1;
++ }
++ }
++ return skb;
++}
++
++static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
++ unsigned int mss_now;
++ struct tcp_sock *subtp;
++ u16 gso_max_segs;
++ u32 max_len, max_segs, window, needed;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ *subsk = get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ subtp = tcp_sk(*subsk);
++ mss_now = tcp_current_mss(*subsk);
++
++ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
++ skb = mptcp_rcv_buf_optimization(*subsk, 1);
++ if (skb)
++ *reinject = -1;
++ else
++ return NULL;
++ }
++
++ /* No splitting required, as we will only send one single segment */
++ if (skb->len <= mss_now)
++ return skb;
++
++ /* The following is similar to tcp_mss_split_point, but
++	 * we do not care about nagle, because we will anyway
++ * use TCP_NAGLE_PUSH, which overrides this.
++ *
++ * So, we first limit according to the cwnd/gso-size and then according
++ * to the subflow's window.
++ */
++
++ gso_max_segs = (*subsk)->sk_gso_max_segs;
++ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
++ gso_max_segs = 1;
++ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
++ if (!max_segs)
++ return NULL;
++
++ max_len = mss_now * max_segs;
++ window = tcp_wnd_end(subtp) - subtp->write_seq;
++
++ needed = min(skb->len, window);
++ if (max_len <= skb->len)
++ /* Take max_win, which is actually the cwnd/gso-size */
++ *limit = max_len;
++ else
++ /* Or, take the window */
++ *limit = needed;
++
++ return skb;
++}
++
++static void defsched_init(struct sock *sk)
++{
++ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++}
++
++struct mptcp_sched_ops mptcp_sched_default = {
++ .get_subflow = get_available_subflow,
++ .next_segment = mptcp_next_segment,
++ .init = defsched_init,
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
++{
++ struct mptcp_sched_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
++{
++ int ret = 0;
++
++ if (!sched->get_subflow || !sched->next_segment)
++ return -EINVAL;
++
++ spin_lock(&mptcp_sched_list_lock);
++ if (mptcp_sched_find(sched->name)) {
++ pr_notice("%s already registered\n", sched->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
++ pr_info("%s registered\n", sched->name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
++
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
++{
++ spin_lock(&mptcp_sched_list_lock);
++ list_del_rcu(&sched->list);
++ spin_unlock(&mptcp_sched_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
++
++void mptcp_get_default_scheduler(char *name)
++{
++ struct mptcp_sched_ops *sched;
++
++ BUG_ON(list_empty(&mptcp_sched_list));
++
++ rcu_read_lock();
++ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
++ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_scheduler(const char *name)
++{
++ struct mptcp_sched_ops *sched;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++#ifdef CONFIG_MODULES
++ if (!sched && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_sched_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++ }
++#endif
++
++ if (sched) {
++ list_move(&sched->list, &mptcp_sched_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_scheduler(struct mptcp_cb *mpcb)
++{
++ struct mptcp_sched_ops *sched;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
++ if (try_module_get(sched->owner)) {
++ mpcb->sched_ops = sched;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->sched_ops->owner);
++}
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_scheduler_default(void)
++{
++ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
++
++ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
++}
++late_initcall(mptcp_scheduler_default);
+diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
+new file mode 100644
+index 000000000000..29ca1d868d17
+--- /dev/null
++++ b/net/mptcp/mptcp_wvegas.c
+@@ -0,0 +1,268 @@
++/*
++ * MPTCP implementation - WEIGHTED VEGAS
++ *
++ * Algorithm design:
++ * Yu Cao <cyAnalyst@126.com>
++ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
++ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
++ *
++ * Implementation:
++ * Yu Cao <cyAnalyst@126.com>
++ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++#include <linux/module.h>
++#include <linux/tcp.h>
++
++static int initial_alpha = 2;
++static int total_alpha = 10;
++static int gamma = 1;
++
++module_param(initial_alpha, int, 0644);
++MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
++module_param(total_alpha, int, 0644);
++MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
++module_param(gamma, int, 0644);
++MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
++
++#define MPTCP_WVEGAS_SCALE 16
++
++/* wVegas variables */
++struct wvegas {
++ u32 beg_snd_nxt; /* right edge during last RTT */
++	u8	doing_wvegas_now; /* if true, do wvegas for this RTT */
++
++ u16 cnt_rtt; /* # of RTTs measured within last RTT */
++ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
++ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
++
++ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
++ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
++ int alpha; /* alpha for each subflows */
++
++	u32 queue_delay; /* queue delay */
++};
++
++
++static inline u64 mptcp_wvegas_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static void wvegas_enable(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 1;
++
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++
++ wvegas->instant_rate = 0;
++ wvegas->alpha = initial_alpha;
++ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
++
++ wvegas->queue_delay = 0;
++}
++
++static inline void wvegas_disable(const struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 0;
++}
++
++static void mptcp_wvegas_init(struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->base_rtt = 0x7fffffff;
++ wvegas_enable(sk);
++}
++
++static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
++{
++ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
++}
++
++static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ u32 vrtt;
++
++ if (rtt_us < 0)
++ return;
++
++ vrtt = rtt_us + 1;
++
++ if (vrtt < wvegas->base_rtt)
++ wvegas->base_rtt = vrtt;
++
++ wvegas->sampled_rtt += vrtt;
++ wvegas->cnt_rtt++;
++}
++
++static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
++{
++ if (ca_state == TCP_CA_Open)
++ wvegas_enable(sk);
++ else
++ wvegas_disable(sk);
++}
++
++static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_CWND_RESTART) {
++ mptcp_wvegas_init(sk);
++ } else if (event == CA_EVENT_LOSS) {
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ wvegas->instant_rate = 0;
++ }
++}
++
++static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
++{
++ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
++}
++
++static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
++{
++ u64 total_rate = 0;
++ struct sock *sub_sk;
++ const struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!mpcb)
++ return wvegas->weight;
++
++
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
++
++ /* sampled_rtt is initialized by 0 */
++ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
++ total_rate += sub_wvegas->instant_rate;
++ }
++
++ if (total_rate && wvegas->instant_rate)
++ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
++ else
++ return wvegas->weight;
++}
++
++static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!wvegas->doing_wvegas_now) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (after(ack, wvegas->beg_snd_nxt)) {
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ if (wvegas->cnt_rtt <= 2) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ } else {
++ u32 rtt, diff, q_delay;
++ u64 target_cwnd;
++
++ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
++ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
++
++ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
++
++ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
++ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++
++ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ } else {
++ if (diff >= wvegas->alpha) {
++ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
++ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
++ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
++ }
++ if (diff > wvegas->alpha) {
++ tp->snd_cwnd--;
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++ } else if (diff < wvegas->alpha) {
++ tp->snd_cwnd++;
++ }
++
++				/* Try to drain link queue if needed */
++ q_delay = rtt - wvegas->base_rtt;
++ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
++ wvegas->queue_delay = q_delay;
++
++ if (q_delay >= 2 * wvegas->queue_delay) {
++ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
++ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
++ wvegas->queue_delay = 0;
++ }
++ }
++
++ if (tp->snd_cwnd < 2)
++ tp->snd_cwnd = 2;
++ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
++ tp->snd_cwnd = tp->snd_cwnd_clamp;
++
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ }
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++ }
++ /* Use normal slow start */
++ else if (tp->snd_cwnd <= tp->snd_ssthresh)
++ tcp_slow_start(tp, acked);
++}
++
++
++static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
++ .init = mptcp_wvegas_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_wvegas_cong_avoid,
++ .pkts_acked = mptcp_wvegas_pkts_acked,
++ .set_state = mptcp_wvegas_state,
++ .cwnd_event = mptcp_wvegas_cwnd_event,
++
++ .owner = THIS_MODULE,
++ .name = "wvegas",
++};
++
++static int __init mptcp_wvegas_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
++ tcp_register_congestion_control(&mptcp_wvegas);
++ return 0;
++}
++
++static void __exit mptcp_wvegas_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_wvegas);
++}
++
++module_init(mptcp_wvegas_register);
++module_exit(mptcp_wvegas_unregister);
++
++MODULE_AUTHOR("Yu Cao, Enhuan Dong");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP wVegas");
++MODULE_VERSION("0.1");
diff --git a/4567_distro-Gentoo-Kconfig.patch b/4567_distro-Gentoo-Kconfig.patch
index 71dbf09..652e2a7 100644
--- a/4567_distro-Gentoo-Kconfig.patch
+++ b/4567_distro-Gentoo-Kconfig.patch
@@ -1,15 +1,15 @@
---- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
-+++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
+--- a/Kconfig 2014-04-02 09:45:05.389224541 -0400
++++ b/Kconfig 2014-04-02 09:45:39.269224273 -0400
@@ -8,4 +8,6 @@ config SRCARCH
- string
- option env="SRCARCH"
-
+ string
+ option env="SRCARCH"
+
+source "distro/Kconfig"
+
source "arch/$SRCARCH/Kconfig"
---- /dev/null 2014-09-22 14:19:24.316977284 -0400
-+++ distro/Kconfig 2014-09-22 19:30:35.670959281 -0400
-@@ -0,0 +1,109 @@
+--- 1969-12-31 19:00:00.000000000 -0500
++++ b/distro/Kconfig 2014-04-02 09:57:03.539218861 -0400
+@@ -0,0 +1,108 @@
+menu "Gentoo Linux"
+
+config GENTOO_LINUX
@@ -34,8 +34,6 @@
+ select DEVTMPFS
+ select TMPFS
+
-+ select FHANDLE
-+
+ select MMU
+ select SHMEM
+
@@ -91,6 +89,7 @@
+ select CGROUPS
+ select EPOLL
+ select FANOTIFY
++ select FHANDLE
+ select INOTIFY_USER
+ select NET
+ select NET_NS
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-09-27 13:37 Mike Pagano
From: Mike Pagano @ 2014-09-27 13:37 UTC
To: gentoo-commits
commit: 1b28da13cd7150f66fae58043d3de661105a513a
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Sep 27 13:37:37 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Sep 27 13:37:37 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=1b28da13
Move mptcp patch to experimental
---
0000_README | 9 +-
5010_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 ++++++++++++++++++++++++++
2 files changed, 19235 insertions(+), 4 deletions(-)
diff --git a/0000_README b/0000_README
index d92e6b7..3cc9441 100644
--- a/0000_README
+++ b/0000_README
@@ -58,10 +58,6 @@ Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
-Patch: 2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
-From: http://multipath-tcp.org/
-Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
-
Patch: 2700_ThinkPad-30-brightness-control-fix.patch
From: Seth Forshee <seth.forshee@canonical.com>
Desc: ACPI: Disable Windows 8 compatibility for some Lenovo ThinkPads
@@ -101,3 +97,8 @@ Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
+
+Patch: 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+From: http://multipath-tcp.org/
+Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
+
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
new file mode 100644
index 0000000..3000da3
--- /dev/null
+++ b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
@@ -0,0 +1,19230 @@
+diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
+index 768a0fb67dd6..5a46d91a8df9 100644
+--- a/drivers/infiniband/hw/cxgb4/cm.c
++++ b/drivers/infiniband/hw/cxgb4/cm.c
+@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
+ */
+ memset(&tmp_opt, 0, sizeof(tmp_opt));
+ tcp_clear_options(&tmp_opt);
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
+
+ req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
+ memset(req, 0, sizeof(*req));
+diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
+index 2faef339d8f2..d86c853ffaad 100644
+--- a/include/linux/ipv6.h
++++ b/include/linux/ipv6.h
+@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return inet_sk(__sk)->pinet6;
+ }
+
+-static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
+-{
+- struct request_sock *req = reqsk_alloc(ops);
+-
+- if (req)
+- inet_rsk(req)->pktopts = NULL;
+-
+- return req;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return (struct raw6_sock *)sk;
+@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return NULL;
+ }
+
+-static inline struct inet6_request_sock *
+- inet6_rsk(const struct request_sock *rsk)
+-{
+- return NULL;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return NULL;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..99ea4b0e3693 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
+ bool zero_okay,
+ __sum16 check)
+ {
+- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
+- skb->csum_valid = 1;
++ if (skb_csum_unnecessary(skb)) {
++ return false;
++ } else if (zero_okay && !check) {
++ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ return false;
+ }
+
+diff --git a/include/linux/tcp.h b/include/linux/tcp.h
+index a0513210798f..7bc2e078d6ca 100644
+--- a/include/linux/tcp.h
++++ b/include/linux/tcp.h
+@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
+ /* TCP Fast Open */
+ #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
+ #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
+-#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
++#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
+
+ /* TCP Fast Open Cookie as stored in memory */
+ struct tcp_fastopen_cookie {
+@@ -72,6 +72,51 @@ struct tcp_sack_block {
+ u32 end_seq;
+ };
+
++struct tcp_out_options {
++ u16 options; /* bit field of OPTION_* */
++ u8 ws; /* window scale, 0 to disable */
++ u8 num_sack_blocks;/* number of SACK blocks to include */
++ u8 hash_size; /* bytes in hash_location */
++ u16 mss; /* 0 to disable */
++ __u8 *hash_location; /* temporary pointer, overloaded */
++ __u32 tsval, tsecr; /* need to include OPTION_TS */
++ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
++#ifdef CONFIG_MPTCP
++ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
++ u8 dss_csum:1,
++ add_addr_v4:1,
++ add_addr_v6:1; /* dss-checksum required? */
++
++ union {
++ struct {
++ __u64 sender_key; /* sender's key for mptcp */
++ __u64 receiver_key; /* receiver's key for mptcp */
++ } mp_capable;
++
++ struct {
++ __u64 sender_truncated_mac;
++ __u32 sender_nonce;
++ /* random number of the sender */
++ __u32 token; /* token for mptcp */
++ u8 low_prio:1;
++ } mp_join_syns;
++ };
++
++ struct {
++ struct in_addr addr;
++ u8 addr_id;
++ } add_addr4;
++
++ struct {
++ struct in6_addr addr;
++ u8 addr_id;
++ } add_addr6;
++
++ u16 remove_addrs; /* list of address id */
++ u8 addr_id; /* address id (mp_join or add_address) */
++#endif /* CONFIG_MPTCP */
++};
++
+ /*These are used to set the sack_ok field in struct tcp_options_received */
+ #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
+ #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
+@@ -95,6 +140,9 @@ struct tcp_options_received {
+ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
+ };
+
++struct mptcp_cb;
++struct mptcp_tcp_sock;
++
+ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+ {
+ rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
+
+ struct tcp_request_sock {
+ struct inet_request_sock req;
+-#ifdef CONFIG_TCP_MD5SIG
+- /* Only used by TCP MD5 Signature so far. */
+ const struct tcp_request_sock_ops *af_specific;
+-#endif
+ struct sock *listener; /* needed for TFO */
+ u32 rcv_isn;
+ u32 snt_isn;
+@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
+ return (struct tcp_request_sock *)req;
+ }
+
++struct tcp_md5sig_key;
++
+ struct tcp_sock {
+ /* inet_connection_sock has to be the first member of tcp_sock */
+ struct inet_connection_sock inet_conn;
+@@ -326,6 +373,37 @@ struct tcp_sock {
+ * socket. Used to retransmit SYNACKs etc.
+ */
+ struct request_sock *fastopen_rsk;
++
++ /* MPTCP/TCP-specific callbacks */
++ const struct tcp_sock_ops *ops;
++
++ struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ /* We keep these flags even if CONFIG_MPTCP is not checked, because
++ * it allows checking MPTCP capability just by checking the mpc flag,
++ * rather than adding ifdefs everywhere.
++ */
++ u16 mpc:1, /* Other end is multipath capable */
++ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
++ send_mp_fclose:1,
++ request_mptcp:1, /* Did we send out an MP_CAPABLE?
++ * (this speeds up mptcp_doit() in tcp_recvmsg)
++ */
++ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
++ pf:1, /* Potentially Failed state: when this flag is set, we
++ * stop using the subflow
++ */
++ mp_killed:1, /* Killed with a tcp_done in mptcp? */
++ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
++ is_master_sk,
++ close_it:1, /* Must close socket in mptcp_data_ready? */
++ closing:1;
++ struct mptcp_tcp_sock *mptcp;
++#ifdef CONFIG_MPTCP
++ struct hlist_nulls_node tk_table;
++ u32 mptcp_loc_token;
++ u64 mptcp_loc_key;
++#endif /* CONFIG_MPTCP */
+ };
+
+ enum tsq_flags {
+@@ -337,6 +415,8 @@ enum tsq_flags {
+ TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
+ * tcp_v{4|6}_mtu_reduced()
+ */
++ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
++ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
+ };
+
+ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *tw_md5_key;
+ #endif
++ struct mptcp_tw *mptcp_tw;
+ };
+
+ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
+diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
+index 74af137304be..83f63033897a 100644
+--- a/include/net/inet6_connection_sock.h
++++ b/include/net/inet6_connection_sock.h
+@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
+
+ struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
+ const struct request_sock *req);
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize);
+
+ struct request_sock *inet6_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+diff --git a/include/net/inet_common.h b/include/net/inet_common.h
+index fe7994c48b75..780f229f46a8 100644
+--- a/include/net/inet_common.h
++++ b/include/net/inet_common.h
+@@ -1,6 +1,8 @@
+ #ifndef _INET_COMMON_H
+ #define _INET_COMMON_H
+
++#include <net/sock.h>
++
+ extern const struct proto_ops inet_stream_ops;
+ extern const struct proto_ops inet_dgram_ops;
+
+@@ -13,6 +15,8 @@ struct sock;
+ struct sockaddr;
+ struct socket;
+
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
+ int inet_release(struct socket *sock);
+ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags);
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..f62159e39839 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -30,6 +30,7 @@
+
+ struct inet_bind_bucket;
+ struct tcp_congestion_ops;
++struct tcp_options_received;
+
+ /*
+ * Pointers to address related TCP functions
+@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
+
+ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
+
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize);
++
+ struct request_sock *inet_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+ const __be16 rport,
+diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
+index b1edf17bec01..6a32d8d6b85e 100644
+--- a/include/net/inet_sock.h
++++ b/include/net/inet_sock.h
+@@ -86,10 +86,14 @@ struct inet_request_sock {
+ wscale_ok : 1,
+ ecn_ok : 1,
+ acked : 1,
+- no_srccheck: 1;
++ no_srccheck: 1,
++ mptcp_rqsk : 1,
++ saw_mpc : 1;
+ kmemcheck_bitfield_end(flags);
+- struct ip_options_rcu *opt;
+- struct sk_buff *pktopts;
++ union {
++ struct ip_options_rcu *opt;
++ struct sk_buff *pktopts;
++ };
+ u32 ir_mark;
+ };
+
+diff --git a/include/net/mptcp.h b/include/net/mptcp.h
+new file mode 100644
+index 000000000000..712780fc39e4
+--- /dev/null
++++ b/include/net/mptcp.h
+@@ -0,0 +1,1439 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_H
++#define _MPTCP_H
++
++#include <linux/inetdevice.h>
++#include <linux/ipv6.h>
++#include <linux/list.h>
++#include <linux/net.h>
++#include <linux/netpoll.h>
++#include <linux/skbuff.h>
++#include <linux/socket.h>
++#include <linux/tcp.h>
++#include <linux/kernel.h>
++
++#include <asm/byteorder.h>
++#include <asm/unaligned.h>
++#include <crypto/hash.h>
++#include <net/tcp.h>
++
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ #define ntohll(x) be64_to_cpu(x)
++ #define htonll(x) cpu_to_be64(x)
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ #define ntohll(x) (x)
++ #define htonll(x) (x)
++#endif
++
++struct mptcp_loc4 {
++ u8 loc4_id;
++ u8 low_prio:1;
++ struct in_addr addr;
++};
++
++struct mptcp_rem4 {
++ u8 rem4_id;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct mptcp_loc6 {
++ u8 loc6_id;
++ u8 low_prio:1;
++ struct in6_addr addr;
++};
++
++struct mptcp_rem6 {
++ u8 rem6_id;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_request_sock {
++ struct tcp_request_sock req;
++ /* hlist-nulls entry to the hash-table. Depending on whether this is
++ * a new MPTCP connection or an additional subflow, the request-socket
++ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
++ */
++ struct hlist_nulls_node hash_entry;
++
++ union {
++ struct {
++ /* Only on initial subflows */
++ u64 mptcp_loc_key;
++ u64 mptcp_rem_key;
++ u32 mptcp_loc_token;
++ };
++
++ struct {
++ /* Only on additional subflows */
++ struct mptcp_cb *mptcp_mpcb;
++ u32 mptcp_rem_nonce;
++ u32 mptcp_loc_nonce;
++ u64 mptcp_hash_tmac;
++ };
++ };
++
++ u8 loc_id;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 dss_csum:1,
++ is_sub:1, /* Is this a new subflow? */
++ low_prio:1, /* Interface set to low-prio? */
++ rcv_low_prio:1;
++};
++
++struct mptcp_options_received {
++ u16 saw_mpc:1,
++ dss_csum:1,
++ drop_me:1,
++
++ is_mp_join:1,
++ join_ack:1,
++
++ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
++ * 0x2 - low-prio set for another subflow
++ */
++ low_prio:1,
++
++ saw_add_addr:2, /* Saw at least one add_addr option:
++ * 0x1: IPv4 - 0x2: IPv6
++ */
++ more_add_addr:1, /* Saw one more add-addr. */
++
++ saw_rem_addr:1, /* Saw at least one rem_addr option */
++ more_rem_addr:1, /* Saw one more rem-addr. */
++
++ mp_fail:1,
++ mp_fclose:1;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 prio_addr_id; /* Address-id in the MP_PRIO */
++
++ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
++ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
++
++ u32 data_ack;
++ u32 data_seq;
++ u16 data_len;
++
++ u32 mptcp_rem_token;/* Remote token */
++
++ /* Key inside the option (from mp_capable or fast_close) */
++ u64 mptcp_key;
++
++ u32 mptcp_recv_nonce;
++ u64 mptcp_recv_tmac;
++ u8 mptcp_recv_mac[20];
++};
++
++struct mptcp_tcp_sock {
++ struct tcp_sock *next; /* Next subflow socket */
++ struct hlist_node cb_list;
++ struct mptcp_options_received rx_opt;
++
++ /* Those three fields record the current mapping */
++ u64 map_data_seq;
++ u32 map_subseq;
++ u16 map_data_len;
++ u16 slave_sk:1,
++ fully_established:1,
++ establish_increased:1,
++ second_packet:1,
++ attached:1,
++ send_mp_fail:1,
++ include_mpc:1,
++ mapping_present:1,
++ map_data_fin:1,
++ low_prio:1, /* use this socket as backup */
++ rcv_low_prio:1, /* Peer sent low-prio option to us */
++ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
++ pre_established:1; /* State between sending 3rd ACK and
++ * receiving the fourth ack of new subflows.
++ */
++
++ /* isn: needed to translate abs to relative subflow seqnums */
++ u32 snt_isn;
++ u32 rcv_isn;
++ u8 path_index;
++ u8 loc_id;
++ u8 rem_id;
++
++#define MPTCP_SCHED_SIZE 4
++ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
++
++ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
++ * skb in the ofo-queue.
++ */
++
++ int init_rcv_wnd;
++ u32 infinite_cutoff_seq;
++ struct delayed_work work;
++ u32 mptcp_loc_nonce;
++ struct tcp_sock *tp; /* Where is my daddy? */
++ u32 last_end_data_seq;
++
++ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
++ struct timer_list mptcp_ack_timer;
++
++ /* HMAC of the third ack */
++ char sender_mac[20];
++};
++
++struct mptcp_tw {
++ struct list_head list;
++ u64 loc_key;
++ u64 rcv_nxt;
++ struct mptcp_cb __rcu *mpcb;
++ u8 meta_tw:1,
++ in_list:1;
++};
++
++#define MPTCP_PM_NAME_MAX 16
++struct mptcp_pm_ops {
++ struct list_head list;
++
++ /* Signal the creation of a new MPTCP-session. */
++ void (*new_session)(const struct sock *meta_sk);
++ void (*release_sock)(struct sock *meta_sk);
++ void (*fully_established)(struct sock *meta_sk);
++ void (*new_remote_address)(struct sock *meta_sk);
++ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio);
++ void (*addr_signal)(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts, struct sk_buff *skb);
++ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id);
++ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
++ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
++ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
++
++ char name[MPTCP_PM_NAME_MAX];
++ struct module *owner;
++};
++
++#define MPTCP_SCHED_NAME_MAX 16
++struct mptcp_sched_ops {
++ struct list_head list;
++
++ struct sock * (*get_subflow)(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test);
++ struct sk_buff * (*next_segment)(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit);
++ void (*init)(struct sock *sk);
++
++ char name[MPTCP_SCHED_NAME_MAX];
++ struct module *owner;
++};
++
++struct mptcp_cb {
++ /* list of sockets in this multipath connection */
++ struct tcp_sock *connection_list;
++ /* list of sockets that need a call to release_cb */
++ struct hlist_head callback_list;
++
++ /* High-order bits of 64-bit sequence numbers */
++ u32 snd_high_order[2];
++ u32 rcv_high_order[2];
++
++ u16 send_infinite_mapping:1,
++ in_time_wait:1,
++ list_rcvd:1, /* XXX TO REMOVE */
++ addr_signal:1, /* Path-manager wants us to call addr_signal */
++ dss_csum:1,
++ server_side:1,
++ infinite_mapping_rcv:1,
++ infinite_mapping_snd:1,
++ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
++ passive_close:1,
++ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
++ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
++
++ /* socket count in this connection */
++ u8 cnt_subflows;
++ u8 cnt_established;
++
++ struct mptcp_sched_ops *sched_ops;
++
++ struct sk_buff_head reinject_queue;
++ /* First cache-line boundary is here minus 8 bytes. But from the
++ * reinject-queue only the next and prev pointers are regularly
++ * accessed. Thus, the whole data-path is on a single cache-line.
++ */
++
++ u64 csum_cutoff_seq;
++
++ /***** Start of fields, used for connection closure */
++ spinlock_t tw_lock;
++ unsigned char mptw_state;
++ u8 dfin_path_index;
++
++ struct list_head tw_list;
++
++ /***** Start of fields, used for subflow establishment and closure */
++ atomic_t mpcb_refcnt;
++
++ /* Mutex needed, because otherwise mptcp_close will complain that the
++ * socket is owned by the user.
++ * E.g., mptcp_sub_close_wq is taking the meta-lock.
++ */
++ struct mutex mpcb_mutex;
++
++ /***** Start of fields, used for subflow establishment */
++ struct sock *meta_sk;
++
++ /* Master socket, also part of the connection_list, this
++ * socket is the one that the application sees.
++ */
++ struct sock *master_sk;
++
++ __u64 mptcp_loc_key;
++ __u64 mptcp_rem_key;
++ __u32 mptcp_loc_token;
++ __u32 mptcp_rem_token;
++
++#define MPTCP_PM_SIZE 608
++ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
++ struct mptcp_pm_ops *pm_ops;
++
++ u32 path_index_bits;
++ /* Next pi to pick up in case a new path becomes available */
++ u8 next_path_index;
++
++ /* Original snd/rcvbuf of the initial subflow.
++ * Used for the new subflows on the server-side to allow correct
++ * autotuning
++ */
++ int orig_sk_rcvbuf;
++ int orig_sk_sndbuf;
++ u32 orig_window_clamp;
++
++ /* Timer for retransmitting SYN/ACK+MP_JOIN */
++ struct timer_list synack_timer;
++};
++
++#define MPTCP_SUB_CAPABLE 0
++#define MPTCP_SUB_LEN_CAPABLE_SYN 12
++#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_CAPABLE_ACK 20
++#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
++
++#define MPTCP_SUB_JOIN 1
++#define MPTCP_SUB_LEN_JOIN_SYN 12
++#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_JOIN_SYNACK 16
++#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
++#define MPTCP_SUB_LEN_JOIN_ACK 24
++#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
++
++#define MPTCP_SUB_DSS 2
++#define MPTCP_SUB_LEN_DSS 4
++#define MPTCP_SUB_LEN_DSS_ALIGN 4
++
++/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
++ * as they are part of the DSS-option.
++ * To get the total length, just add the different options together.
++ */
++#define MPTCP_SUB_LEN_SEQ 10
++#define MPTCP_SUB_LEN_SEQ_CSUM 12
++#define MPTCP_SUB_LEN_SEQ_ALIGN 12
++
++#define MPTCP_SUB_LEN_SEQ_64 14
++#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
++#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
++
++#define MPTCP_SUB_LEN_ACK 4
++#define MPTCP_SUB_LEN_ACK_ALIGN 4
++
++#define MPTCP_SUB_LEN_ACK_64 8
++#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
++
++/* This is the "default" option-length we will send out most often.
++ * MPTCP DSS-header
++ * 32-bit data sequence number
++ * 32-bit data ack
++ *
++ * It is necessary to calculate the effective MSS we will be using when
++ * sending data.
++ */
++#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
++ MPTCP_SUB_LEN_SEQ_ALIGN + \
++ MPTCP_SUB_LEN_ACK_ALIGN)
++
++#define MPTCP_SUB_ADD_ADDR 3
++#define MPTCP_SUB_LEN_ADD_ADDR4 8
++#define MPTCP_SUB_LEN_ADD_ADDR6 20
++#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
++#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
++
++#define MPTCP_SUB_REMOVE_ADDR 4
++#define MPTCP_SUB_LEN_REMOVE_ADDR 4
++
++#define MPTCP_SUB_PRIO 5
++#define MPTCP_SUB_LEN_PRIO 3
++#define MPTCP_SUB_LEN_PRIO_ADDR 4
++#define MPTCP_SUB_LEN_PRIO_ALIGN 4
++
++#define MPTCP_SUB_FAIL 6
++#define MPTCP_SUB_LEN_FAIL 12
++#define MPTCP_SUB_LEN_FAIL_ALIGN 12
++
++#define MPTCP_SUB_FCLOSE 7
++#define MPTCP_SUB_LEN_FCLOSE 12
++#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
++
++
++#define OPTION_MPTCP (1 << 5)
++
++#ifdef CONFIG_MPTCP
++
++/* Used for checking if the mptcp initialization has been successful */
++extern bool mptcp_init_failed;
++
++/* MPTCP options */
++#define OPTION_TYPE_SYN (1 << 0)
++#define OPTION_TYPE_SYNACK (1 << 1)
++#define OPTION_TYPE_ACK (1 << 2)
++#define OPTION_MP_CAPABLE (1 << 3)
++#define OPTION_DATA_ACK (1 << 4)
++#define OPTION_ADD_ADDR (1 << 5)
++#define OPTION_MP_JOIN (1 << 6)
++#define OPTION_MP_FAIL (1 << 7)
++#define OPTION_MP_FCLOSE (1 << 8)
++#define OPTION_REMOVE_ADDR (1 << 9)
++#define OPTION_MP_PRIO (1 << 10)
++
++/* MPTCP flags: both TX and RX */
++#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
++#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
++#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
++/* MPTCP flags: RX only */
++#define MPTCPHDR_ACK 0x08
++#define MPTCPHDR_SEQ64_SET 0x10 /* Did we receive a 64-bit seq number? */
++#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
++#define MPTCPHDR_DSS_CSUM 0x40
++#define MPTCPHDR_JOIN 0x80
++/* MPTCP flags: TX only */
++#define MPTCPHDR_INF 0x08
++
++struct mptcp_option {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_capable {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++ __u8 h:1,
++ rsv:5,
++ b:1,
++ a:1;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++ __u8 a:1,
++ b:1,
++ rsv:5,
++ h:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 sender_key;
++ __u64 receiver_key;
++} __attribute__((__packed__));
++
++struct mp_join {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ u32 token;
++ u32 nonce;
++ } syn;
++ struct {
++ __u64 mac;
++ u32 nonce;
++ } synack;
++ struct {
++ __u8 mac[20];
++ } ack;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_dss {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ A:1,
++ a:1,
++ M:1,
++ m:1,
++ F:1,
++ rsv2:3;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:3,
++ F:1,
++ m:1,
++ M:1,
++ a:1,
++ A:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_add_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ipver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ipver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ struct in_addr addr;
++ __be16 port;
++ } v4;
++ struct {
++ struct in6_addr addr;
++ __be16 port;
++ } v6;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_remove_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 rsv:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ /* list of addr_id */
++ __u8 addrs_id;
++};
++
++struct mp_fail {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __be64 data_seq;
++} __attribute__((__packed__));
++
++struct mp_fclose {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 key;
++} __attribute__((__packed__));
++
++struct mp_prio {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++} __attribute__((__packed__));
++
++static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
++{
++ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
++}
++
++#define MPTCP_APP 2
++
++extern int sysctl_mptcp_enabled;
++extern int sysctl_mptcp_checksum;
++extern int sysctl_mptcp_debug;
++extern int sysctl_mptcp_syn_retries;
++
++extern struct workqueue_struct *mptcp_wq;
++
++#define mptcp_debug(fmt, args...) \
++ do { \
++ if (unlikely(sysctl_mptcp_debug)) \
++ pr_err(__FILE__ ": " fmt, ##args); \
++ } while (0)
++
++/* Iterates over all subflows */
++#define mptcp_for_each_tp(mpcb, tp) \
++ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
++
++#define mptcp_for_each_sk(mpcb, sk) \
++ for ((sk) = (struct sock *)(mpcb)->connection_list; \
++ sk; \
++ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
++
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
++ for (__sk = (struct sock *)(__mpcb)->connection_list, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
++ __sk; \
++ __sk = __temp, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
++
++/* Iterates over all bit set to 1 in a bitset */
++#define mptcp_for_each_bit_set(b, i) \
++ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
++
++#define mptcp_for_each_bit_unset(b, i) \
++ mptcp_for_each_bit_set(~b, i)
++
++extern struct lock_class_key meta_key;
++extern struct lock_class_key meta_slock_key;
++extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
++
++/* This is needed to ensure that two subsequent key/nonce-generation result in
++ * different keys/nonces if the IPs and ports are the same.
++ */
++extern u32 mptcp_seed;
++
++#define MPTCP_HASH_SIZE 1024
++
++extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* Lock, protecting the two hash-tables that hold the token. Namely,
++ * mptcp_reqsk_tk_htb and tk_hashtable
++ */
++extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++/* Request-sockets can be hashed in the tk_htb for collision-detection or in
++ * the regular htb for join-connections. We need to define different NULLS
++ * values so that we can correctly detect a request-socket that has been
++ * recycled. See also c25eb3bfb9729.
++ */
++#define MPTCP_REQSK_NULLS_BASE (1U << 29)
++
++
++void mptcp_data_ready(struct sock *sk);
++void mptcp_write_space(struct sock *sk);
++
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk);
++void mptcp_ofo_queue(struct sock *meta_sk);
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags);
++void mptcp_del_sock(struct sock *sk);
++void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
++void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
++void mptcp_update_sndbuf(const struct tcp_sock *tp);
++void mptcp_send_fin(struct sock *meta_sk);
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
++bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt);
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size);
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb);
++void mptcp_close(struct sock *meta_sk, long timeout);
++int mptcp_doit(struct sock *sk);
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev);
++struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt);
++u32 __mptcp_select_window(struct sock *sk);
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++unsigned int mptcp_current_mss(struct sock *meta_sk);
++int mptcp_select_size(const struct sock *meta_sk, bool sg);
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out);
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
++void mptcp_fin(struct sock *meta_sk);
++void mptcp_retransmit_timer(struct sock *meta_sk);
++int mptcp_write_wakeup(struct sock *meta_sk);
++void mptcp_sub_close_wq(struct work_struct *work);
++void mptcp_sub_close(struct sock *sk, unsigned long delay);
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
++void mptcp_fallback_meta_sk(struct sock *meta_sk);
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_ack_handler(unsigned long);
++int mptcp_check_rtt(const struct tcp_sock *tp, int time);
++int mptcp_check_snd_buf(const struct tcp_sock *tp);
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb);
++void __init mptcp_init(void);
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
++void mptcp_destroy_sock(struct sock *sk);
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt);
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed);
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
++void mptcp_time_wait(struct sock *sk, int state, int timeo);
++void mptcp_disconnect(struct sock *sk);
++bool mptcp_should_expand_sndbuf(const struct sock *sk);
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_tsq_flags(struct sock *sk);
++void mptcp_tsq_sub_deferred(struct sock *meta_sk);
++struct mp_join *mptcp_find_join(const struct sk_buff *skb);
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
++void mptcp_hash_remove(struct tcp_sock *meta_tp);
++struct sock *mptcp_hash_find(const struct net *net, const u32 token);
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net);
++void mptcp_reqsk_destructor(struct request_sock *req);
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++int mptcp_check_req(struct sk_buff *skb, struct net *net);
++void mptcp_connect_init(struct sock *sk);
++void mptcp_sub_force_close(struct sock *sk);
++int mptcp_sub_len_remove_addr_align(u16 bitfield);
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb);
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
++void mptcp_init_congestion_control(struct sock *sk);
++
++/* MPTCP-path-manager registration/initialization functions */
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_init_path_manager(struct mptcp_cb *mpcb);
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
++void mptcp_fallback_default(struct mptcp_cb *mpcb);
++void mptcp_get_default_path_manager(char *name);
++int mptcp_set_default_path_manager(const char *name);
++extern struct mptcp_pm_ops mptcp_pm_default;
++
++/* MPTCP-scheduler registration/initialization functions */
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_init_scheduler(struct mptcp_cb *mpcb);
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
++void mptcp_get_default_scheduler(char *name);
++int mptcp_set_default_scheduler(const char *name);
++extern struct mptcp_sched_ops mptcp_sched_default;
++
++static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
++ unsigned long len)
++{
++ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
++ jiffies + len);
++}
++
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
++{
++ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
++}
++
++static inline bool is_mptcp_enabled(const struct sock *sk)
++{
++ if (!sysctl_mptcp_enabled || mptcp_init_failed)
++ return false;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return false;
++
++ return true;
++}
++
++static inline int mptcp_pi_to_flag(int pi)
++{
++ return 1 << (pi - 1);
++}
++
++static inline
++struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
++{
++ return (struct mptcp_request_sock *)req;
++}
++
++static inline
++struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
++{
++ return (struct request_sock *)req;
++}
++
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ struct sock *sk_it;
++
++ if (tcp_sk(sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
++ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
++ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
++ return false;
++ }
++
++ return true;
++}
++
++static inline void mptcp_push_pending_frames(struct sock *meta_sk)
++{
++ /* We check packets out and send-head here. TCP only checks the
++ * send-head. But, MPTCP also checks packets_out, as this is an
++ * indication that we might want to do opportunistic reinjection.
++ */
++ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
++ struct tcp_sock *tp = tcp_sk(meta_sk);
++
++ /* We don't care about the MSS, because it will be set in
++ * mptcp_write_xmit.
++ */
++ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
++ }
++}
++
++static inline void mptcp_send_reset(struct sock *sk)
++{
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++ mptcp_sub_force_close(sk);
++}
++
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
++}
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
++}
++
++/* Is it a data-fin while in infinite mapping mode?
++ * In infinite mode, a subflow-fin is in fact a data-fin.
++ */
++static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
++ const struct tcp_sock *tp)
++{
++ return mptcp_is_data_fin(skb) ||
++ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
++}
++
++static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
++{
++ u64 data_seq_high = (u32)(data_seq >> 32);
++
++ if (mpcb->rcv_high_order[0] == data_seq_high)
++ return 0;
++ else if (mpcb->rcv_high_order[1] == data_seq_high)
++ return MPTCPHDR_SEQ64_INDEX;
++ else
++ return MPTCPHDR_SEQ64_OFO;
++}
++
++/* Sets the data_seq and returns a pointer to the in-skb field of the data_seq.
++ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
++ */
++static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
++ u32 *data_seq,
++ struct mptcp_cb *mpcb)
++{
++ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
++
++ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ if (mpcb)
++ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
++
++ *data_seq = (u32)data_seq64;
++ ptr++;
++ } else {
++ *data_seq = get_unaligned_be32(ptr);
++ }
++
++ return ptr;
++}
++
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return tcp_sk(sk)->meta_sk;
++}
++
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return tcp_sk(tp->meta_sk);
++}
++
++static inline int is_meta_tp(const struct tcp_sock *tp)
++{
++ return tp->mpcb && mptcp_meta_tp(tp) == tp;
++}
++
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
++ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
++}
++
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
++}
++
++static inline void mptcp_hash_request_remove(struct request_sock *req)
++{
++ int in_softirq = 0;
++
++ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
++ return;
++
++ if (in_softirq()) {
++ spin_lock(&mptcp_reqsk_hlock);
++ in_softirq = 1;
++ } else {
++ spin_lock_bh(&mptcp_reqsk_hlock);
++ }
++
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++
++ if (in_softirq)
++ spin_unlock(&mptcp_reqsk_hlock);
++ else
++ spin_unlock_bh(&mptcp_reqsk_hlock);
++}
++
++static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
++{
++ mopt->saw_mpc = 0;
++ mopt->dss_csum = 0;
++ mopt->drop_me = 0;
++
++ mopt->is_mp_join = 0;
++ mopt->join_ack = 0;
++
++ mopt->saw_low_prio = 0;
++ mopt->low_prio = 0;
++
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline void mptcp_reset_mopt(struct tcp_sock *tp)
++{
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ mopt->saw_low_prio = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->join_ack = 0;
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
++ const struct mptcp_cb *mpcb)
++{
++ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
++ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
++}
++
++static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
++ u32 data_seq_32)
++{
++ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
++}
++
++static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
++{
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_nxt);
++}
++
++static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
++{
++ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
++ }
++}
++
++static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
++ u32 old_rcv_nxt)
++{
++ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
++ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
++ }
++}
++
++static inline int mptcp_sk_can_send(const struct sock *sk)
++{
++ return tcp_passive_fastopen(sk) ||
++ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
++ !tcp_sk(sk)->mptcp->pre_established);
++}
++
++static inline int mptcp_sk_can_recv(const struct sock *sk)
++{
++ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
++}
++
++static inline int mptcp_sk_can_send_ack(const struct sock *sk)
++{
++ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
++ TCPF_CLOSE | TCPF_LISTEN)) &&
++ !tcp_sk(sk)->mptcp->pre_established;
++}
++
++/* Only support GSO if all subflows support it */
++static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!sk_can_gso(sk))
++ return false;
++ }
++ return true;
++}
++
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!(sk->sk_route_caps & NETIF_F_SG))
++ return false;
++ }
++ return true;
++}
++
++static inline void mptcp_set_rto(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *sk_it;
++ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
++ __u32 max_rto = 0;
++
++ /* We are in recovery-phase on the MPTCP-level. Do not update the
++ * RTO, because this would kill exponential backoff.
++ */
++ if (micsk->icsk_retransmits)
++ return;
++
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send(sk_it) &&
++ inet_csk(sk_it)->icsk_rto > max_rto)
++ max_rto = inet_csk(sk_it)->icsk_rto;
++ }
++ if (max_rto) {
++ micsk->icsk_rto = max_rto << 1;
++
++ /* A successful rto-measurement - reset backoff counter */
++ micsk->icsk_backoff = 0;
++ }
++}
++
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return sysctl_mptcp_syn_retries;
++}
++
++static inline void mptcp_sub_close_passive(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
++
++ /* Only close if the app did a send-shutdown (passive close), and we
++ * received the data-ack of the data-fin.
++ */
++ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
++ mptcp_sub_close(sk, 0);
++}
++
++static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If data has been acknowledged on the meta-level, fully_established
++ * will have been set before and thus we will not fall back to infinite
++ * mapping.
++ */
++ if (likely(tp->mptcp->fully_established))
++ return false;
++
++ if (!(flag & MPTCP_FLAG_DATA_ACKED))
++ return false;
++
++ /* Don't fall back twice ;) */
++ if (tp->mpcb->infinite_mapping_snd)
++ return false;
++
++ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
++ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
++ __builtin_return_address(0));
++ if (!is_master_tp(tp))
++ return true;
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++
++ return false;
++}
++
++/* Find the first index whose bit in the bit-field == 0 */
++static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
++{
++ u8 base = mpcb->next_path_index;
++ int i;
++
++ /* Start at 1, because 0 is reserved for the meta-sk */
++ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
++ if (i + base < 1)
++ continue;
++ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ i += base;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
++ if (i >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ if (i < 1)
++ continue;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++
++ return 0;
++}
++
++static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
++{
++ return sk->sk_family == AF_INET6 &&
++ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
++}
++
++/* TCP and MPTCP mpc flag-depending functions */
++u16 mptcp_select_window(struct sock *sk);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_tcp_set_rto(struct sock *sk);
++
++/* TCP and MPTCP flag-depending functions */
++bool mptcp_prune_ofo_queue(struct sock *sk);
++
++#else /* CONFIG_MPTCP */
++#define mptcp_debug(fmt, args...) \
++ do { \
++ } while (0)
++
++/* Without MPTCP, we just do one iteration
++ * over the only socket available. This assumes that
++ * the sk/tp arg is the socket in that case.
++ */
++#define mptcp_for_each_sk(mpcb, sk)
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return NULL;
++}
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return NULL;
++}
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return 0;
++}
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
++static inline void mptcp_del_sock(const struct sock *sk) {}
++static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
++static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
++static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
++static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
++ const struct sock *sk) {}
++static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
++static inline void mptcp_set_rto(const struct sock *sk) {}
++static inline void mptcp_send_fin(const struct sock *meta_sk) {}
++static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_syn_options(const struct sock *sk,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++static inline void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++
++static inline void mptcp_established_options(struct sock *sk,
++ struct sk_buff *skb,
++ struct tcp_out_options *opts,
++ unsigned *size) {}
++static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb) {}
++static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
++static inline int mptcp_doit(struct sock *sk)
++{
++ return 0;
++}
++static inline int mptcp_check_req_fastopen(struct sock *child,
++ struct request_sock *req)
++{
++ return 1;
++}
++static inline int mptcp_check_req_master(const struct sock *sk,
++ const struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ return 1;
++}
++static inline struct sock *mptcp_check_req_child(struct sock *sk,
++ struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ return NULL;
++}
++static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ return 0;
++}
++static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ return 0;
++}
++static inline void mptcp_sub_close_passive(struct sock *sk) {}
++static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
++{
++ return false;
++}
++static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
++static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ return 0;
++}
++static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return 0;
++}
++static inline void mptcp_send_reset(const struct sock *sk) {}
++static inline int mptcp_handle_options(struct sock *sk,
++ const struct tcphdr *th,
++ struct sk_buff *skb)
++{
++ return 0;
++}
++static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
++static inline void __init mptcp_init(void) {}
++static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ return 0;
++}
++static inline bool mptcp_sk_can_gso(const struct sock *sk)
++{
++ return false;
++}
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ return false;
++}
++static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
++ u32 mss_now, int large_allowed)
++{
++ return 0;
++}
++static inline void mptcp_destroy_sock(struct sock *sk) {}
++static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
++ struct sock **skptr,
++ struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ return 0;
++}
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ return false;
++}
++static inline int mptcp_init_tw_sock(struct sock *sk,
++ struct tcp_timewait_sock *tw)
++{
++ return 0;
++}
++static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
++static inline void mptcp_disconnect(struct sock *sk) {}
++static inline void mptcp_tsq_flags(struct sock *sk) {}
++static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
++static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
++static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
++static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct tcp_options_received *rx_opt,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb) {}
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_H */
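[Editorial note: the helpers `mptcp_get_data_seq_64()` and `mptcp_check_rcvseq_wrap()` in the hunk above extend the 32-bit on-the-wire data sequence number to 64 bits by keeping per-epoch high-order words in the control block. The following is a hypothetical standalone userspace model of that scheme, not part of the patch; the function and variable names outside the patch's own (`rcv_nxt`, `rcv_high_order`) are made up for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Model of mptcp_get_data_seq_64(): the full 64-bit data sequence is the
 * stored high-order word for the current wrap epoch, concatenated with
 * the 32-bit sequence carried in the DSS option. */
static uint64_t get_data_seq_64(uint32_t high_order, uint32_t data_seq_32)
{
	return ((uint64_t)high_order << 32) | data_seq_32;
}

/* Model of the wrap test in mptcp_check_rcvseq_wrap(): if rcv_nxt moved
 * numerically backwards, the 32-bit space wrapped, and the patch then
 * advances the high-order word and flips rcv_hiseq_index. */
static int rcv_seq_wrapped(uint32_t old_rcv_nxt, uint32_t new_rcv_nxt)
{
	return old_rcv_nxt > new_rcv_nxt;
}
```

With high-order word 1, a 32-bit sequence of 5 maps to 0x1_00000005; a receive next that jumps from 0xffffff00 down to 0x10 signals a wrap.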
+diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
+new file mode 100644
+index 000000000000..93ad97c77c5a
+--- /dev/null
++++ b/include/net/mptcp_v4.h
+@@ -0,0 +1,67 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef MPTCP_V4_H_
++#define MPTCP_V4_H_
++
++
++#include <linux/in.h>
++#include <linux/skbuff.h>
++#include <net/mptcp.h>
++#include <net/request_sock.h>
++#include <net/sock.h>
++
++extern struct request_sock_ops mptcp_request_sock_ops;
++extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++#ifdef CONFIG_MPTCP
++
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net);
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem);
++int mptcp_pm_v4_init(void);
++void mptcp_pm_v4_undo(void);
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++
++#else
++
++static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
++ const struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* MPTCP_V4_H_ */
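[Editorial note: `mptcp_set_new_pathindex()` in the mptcp.h hunk above allocates subflow path indices out of a bit-field, starting the search at `next_path_index`, wrapping around, and reserving index 0 for the meta-socket. Below is a hypothetical userspace sketch of that allocation, not part of the patch; `struct pi_state` and `alloc_path_index` are invented names for illustration.]

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of the path-index bit-field from the patch:
 * path_index_bits marks indices in use, next_path_index is a
 * rotating search hint.  Index 0 stays reserved; 0 is returned
 * only when every index is taken. */
struct pi_state {
	uint32_t path_index_bits;
	uint8_t next_path_index;
};

static uint8_t alloc_path_index(struct pi_state *s)
{
	/* First pass: search upwards from the hint. */
	for (unsigned i = s->next_path_index; i < 32; i++) {
		if (i >= 1 && !(s->path_index_bits & (1u << i))) {
			s->path_index_bits |= 1u << i;
			s->next_path_index = (uint8_t)(i + 1);
			return (uint8_t)i;
		}
	}
	/* Second pass: wrap around, still skipping the reserved index 0. */
	for (unsigned i = 1; i < 32; i++) {
		if (!(s->path_index_bits & (1u << i))) {
			s->path_index_bits |= 1u << i;
			s->next_path_index = (uint8_t)(i + 1);
			return (uint8_t)i;
		}
	}
	return 0;
}
```

Starting from an empty bit-field the allocator hands out 1, 2, 3, …; once bits 1–31 are all set it returns 0 to signal exhaustion.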
+diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
+new file mode 100644
+index 000000000000..49a4f30ccd4d
+--- /dev/null
++++ b/include/net/mptcp_v6.h
+@@ -0,0 +1,69 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_V6_H
++#define _MPTCP_V6_H
++
++#include <linux/in6.h>
++#include <net/if_inet6.h>
++
++#include <net/mptcp.h>
++
++
++#ifdef CONFIG_MPTCP
++extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
++extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
++extern struct request_sock_ops mptcp6_request_sock_ops;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net);
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem);
++int mptcp_pm_v6_init(void);
++void mptcp_pm_v6_undo(void);
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++
++#else /* CONFIG_MPTCP */
++
++#define mptcp_v6_mapped ipv6_mapped
++
++static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_V6_H */
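[Editorial note: the inline `is_mptcp_enabled()` helper in the mptcp.h hunk above gates MPTCP on a tri-state sysctl: off, on for every socket, or on only for sockets that opted in per-application. A hypothetical standalone model of that decision follows; it is not part of the patch, and the value 2 for `MPTCP_APP` is an assumption made here for illustration — the patch defines the constant elsewhere.]

```c
#include <assert.h>
#include <stdbool.h>

#define MPTCP_APP 2 /* assumed value of the per-application mode */

/* Mirrors the logic of is_mptcp_enabled(): disabled globally or after a
 * failed module init -> false; per-app mode without the socket's opt-in
 * flag -> false; otherwise MPTCP is used for this socket. */
static bool mptcp_enabled_for(int sysctl_mptcp_enabled, bool init_failed,
			      bool sk_opted_in)
{
	if (!sysctl_mptcp_enabled || init_failed)
		return false;
	if (sysctl_mptcp_enabled == MPTCP_APP && !sk_opted_in)
		return false;
	return true;
}
```

So sysctl value 1 enables MPTCP unconditionally, while value 2 requires the socket's own `mptcp_enabled` flag as well.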
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..bae95a11c531 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -16,6 +16,7 @@
+ #include <net/netns/packet.h>
+ #include <net/netns/ipv4.h>
+ #include <net/netns/ipv6.h>
++#include <net/netns/mptcp.h>
+ #include <net/netns/ieee802154_6lowpan.h>
+ #include <net/netns/sctp.h>
+ #include <net/netns/dccp.h>
+@@ -92,6 +93,9 @@ struct net {
+ #if IS_ENABLED(CONFIG_IPV6)
+ struct netns_ipv6 ipv6;
+ #endif
++#if IS_ENABLED(CONFIG_MPTCP)
++ struct netns_mptcp mptcp;
++#endif
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ struct netns_ieee802154_lowpan ieee802154_lowpan;
+ #endif
+diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
+new file mode 100644
+index 000000000000..bad418b04cc8
+--- /dev/null
++++ b/include/net/netns/mptcp.h
+@@ -0,0 +1,44 @@
++/*
++ * MPTCP implementation - MPTCP namespace
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef __NETNS_MPTCP_H__
++#define __NETNS_MPTCP_H__
++
++#include <linux/compiler.h>
++
++enum {
++ MPTCP_PM_FULLMESH = 0,
++ MPTCP_PM_MAX
++};
++
++struct netns_mptcp {
++ void *path_managers[MPTCP_PM_MAX];
++};
++
++#endif /* __NETNS_MPTCP_H__ */
+diff --git a/include/net/request_sock.h b/include/net/request_sock.h
+index 7f830ff67f08..e79e87a8e1a6 100644
+--- a/include/net/request_sock.h
++++ b/include/net/request_sock.h
+@@ -164,7 +164,7 @@ struct request_sock_queue {
+ };
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries);
++ unsigned int nr_table_entries, gfp_t flags);
+
+ void __reqsk_queue_destroy(struct request_sock_queue *queue);
+ void reqsk_queue_destroy(struct request_sock_queue *queue);
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..0e23cae8861f 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
+
+ int sk_wait_data(struct sock *sk, long *timeo);
+
++/* START - needed for MPTCP */
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
++void sock_lock_init(struct sock *sk);
++
++extern struct lock_class_key af_callback_keys[AF_MAX];
++extern char *const af_family_clock_key_strings[AF_MAX+1];
++
++#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
++/* END - needed for MPTCP */
++
+ struct request_sock_ops;
+ struct timewait_sock_ops;
+ struct inet_hashinfo;
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..ff92e74cd684 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TCPOPT_SACK 5 /* SACK Block */
+ #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
+ #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
++#define TCPOPT_MPTCP 30
+ #define TCPOPT_EXP 254 /* Experimental */
+ /* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TFO_SERVER_WO_SOCKOPT1 0x400
+ #define TFO_SERVER_WO_SOCKOPT2 0x800
+
++/* Flags from tcp_input.c for tcp_ack */
++#define FLAG_DATA 0x01 /* Incoming frame contained data. */
++#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
++#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
++#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
++#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
++#define FLAG_DATA_SACKED 0x20 /* New SACK. */
++#define FLAG_ECE 0x40 /* ECE in this ACK */
++#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
++#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
++#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
++#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
++#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
++#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
++#define MPTCP_FLAG_DATA_ACKED 0x8000
++
++#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
++#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
++#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
++#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
++
+ extern struct inet_timewait_death_row tcp_death_row;
+
+ /* sysctl variables for tcp */
+@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+ #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+ #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
++/**** START - Exports needed for MPTCP ****/
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
++
++struct mptcp_options_received;
++
++void tcp_enter_quickack_mode(struct sock *sk);
++int tcp_close_state(struct sock *sk);
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb);
++int tcp_xmit_probe_skb(struct sock *sk, int urgent);
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask);
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle);
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle);
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss);
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++void __pskb_trim_head(struct sk_buff *skb, int len);
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
++void tcp_reset(struct sock *sk);
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin);
++bool tcp_urg_mode(const struct tcp_sock *tp);
++void tcp_ack_probe(struct sock *sk);
++void tcp_rearm_rto(struct sock *sk);
++int tcp_write_timeout(struct sock *sk);
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set);
++void tcp_write_err(struct sock *sk);
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++
++int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc);
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_reqsk_destructor(struct request_sock *req);
++
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
++void tcp_v6_destroy_sock(struct sock *sk);
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
++void tcp_v6_hash(struct sock *sk);
++struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb);
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst);
++void tcp_v6_reqsk_destructor(struct request_sock *req);
++
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
++ int large_allowed);
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
++
++void skb_clone_fraglist(struct sk_buff *skb);
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
++
++void inet_twsk_free(struct inet_timewait_sock *tw);
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
++/* These states need RST on ABORT according to RFC793 */
++static inline bool tcp_need_reset(int state)
++{
++ return (1 << state) &
++ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
++ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
++}
++
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
++ int hlen);
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen);
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
++ struct sk_buff *from, bool *fragstolen);
++/**** END - Exports needed for MPTCP ****/
++
+ void tcp_tasklet_init(void);
+
+ void tcp_v4_err(struct sk_buff *skb, u32);
+@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ size_t len, int nonblock, int flags, int *addr_len);
+ void tcp_parse_options(const struct sk_buff *skb,
+ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt_rx,
+ int estab, struct tcp_fastopen_cookie *foc);
+ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+
+@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+
+ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ u16 *mssp);
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
+-#else
+-static inline __u32 cookie_v4_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
+ #endif
+
+ __u32 cookie_init_timestamp(struct request_sock *req);
+@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+ const struct tcphdr *th, u16 *mssp);
+ __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
+ __u16 *mss);
+-#else
+-static inline __u32 cookie_v6_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
+ #endif
+ /* tcp_output.c */
+
+@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+ void tcp_send_loss_probe(struct sock *sk);
+ bool tcp_schedule_loss_probe(struct sock *sk);
+
++u16 tcp_select_window(struct sock *sk);
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++
+ /* tcp_input.c */
+ void tcp_resume_early_retransmit(struct sock *sk);
+ void tcp_rearm_rto(struct sock *sk);
+ void tcp_reset(struct sock *sk);
++void tcp_set_rto(struct sock *sk);
++bool tcp_should_expand_sndbuf(const struct sock *sk);
++bool tcp_prune_ofo_queue(struct sock *sk);
+
+ /* tcp_timer.c */
+ void tcp_init_xmit_timers(struct sock *);
+@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+ */
+ struct tcp_skb_cb {
+ union {
+- struct inet_skb_parm h4;
++ union {
++ struct inet_skb_parm h4;
+ #if IS_ENABLED(CONFIG_IPV6)
+- struct inet6_skb_parm h6;
++ struct inet6_skb_parm h6;
+ #endif
+- } header; /* For incoming frames */
++ } header; /* For incoming frames */
++#ifdef CONFIG_MPTCP
++ union { /* For MPTCP outgoing frames */
++ __u32 path_mask; /* paths that tried to send this skb */
++ __u32 dss[6]; /* DSS options */
++ };
++#endif
++ };
+ __u32 seq; /* Starting sequence number */
+ __u32 end_seq; /* SEQ + FIN + SYN + datalen */
+ __u32 when; /* used to compute rtt's */
++#ifdef CONFIG_MPTCP
++ __u8 mptcp_flags; /* flags for the MPTCP layer */
++ __u8 dss_off; /* Number of 4-byte words until
++ * seq-number */
++#endif
+ __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+
+ __u8 sacked; /* State flags for SACK/FACK. */
+@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+ /* Determine a window scaling and initial window to offer. */
+ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+ __u32 *window_clamp, int wscale_ok,
+- __u8 *rcv_wscale, __u32 init_rcv_wnd);
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
+
+ static inline int tcp_win_from_space(int space)
+ {
+@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+ space - (space>>sysctl_tcp_adv_win_scale);
+ }
+
++#ifdef CONFIG_MPTCP
++extern struct static_key mptcp_static_key;
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return static_key_false(&mptcp_static_key) && tp->mpc;
++}
++#else
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++#endif
++
+ /* Note: caller must be prepared to deal with negative returns */
+ static inline int tcp_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf -
+ atomic_read(&sk->sk_rmem_alloc));
+ }
+
+ static inline int tcp_full_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf);
+ }
+
+@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+ ireq->wscale_ok = rx_opt->wscale_ok;
+ ireq->acked = 0;
+ ireq->ecn_ok = 0;
++ ireq->mptcp_rqsk = 0;
++ ireq->saw_mpc = 0;
+ ireq->ir_rmt_port = tcp_hdr(skb)->source;
+ ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
+ }
+@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+ void tcp4_proc_exit(void);
+ #endif
+
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb);
++
+ /* TCP af-specific functions */
+ struct tcp_sock_af_ops {
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+ #endif
+ };
+
++/* TCP/MPTCP-specific functions */
++struct tcp_sock_ops {
++ u32 (*__select_window)(struct sock *sk);
++ u16 (*select_window)(struct sock *sk);
++ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++ void (*init_buffer_space)(struct sock *sk);
++ void (*set_rto)(struct sock *sk);
++ bool (*should_expand_sndbuf)(const struct sock *sk);
++ void (*send_fin)(struct sock *sk);
++ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++ void (*send_active_reset)(struct sock *sk, gfp_t priority);
++ int (*write_wakeup)(struct sock *sk);
++ bool (*prune_ofo_queue)(struct sock *sk);
++ void (*retransmit_timer)(struct sock *sk);
++ void (*time_wait)(struct sock *sk, int state, int timeo);
++ void (*cleanup_rbuf)(struct sock *sk, int copied);
++ void (*init_congestion_control)(struct sock *sk);
++};
++extern const struct tcp_sock_ops tcp_specific;
++
+ struct tcp_request_sock_ops {
++ u16 mss_clamp;
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
+ struct request_sock *req);
+@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+ const struct request_sock *req,
+ const struct sk_buff *skb);
+ #endif
++ int (*init_req)(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb);
++#ifdef CONFIG_SYN_COOKIES
++ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
++#endif
++ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict);
++ __u32 (*init_seq)(const struct sk_buff *skb);
++ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
++ const unsigned long timeout);
+ };
+
++#ifdef CONFIG_SYN_COOKIES
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return ops->cookie_init_seq(sk, skb, mss);
++}
++#else
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return 0;
++}
++#endif
++
+ int tcpv4_offload_init(void);
+
+ void tcp_v4_init(void);
+diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
+index 9cf2394f0bcf..c2634b6ed854 100644
+--- a/include/uapi/linux/if.h
++++ b/include/uapi/linux/if.h
+@@ -109,6 +109,9 @@ enum net_device_flags {
+ #define IFF_DORMANT IFF_DORMANT
+ #define IFF_ECHO IFF_ECHO
+
++#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
++#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
++
+ #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
+ IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
+
+diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
+index 3b9718328d8b..487475681d84 100644
+--- a/include/uapi/linux/tcp.h
++++ b/include/uapi/linux/tcp.h
+@@ -112,6 +112,7 @@ enum {
+ #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
+ #define TCP_TIMESTAMP 24
+ #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
++#define MPTCP_ENABLED 26
+
+ struct tcp_repair_opt {
+ __u32 opt_code;
+diff --git a/net/Kconfig b/net/Kconfig
+index d92afe4204d9..96b58593ad5e 100644
+--- a/net/Kconfig
++++ b/net/Kconfig
+@@ -79,6 +79,7 @@ if INET
+ source "net/ipv4/Kconfig"
+ source "net/ipv6/Kconfig"
+ source "net/netlabel/Kconfig"
++source "net/mptcp/Kconfig"
+
+ endif # if INET
+
+diff --git a/net/Makefile b/net/Makefile
+index cbbbe6d657ca..244bac1435b1 100644
+--- a/net/Makefile
++++ b/net/Makefile
+@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
+ obj-$(CONFIG_XFRM) += xfrm/
+ obj-$(CONFIG_UNIX) += unix/
+ obj-$(CONFIG_NET) += ipv6/
++obj-$(CONFIG_MPTCP) += mptcp/
+ obj-$(CONFIG_PACKET) += packet/
+ obj-$(CONFIG_NET_KEY) += key/
+ obj-$(CONFIG_BRIDGE) += bridge/
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..215d2757fbf6 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
+
+ dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
+ IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
+- IFF_AUTOMEDIA)) |
++ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
+ (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
+ IFF_ALLMULTI));
+
+diff --git a/net/core/request_sock.c b/net/core/request_sock.c
+index 467f326126e0..909dfa13f499 100644
+--- a/net/core/request_sock.c
++++ b/net/core/request_sock.c
+@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
+ EXPORT_SYMBOL(sysctl_max_syn_backlog);
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries)
++ unsigned int nr_table_entries,
++ gfp_t flags)
+ {
+ size_t lopt_size = sizeof(struct listen_sock);
+ struct listen_sock *lopt;
+@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
+ nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
+ lopt_size += nr_table_entries * sizeof(struct request_sock *);
+ if (lopt_size > PAGE_SIZE)
+- lopt = vzalloc(lopt_size);
++ lopt = __vmalloc(lopt_size,
++ flags | __GFP_HIGHMEM | __GFP_ZERO,
++ PAGE_KERNEL);
+ else
+- lopt = kzalloc(lopt_size, GFP_KERNEL);
++ lopt = kzalloc(lopt_size, flags);
+ if (lopt == NULL)
+ return -ENOMEM;
+
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..8abc5d60fbe3 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
+ skb_drop_list(&skb_shinfo(skb)->frag_list);
+ }
+
+-static void skb_clone_fraglist(struct sk_buff *skb)
++void skb_clone_fraglist(struct sk_buff *skb)
+ {
+ struct sk_buff *list;
+
+@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
+ skb->inner_mac_header += off;
+ }
+
+-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+ {
+ __copy_skb_header(new, old);
+
+diff --git a/net/core/sock.c b/net/core/sock.c
+index 026e01f70274..359295523177 100644
+--- a/net/core/sock.c
++++ b/net/core/sock.c
+@@ -136,6 +136,11 @@
+
+ #include <trace/events/sock.h>
+
++#ifdef CONFIG_MPTCP
++#include <net/mptcp.h>
++#include <net/inet_common.h>
++#endif
++
+ #ifdef CONFIG_INET
+ #include <net/tcp.h>
+ #endif
+@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
+ "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
+ "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
+ };
+-static const char *const af_family_clock_key_strings[AF_MAX+1] = {
++char *const af_family_clock_key_strings[AF_MAX+1] = {
+ "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
+ "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
+ "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
+@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
+ * sk_callback_lock locking rules are per-address-family,
+ * so split the lock classes by using a per-AF key:
+ */
+-static struct lock_class_key af_callback_keys[AF_MAX];
++struct lock_class_key af_callback_keys[AF_MAX];
+
+ /* Take into consideration the size of the struct sk_buff overhead in the
+ * determination of these values, since that is non-constant across
+@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
+ }
+ }
+
+-#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
+-
+ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
+ {
+ if (sk->sk_flags & flags) {
+@@ -1253,8 +1256,25 @@ lenout:
+ *
+ * (We also register the sk_lock with the lock validator.)
+ */
+-static inline void sock_lock_init(struct sock *sk)
+-{
++void sock_lock_init(struct sock *sk)
++{
++#ifdef CONFIG_MPTCP
++ /* Reclassify the lock-class for subflows */
++ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
++ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
++ &meta_slock_key,
++ "sk_lock-AF_INET-MPTCP",
++ &meta_key);
++
++ /* We don't yet have the mptcp-point.
++ * Thus we still need inet_sock_destruct
++ */
++ sk->sk_destruct = inet_sock_destruct;
++ return;
++ }
++#endif
++
+ sock_lock_init_class_and_name(sk,
+ af_family_slock_key_strings[sk->sk_family],
+ af_family_slock_keys + sk->sk_family,
+@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
+ }
+ EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
+
+-static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
+ int family)
+ {
+ struct sock *sk;
+diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
+index 4db3c2a1679c..04cb17d4b0ce 100644
+--- a/net/dccp/ipv6.c
++++ b/net/dccp/ipv6.c
+@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+ goto drop;
+
+- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
++ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
+ if (req == NULL)
+ goto drop;
+
+diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
+index 05c57f0fcabe..630434db0085 100644
+--- a/net/ipv4/Kconfig
++++ b/net/ipv4/Kconfig
+@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+ For further details see:
+ http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+
++config TCP_CONG_COUPLED
++ tristate "MPTCP COUPLED CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Coupled Congestion Control
++ To enable it, just put 'coupled' in tcp_congestion_control
++
++config TCP_CONG_OLIA
++ tristate "MPTCP Opportunistic Linked Increase"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Opportunistic Linked Increase Congestion Control
++ To enable it, just put 'olia' in tcp_congestion_control
++
++config TCP_CONG_WVEGAS
++ tristate "MPTCP WVEGAS CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ wVegas congestion control for MPTCP
++ To enable it, just put 'wvegas' in tcp_congestion_control
++
+ choice
+ prompt "Default TCP congestion control"
+ default DEFAULT_CUBIC
+@@ -584,6 +608,15 @@ choice
+ config DEFAULT_WESTWOOD
+ bool "Westwood" if TCP_CONG_WESTWOOD=y
+
++ config DEFAULT_COUPLED
++ bool "Coupled" if TCP_CONG_COUPLED=y
++
++ config DEFAULT_OLIA
++ bool "Olia" if TCP_CONG_OLIA=y
++
++ config DEFAULT_WVEGAS
++ bool "Wvegas" if TCP_CONG_WVEGAS=y
++
+ config DEFAULT_RENO
+ bool "Reno"
+
+@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+ default "vegas" if DEFAULT_VEGAS
+ default "westwood" if DEFAULT_WESTWOOD
+ default "veno" if DEFAULT_VENO
++ default "coupled" if DEFAULT_COUPLED
++ default "wvegas" if DEFAULT_WVEGAS
+ default "reno" if DEFAULT_RENO
+ default "cubic"
+
+diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
+index d156b3c5f363..4afd6d8d9028 100644
+--- a/net/ipv4/af_inet.c
++++ b/net/ipv4/af_inet.c
+@@ -104,6 +104,7 @@
+ #include <net/ip_fib.h>
+ #include <net/inet_connection_sock.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/ping.h>
+@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
+ * Create an inet socket.
+ */
+
+-static int inet_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct sock *sk;
+ struct inet_protosw *answer;
+@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
+ lock_sock(sk2);
+
+ sock_rps_record_flow(sk2);
++
++ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
++ struct sock *sk_it = sk2;
++
++ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++
++ if (tcp_sk(sk2)->mpcb->master_sk) {
++ sk_it = tcp_sk(sk2)->mpcb->master_sk;
++
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_it->sk_wq = newsock->wq;
++ sk_it->sk_socket = newsock;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++ }
++
+ WARN_ON(!((1 << sk2->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_SYN_RECV |
+ TCPF_CLOSE_WAIT | TCPF_CLOSE)));
+@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
+
+ ip_init();
+
++ /* We must initialize MPTCP before TCP. */
++ mptcp_init();
++
+ tcp_v4_init();
+
+ /* Setup TCP slab cache for open requests. */
+diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
+index 14d02ea905b6..7d734d8af19b 100644
+--- a/net/ipv4/inet_connection_sock.c
++++ b/net/ipv4/inet_connection_sock.c
+@@ -23,6 +23,7 @@
+ #include <net/route.h>
+ #include <net/tcp_states.h>
+ #include <net/xfrm.h>
++#include <net/mptcp.h>
+
+ #ifdef INET_CSK_DEBUG
+ const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
+@@ -465,8 +466,8 @@ no_route:
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
+
+-static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize)
+ {
+ return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
+ }
+@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
+
+ lopt->clock_hand = i;
+
+- if (lopt->qlen)
++ if (lopt->qlen && !is_meta_sk(parent))
+ inet_csk_reset_keepalive_timer(parent, interval);
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
+@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority)
+ {
+- struct sock *newsk = sk_clone_lock(sk, priority);
++ struct sock *newsk;
++
++ newsk = sk_clone_lock(sk, priority);
+
+ if (newsk != NULL) {
+ struct inet_connection_sock *newicsk = inet_csk(newsk);
+@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
++ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
++ GFP_KERNEL);
+
+ if (rc != 0)
+ return rc;
+@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ while ((req = acc_req) != NULL) {
+ struct sock *child = req->sk;
++ bool mutex_taken = false;
+
+ acc_req = req->dl_next;
+
++ if (is_meta_sk(child)) {
++ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
++ mutex_taken = true;
++ }
+ local_bh_disable();
+ bh_lock_sock(child);
+ WARN_ON(sock_owned_by_user(child));
+@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ bh_unlock_sock(child);
+ local_bh_enable();
++ if (mutex_taken)
++ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
+ sock_put(child);
+
+ sk_acceptq_removed(sk);
+diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
+index c86624b36a62..0ff3fe004d62 100644
+--- a/net/ipv4/syncookies.c
++++ b/net/ipv4/syncookies.c
+@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ }
+ EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
+
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mssp)
+ {
+ const struct iphdr *iph = ip_hdr(skb);
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+ /* Try to redo what tcp_v4_send_synack did. */
+ req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(&rt->dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(&rt->dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..2cb89f886d45 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -271,6 +271,7 @@
+
+ #include <net/icmp.h>
+ #include <net/inet_common.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/xfrm.h>
+ #include <net/ip.h>
+@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+ return period;
+ }
+
++const struct tcp_sock_ops tcp_specific = {
++ .__select_window = __tcp_select_window,
++ .select_window = tcp_select_window,
++ .select_initial_window = tcp_select_initial_window,
++ .init_buffer_space = tcp_init_buffer_space,
++ .set_rto = tcp_set_rto,
++ .should_expand_sndbuf = tcp_should_expand_sndbuf,
++ .init_congestion_control = tcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
+ /* Address-family independent initialization for a tcp_sock.
+ *
+ * NOTE: A lot of things set to zero explicitly by call to
+@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+ sk->sk_sndbuf = sysctl_tcp_wmem[1];
+ sk->sk_rcvbuf = sysctl_tcp_rmem[1];
+
++ tp->ops = &tcp_specific;
++
+ local_bh_disable();
+ sock_update_memcg(sk);
+ sk_sockets_allocated_inc(sk);
+@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ int ret;
+
+ sock_rps_record_flow(sk);
++
++#ifdef CONFIG_MPTCP
++ if (mptcp(tcp_sk(sk))) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
+ /*
+ * We can't seek on a socket input
+ */
+@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+ return NULL;
+ }
+
+-static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
+- int large_allowed)
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 xmit_size_goal, old_size_goal;
+@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+ {
+ int mss_now;
+
+- mss_now = tcp_current_mss(sk);
+- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ if (mptcp(tcp_sk(sk))) {
++ mss_now = mptcp_current_mss(sk);
++ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ } else {
++ mss_now = tcp_current_mss(sk);
++ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ }
+
+ return mss_now;
+ }
+@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto out_err;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++
++ /* We must check this with socket-lock hold because we iterate
++ * over the subflows.
++ */
++ if (!mptcp_can_sendpage(sk)) {
++ ssize_t ret;
++
++ release_sock(sk);
++ ret = sock_no_sendpage(sk->sk_socket, page, offset,
++ size, flags);
++ lock_sock(sk);
++ return ret;
++ }
++
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ mss_now = tcp_send_mss(sk, &size_goal, flags);
+@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+ {
+ ssize_t res;
+
+- if (!(sk->sk_route_caps & NETIF_F_SG) ||
+- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
++ /* If MPTCP is enabled, we check it later after establishment */
++ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
++ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
+ return sock_no_sendpage(sk->sk_socket, page, offset, size,
+ flags);
+
+@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+ const struct tcp_sock *tp = tcp_sk(sk);
+ int tmp = tp->mss_cache;
+
++ if (mptcp(tp))
++ return mptcp_select_size(sk, sg);
++
+ if (sg) {
+ if (sk_can_gso(sk)) {
+ /* Small frames wont use a full page:
+@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto do_error;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ if (unlikely(tp->repair)) {
+ if (tp->repair_queue == TCP_RECV_QUEUE) {
+ copied = tcp_send_rcvq(sk, msg, size);
+@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
+ goto out_err;
+
+- sg = !!(sk->sk_route_caps & NETIF_F_SG);
++ if (mptcp(tp))
++ sg = mptcp_can_sg(sk);
++ else
++ sg = !!(sk->sk_route_caps & NETIF_F_SG);
+
+ while (--iovlen >= 0) {
+ size_t seglen = iov->iov_len;
+@@ -1183,8 +1251,15 @@ new_segment:
+
+ /*
+ * Check whether we can use HW checksum.
++ *
++ * If dss-csum is enabled, we do not do hw-csum.
++ * In case of non-mptcp we check the
++ * device-capabilities.
++ * In case of mptcp, hw-csum's will be handled
++ * later in mptcp_write_xmit.
+ */
+- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
++ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
++ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ skb_entail(sk, skb);
+@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+
+ /* Optimize, __tcp_select_window() is not cheap. */
+ if (2*rcv_window_now <= tp->window_clamp) {
+- __u32 new_window = __tcp_select_window(sk);
++ __u32 new_window = tp->ops->__select_window(sk);
+
+ /* Send ACK now, if this read freed lots of space
+ * in our buffer. Certainly, new_window is new window.
+@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ /* Clean up data we have read: This will do ACK frames. */
+ if (copied > 0) {
+ tcp_recv_skb(sk, seq, &offset);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ }
+ return copied;
+ }
+@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+
+ lock_sock(sk);
+
++#ifdef CONFIG_MPTCP
++ if (mptcp(tp)) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
++
+ err = -ENOTCONN;
+ if (sk->sk_state == TCP_LISTEN)
+ goto out;
+@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ }
+ }
+
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+ /* Install new reader */
+@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (tp->rcv_wnd == 0 &&
+ !skb_queue_empty(&sk->sk_async_wait_queue)) {
+ tcp_service_net_dma(sk, true);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ } else
+ dma_async_issue_pending(tp->ucopy.dma_chan);
+ }
+@@ -1993,7 +2076,7 @@ skip_copy:
+ */
+
+ /* Clean up data we have read: This will do ACK frames. */
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ release_sock(sk);
+ return copied;
+@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+ /* TCP_CLOSING */ TCP_CLOSING,
+ };
+
+-static int tcp_close_state(struct sock *sk)
++int tcp_close_state(struct sock *sk)
+ {
+ int next = (int)new_state[sk->sk_state];
+ int ns = next & TCP_STATE_MASK;
+@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+ TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
+ /* Clear out any half completed packets. FIN if needed. */
+ if (tcp_close_state(sk))
+- tcp_send_fin(sk);
++ tcp_sk(sk)->ops->send_fin(sk);
+ }
+ }
+ EXPORT_SYMBOL(tcp_shutdown);
+@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+ int data_was_unread = 0;
+ int state;
+
++ if (is_meta_sk(sk)) {
++ mptcp_close(sk, timeout);
++ return;
++ }
++
+ lock_sock(sk);
+ sk->sk_shutdown = SHUTDOWN_MASK;
+
+@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+ /* Unread data was tossed, zap the connection. */
+ NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, sk->sk_allocation);
++ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
+ } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
+ /* Check zero linger _after_ checking for unread data. */
+ sk->sk_prot->disconnect(sk, 0);
+@@ -2247,7 +2335,7 @@ adjudge_to_death:
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (tp->linger2 < 0) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONLINGER);
+ } else {
+@@ -2257,7 +2345,8 @@ adjudge_to_death:
+ inet_csk_reset_keepalive_timer(sk,
+ tmo - TCP_TIMEWAIT_LEN);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
++ tmo);
+ goto out;
+ }
+ }
+@@ -2266,7 +2355,7 @@ adjudge_to_death:
+ sk_mem_reclaim(sk);
+ if (tcp_check_oom(sk, 0)) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONMEMORY);
+ }
+@@ -2291,15 +2380,6 @@ out:
+ }
+ EXPORT_SYMBOL(tcp_close);
+
+-/* These states need RST on ABORT according to RFC793 */
+-
+-static inline bool tcp_need_reset(int state)
+-{
+- return (1 << state) &
+- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
+- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
+-}
+-
+ int tcp_disconnect(struct sock *sk, int flags)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+ /* The last check adjusts for discrepancy of Linux wrt. RFC
+ * states
+ */
+- tcp_send_active_reset(sk, gfp_any());
++ tp->ops->send_active_reset(sk, gfp_any());
+ sk->sk_err = ECONNRESET;
+ } else if (old_state == TCP_SYN_SENT)
+ sk->sk_err = ECONNRESET;
+@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+ if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+ inet_reset_saddr(sk);
+
++ if (is_meta_sk(sk)) {
++ mptcp_disconnect(sk);
++ } else {
++ if (tp->inside_tk_table)
++ mptcp_hash_remove_bh(tp);
++ }
++
+ sk->sk_shutdown = 0;
+ sock_reset_flag(sk, SOCK_DONE);
+ tp->srtt_us = 0;
+@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ break;
+
+ case TCP_DEFER_ACCEPT:
++ /* An established MPTCP-connection (mptcp(tp) only returns true
++ * if the socket is established) should not use DEFER on new
++ * subflows.
++ */
++ if (mptcp(tp))
++ break;
+ /* Translate value in seconds to number of retransmits */
+ icsk->icsk_accept_queue.rskq_defer_accept =
+ secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
+ inet_csk_ack_scheduled(sk)) {
+ icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
+- tcp_cleanup_rbuf(sk, 1);
++ tp->ops->cleanup_rbuf(sk, 1);
+ if (!(val & 1))
+ icsk->icsk_ack.pingpong = 1;
+ }
+@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ tp->notsent_lowat = val;
+ sk->sk_write_space(sk);
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
++ if (val)
++ tp->mptcp_enabled = 1;
++ else
++ tp->mptcp_enabled = 0;
++ } else {
++ err = -EPERM;
++ }
++ break;
++#endif
+ default:
+ err = -ENOPROTOOPT;
+ break;
+@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+ case TCP_NOTSENT_LOWAT:
+ val = tp->notsent_lowat;
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ val = tp->mptcp_enabled;
++ break;
++#endif
+ default:
+ return -ENOPROTOOPT;
+ }
+@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+ if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
+
++ WARN_ON(sk->sk_state == TCP_CLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
++
+ tcp_clear_xmit_timers(sk);
++
+ if (req != NULL)
+ reqsk_fastopen_remove(sk, req, false);
+
+diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
+index 9771563ab564..5c230d96c4c1 100644
+--- a/net/ipv4/tcp_fastopen.c
++++ b/net/ipv4/tcp_fastopen.c
+@@ -7,6 +7,7 @@
+ #include <linux/rculist.h>
+ #include <net/inetpeer.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+
+ int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
+
+@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ {
+ struct tcp_sock *tp;
+ struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
+- struct sock *child;
++ struct sock *child, *meta_sk;
+
+ req->num_retrans = 0;
+ req->num_timeout = 0;
+@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ /* Add the child socket directly into the accept queue */
+ inet_csk_reqsk_queue_add(sk, req, child);
+
+- /* Now finish processing the fastopen child socket. */
+- inet_csk(child)->icsk_af_ops->rebuild_header(child);
+- tcp_init_congestion_control(child);
+- tcp_mtup_init(child);
+- tcp_init_metrics(child);
+- tcp_init_buffer_space(child);
+-
+ /* Queue the data carried in the SYN packet. We need to first
+ * bump skb's refcnt because the caller will attempt to free it.
+ *
+@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ tp->syn_data_acked = 1;
+ }
+ tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++
++ meta_sk = child;
++ if (!mptcp_check_req_fastopen(meta_sk, req)) {
++ child = tcp_sk(meta_sk)->mpcb->master_sk;
++ tp = tcp_sk(child);
++ }
++
++ /* Now finish processing the fastopen child socket. */
++ inet_csk(child)->icsk_af_ops->rebuild_header(child);
++ tp->ops->init_congestion_control(child);
++ tcp_mtup_init(child);
++ tcp_init_metrics(child);
++ tp->ops->init_buffer_space(child);
++
+ sk->sk_data_ready(sk);
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ WARN_ON(req->sk == NULL);
+ return true;
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..3273bb69f387 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -74,6 +74,9 @@
+ #include <linux/ipsec.h>
+ #include <asm/unaligned.h>
+ #include <net/netdma.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
+
+ int sysctl_tcp_timestamps __read_mostly = 1;
+ int sysctl_tcp_window_scaling __read_mostly = 1;
+@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
+ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
+ int sysctl_tcp_early_retrans __read_mostly = 3;
+
+-#define FLAG_DATA 0x01 /* Incoming frame contained data. */
+-#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
+-#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
+-#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
+-#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
+-#define FLAG_DATA_SACKED 0x20 /* New SACK. */
+-#define FLAG_ECE 0x40 /* ECE in this ACK */
+-#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
+-#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
+-#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
+-#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
+-#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
+-#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
+-
+-#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
+-#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
+-#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
+-#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
+-
+ #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
+ #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
+
+@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
+ icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+ }
+
+-static void tcp_enter_quickack_mode(struct sock *sk)
++void tcp_enter_quickack_mode(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ tcp_incr_quickack(sk);
+@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ per_mss = roundup_pow_of_two(per_mss) +
+ SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
+- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ if (mptcp(tp)) {
++ nr_segs = mptcp_check_snd_buf(tp);
++ } else {
++ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
++ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ }
+
+ /* Fast Recovery (RFC 5681 3.2) :
+ * Cubic needs 1.7 factor, rounded to 2 to include
+@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ */
+ sndmem = 2 * nr_segs * per_mss;
+
+- if (sk->sk_sndbuf < sndmem)
++ /* MPTCP: after this sndmem is the new contribution of the
++ * current subflow to the aggregated sndbuf */
++ if (sk->sk_sndbuf < sndmem) {
++ int old_sndbuf = sk->sk_sndbuf;
+ sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
++ /* MPTCP: ok, the subflow sndbuf has grown, reflect
++ * this in the aggregate buffer. */
++ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
++ mptcp_update_sndbuf(tp);
++ }
+ }
+
+ /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
+@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
+ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
+
+ /* Check #1 */
+- if (tp->rcv_ssthresh < tp->window_clamp &&
+- (int)tp->rcv_ssthresh < tcp_space(sk) &&
++ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
+ !sk_under_memory_pressure(sk)) {
+ int incr;
+
+@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ * will fit to rcvbuf in future.
+ */
+ if (tcp_win_from_space(skb->truesize) <= skb->len)
+- incr = 2 * tp->advmss;
++ incr = 2 * meta_tp->advmss;
+ else
+- incr = __tcp_grow_window(sk, skb);
++ incr = __tcp_grow_window(meta_sk, skb);
+
+ if (incr) {
+ incr = max_t(int, incr, 2 * skb->len);
+- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
+- tp->window_clamp);
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
++ meta_tp->window_clamp);
+ inet_csk(sk)->icsk_ack.quick |= 1;
+ }
+ }
+@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
+ int copied;
+
+ time = tcp_time_stamp - tp->rcvq_space.time;
+- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
++ if (mptcp(tp)) {
++ if (mptcp_check_rtt(tp, time))
++ return;
++ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+ return;
+
+ /* Number of bytes copied to user in last RTT */
+@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
+ /* Calculate rto without backoff. This is the second half of Van Jacobson's
+ * routine referred to above.
+ */
+-static void tcp_set_rto(struct sock *sk)
++void tcp_set_rto(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ /* Old crap is replaced with new one. 8)
+@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
+ int len;
+ int in_sack;
+
+- if (!sk_can_gso(sk))
++ /* For MPTCP we cannot shift skb-data and remove one skb from the
+ * send-queue, because this will make us lose the DSS-option (which
++ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
++ */
++ if (!sk_can_gso(sk) || mptcp(tp))
+ goto fallback;
+
+ /* Normally R but no L won't result in plain S */
+@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
+ return false;
+
+ tcp_rtt_estimator(sk, seq_rtt_us);
+- tcp_set_rto(sk);
++ tp->ops->set_rto(sk);
+
+ /* RFC6298: only reset backoff on valid RTT measurement. */
+ inet_csk(sk)->icsk_backoff = 0;
+@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
+ }
+
+ /* If we get here, the whole TSO packet has not been acked. */
+-static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 packets_acked;
+@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ */
+ if (!(scb->tcp_flags & TCPHDR_SYN)) {
+ flag |= FLAG_DATA_ACKED;
++ if (mptcp(tp) && mptcp_is_data_seq(skb))
++ flag |= MPTCP_FLAG_DATA_ACKED;
+ } else {
+ flag |= FLAG_SYN_ACKED;
+ tp->retrans_stamp = 0;
+@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ return flag;
+ }
+
+-static void tcp_ack_probe(struct sock *sk)
++void tcp_ack_probe(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
+ /* Check that window update is acceptable.
+ * The function assumes that snd_una<=ack<=snd_next.
+ */
+-static inline bool tcp_may_update_window(const struct tcp_sock *tp,
+- const u32 ack, const u32 ack_seq,
+- const u32 nwin)
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin)
+ {
+ return after(ack, tp->snd_una) ||
+ after(ack_seq, tp->snd_wl1) ||
+@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
+ }
+
+ /* This routine deals with incoming acks, but not outgoing ones. */
+-static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
++static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ sack_rtt_us);
+ acked -= tp->packets_out;
+
++ if (mptcp(tp)) {
++ if (mptcp_fallback_infinite(sk, flag)) {
++ pr_err("%s resetting flow\n", __func__);
++ mptcp_send_reset(sk);
++ goto invalid_ack;
++ }
++
++ mptcp_clean_rtx_infinite(skb, sk);
++ }
++
+ /* Advance cwnd if state allows */
+ if (tcp_may_raise_cwnd(sk, flag))
+ tcp_cong_avoid(sk, ack, acked);
+@@ -3512,8 +3528,9 @@ old_ack:
+ * the fast version below fails.
+ */
+ void tcp_parse_options(const struct sk_buff *skb,
+- struct tcp_options_received *opt_rx, int estab,
+- struct tcp_fastopen_cookie *foc)
++ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt,
++ int estab, struct tcp_fastopen_cookie *foc)
+ {
+ const unsigned char *ptr;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
+ */
+ break;
+ #endif
++ case TCPOPT_MPTCP:
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ break;
+ case TCPOPT_EXP:
+ /* Fast Open option shares code 254 using a
+ * 16 bits magic number. It's valid only in
+@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
+ if (tcp_parse_aligned_timestamp(tp, th))
+ return true;
+ }
+-
+- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
++ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
++ 1, NULL);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst_metric(dst, RTAX_QUICKACK))
+ inet_csk(sk)->icsk_ack.pingpong = 1;
++ if (mptcp(tp))
++ mptcp_sub_close_passive(sk);
+ break;
+
+ case TCP_CLOSE_WAIT:
+@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
+ tcp_set_state(sk, TCP_CLOSING);
+ break;
+ case TCP_FIN_WAIT2:
++ if (mptcp(tp)) {
++ /* The socket will get closed by mptcp_data_ready.
++ * We first have to process all data-sequences.
++ */
++ tp->close_it = 1;
++ break;
++ }
+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
+ tcp_send_ack(sk);
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ break;
+ default:
+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
+@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
+ if (!sock_flag(sk, SOCK_DEAD)) {
+ sk->sk_state_change(sk);
+
++ /* Don't wake up MPTCP-subflows */
++ if (mptcp(tp))
++ return;
++
+ /* Do not send POLL_HUP for half duplex close. */
+ if (sk->sk_shutdown == SHUTDOWN_MASK ||
+ sk->sk_state == TCP_CLOSE)
+@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
+ }
+
+- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
++ /* In case of MPTCP, the segment may be empty if it's a
++ * non-data DATA_FIN. (see beginning of tcp_data_queue)
++ */
++ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
++ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
+ SOCK_DEBUG(sk, "ofo packet was already received\n");
+ __skb_unlink(skb, &tp->out_of_order_queue);
+ __kfree_skb(skb);
+@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
+ }
+ }
+
+-static bool tcp_prune_ofo_queue(struct sock *sk);
+ static int tcp_prune_queue(struct sock *sk);
+
+ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ unsigned int size)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = mptcp_meta_sk(sk);
++
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+ !sk_rmem_schedule(sk, skb, size)) {
+
+@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size)) {
+- if (!tcp_prune_ofo_queue(sk))
++ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size))
+@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns true if caller should free @from instead of queueing it
+ */
+-static bool tcp_try_coalesce(struct sock *sk,
+- struct sk_buff *to,
+- struct sk_buff *from,
+- bool *fragstolen)
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
++ bool *fragstolen)
+ {
+ int delta;
+
+ *fragstolen = false;
+
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ return false;
++
+ if (tcp_hdr(from)->fin)
+ return false;
+
+@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+
+ /* Do skb overlap to previous one? */
+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
++ !(mptcp(tp) && end_seq == seq)) {
+ /* All the bits are present. Drop. */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
+ __kfree_skb(skb);
+@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+ end_seq);
+ break;
+ }
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
++ continue;
+ __skb_unlink(skb1, &tp->out_of_order_queue);
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+ TCP_SKB_CB(skb1)->end_seq);
+@@ -4280,8 +4325,8 @@ end:
+ }
+ }
+
+-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+- bool *fragstolen)
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen)
+ {
+ int eaten;
+ struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
+ int eaten = -1;
+ bool fragstolen = false;
+
+- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
++ /* If no data is present, but a data_fin is in the options, we still
++ * have to call mptcp_queue_skb later on. */
++ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
++ !(mptcp(tp) && mptcp_is_data_fin(skb)))
+ goto drop;
+
+ skb_dst_drop(skb);
+@@ -4389,7 +4437,7 @@ queue_and_out:
+ eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+ }
+ tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+- if (skb->len)
++ if (skb->len || mptcp_is_data_fin(skb))
+ tcp_event_data_recv(sk, skb);
+ if (th->fin)
+ tcp_fin(sk);
+@@ -4411,7 +4459,11 @@ queue_and_out:
+
+ if (eaten > 0)
+ kfree_skb_partial(skb, fragstolen);
+- if (!sock_flag(sk, SOCK_DEAD))
++ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
++ /* MPTCP: we always have to call data_ready, because
++ * we may be about to receive a data-fin, which still
++ * must get queued.
++ */
+ sk->sk_data_ready(sk);
+ return;
+ }
+@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
+ next = skb_queue_next(list, skb);
+
+ __skb_unlink(skb, list);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
+ __kfree_skb(skb);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
+
+@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
+ * Purge the out-of-order queue.
+ * Return true if queue was pruned.
+ */
+-static bool tcp_prune_ofo_queue(struct sock *sk)
++bool tcp_prune_ofo_queue(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool res = false;
+@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
+ /* Collapsing did not help, destructive actions follow.
+ * This must not ever occur. */
+
+- tcp_prune_ofo_queue(sk);
++ tp->ops->prune_ofo_queue(sk);
+
+ if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+ return 0;
+@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
+ return -1;
+ }
+
+-static bool tcp_should_expand_sndbuf(const struct sock *sk)
++/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
++ * As additional protections, we do not touch cwnd in retransmission phases,
++ * and if application hit its sndbuf limit recently.
++ */
++void tcp_cwnd_application_limited(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
++ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
++ /* Limited by application or receiver window. */
++ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
++ u32 win_used = max(tp->snd_cwnd_used, init_win);
++ if (win_used < tp->snd_cwnd) {
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
++ }
++ tp->snd_cwnd_used = 0;
++ }
++ tp->snd_cwnd_stamp = tcp_time_stamp;
++}
++
++bool tcp_should_expand_sndbuf(const struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+- if (tcp_should_expand_sndbuf(sk)) {
++ if (tp->ops->should_expand_sndbuf(sk)) {
+ tcp_sndbuf_expand(sk);
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
+ {
+ if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
+ sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
+- if (sk->sk_socket &&
+- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
++ if (mptcp(tcp_sk(sk)) ||
++ (sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
+ tcp_new_space(sk);
+ }
+ }
+@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+ /* ... and right edge of window advances far enough.
+ * (tcp_recvmsg() will send ACK otherwise). Or...
+ */
+- __tcp_select_window(sk) >= tp->rcv_wnd) ||
++ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
+ /* We ACK each frame or... */
+ tcp_in_quickack_mode(sk) ||
+ /* We have out of order data. */
+@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
++ /* MPTCP urgent data is not yet supported */
++ if (mptcp(tp))
++ return;
++
+ /* Check if we get a new urgent pointer - normally not. */
+ if (th->urg)
+ tcp_check_urg(sk, th);
+@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
+ }
+
+ #ifdef CONFIG_NET_DMA
+-static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
+- int hlen)
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ int chunk = skb->len - hlen;
+@@ -5052,9 +5132,15 @@ syn_challenge:
+ goto discard;
+ }
+
++ /* If valid: post process the received MPTCP options. */
++ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
++ goto discard;
++
+ return true;
+
+ discard:
++ if (mptcp(tp))
++ mptcp_reset_mopt(tp);
+ __kfree_skb(skb);
+ return false;
+ }
+@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+
+ tp->rx_opt.saw_tstamp = 0;
+
++ /* MPTCP: force slowpath. */
++ if (mptcp(tp))
++ goto slow_path;
++
+ /* pred_flags is 0xS?10 << 16 + snd_wnd
+ * if header_prediction is to be made
+ * 'S' will always be tp->tcp_header_len >> 2
+@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
+ }
+ if (copied_early)
+- tcp_cleanup_rbuf(sk, skb->len);
++ tp->ops->cleanup_rbuf(sk, skb->len);
+ }
+ if (!eaten) {
+ if (tcp_checksum_complete_user(sk, skb))
+@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
+
+ tcp_init_metrics(sk);
+
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ /* Prevent spurious tcp_cwnd_restart() on first data
+ * packet.
+ */
+ tp->lsndtime = tcp_time_stamp;
+
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+
+ if (sock_flag(sk, SOCK_KEEPOPEN))
+ inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
+@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+ /* Get original SYNACK MSS value if user MSS sets mss_clamp */
+ tcp_clear_options(&opt);
+ opt.user_mss = opt.mss_clamp = 0;
+- tcp_parse_options(synack, &opt, 0, NULL);
++ tcp_parse_options(synack, &opt, NULL, 0, NULL);
+ mss = opt.mss_clamp;
+ }
+
+@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+
+ tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
+
+- if (data) { /* Retransmit unacked data in SYN */
++ /* In mptcp case, we do not rely on "retransmit", but instead on
++ * "transmit", because if fastopen data is not acked, the retransmission
++ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
++ */
++ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
+ tcp_for_write_queue_from(data, sk) {
+ if (data == tcp_send_head(sk) ||
+ __tcp_retransmit_skb(sk, data))
+@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_fastopen_cookie foc = { .len = -1 };
+ int saved_clamp = tp->rx_opt.mss_clamp;
++ struct mptcp_options_received mopt;
++ mptcp_init_mp_opt(&mopt);
+
+- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
++ tcp_parse_options(skb, &tp->rx_opt,
++ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
+ tcp_ack(sk, skb, FLAG_SLOWPATH);
+
++ if (tp->request_mptcp || mptcp(tp)) {
++ int ret;
++ ret = mptcp_rcv_synsent_state_process(sk, &sk,
++ skb, &mopt);
++
++ /* May have changed if we support MPTCP */
++ tp = tcp_sk(sk);
++ icsk = inet_csk(sk);
++
++ if (ret == 1)
++ goto reset_and_undo;
++ if (ret == 2)
++ goto discard;
++ }
++
++ if (mptcp(tp) && !is_master_tp(tp)) {
++ /* Timer for repeating the ACK until an answer
++ * arrives. Used only when establishing an additional
++ * subflow inside of an MPTCP connection.
++ */
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ }
++
+ /* Ok.. it's good. Set up sequence numbers and
+ * move to established.
+ */
+@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ if (tcp_is_sack(tp) && sysctl_tcp_fack)
+ tcp_enable_fack(tp);
+
+@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_rcv_fastopen_synack(sk, skb, &foc))
+ return -1;
+
+- if (sk->sk_write_pending ||
++ /* With MPTCP we cannot send data on the third ack due to the
++ * lack of option-space to combine with an MP_CAPABLE.
++ */
++ if (!mptcp(tp) && (sk->sk_write_pending ||
+ icsk->icsk_accept_queue.rskq_defer_accept ||
+- icsk->icsk_ack.pingpong) {
++ icsk->icsk_ack.pingpong)) {
+ /* Save one ACK. Data will be ready after
+ * several ticks, if write_pending is set.
+ *
+@@ -5536,6 +5665,7 @@ discard:
+ tcp_paws_reject(&tp->rx_opt, 0))
+ goto discard_and_undo;
+
++ /* TODO - check this here for MPTCP */
+ if (th->syn) {
+ /* We see SYN without ACK. It is attempt of
+ * simultaneous connect with crossed SYNs.
+@@ -5552,6 +5682,11 @@ discard:
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+
+@@ -5610,6 +5745,7 @@ reset_and_undo:
+
+ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ const struct tcphdr *th, unsigned int len)
++ __releases(&sk->sk_lock.slock)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_SYN_SENT:
+ queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
++ if (is_meta_sk(sk)) {
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ tp = tcp_sk(sk);
++
++ /* Need to call it here, because it will announce new
++ * addresses, which can only be done after the third ack
++ * of the 3-way handshake.
++ */
++ mptcp_update_metasocket(sk, tp->meta_sk);
++ }
+ if (queued >= 0)
+ return queued;
+
+@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_urg(sk, skb, th);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
++ if (mptcp(tp) && is_master_tp(tp))
++ bh_unlock_sock(sk);
+ return 0;
+ }
+
+@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ synack_stamp = tp->lsndtime;
+ /* Make sure socket is routed, for correct metrics. */
+ icsk->icsk_af_ops->rebuild_header(sk);
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ tcp_mtup_init(sk);
+ tp->copied_seq = tp->rcv_nxt;
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+ }
+ smp_mb();
+ tcp_set_state(sk, TCP_ESTABLISHED);
+@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ if (tp->rx_opt.tstamp_ok)
+ tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
++ if (mptcp(tp))
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
+
+ if (req) {
+ /* Re-arm the timer because data may have been sent out.
+@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ tcp_initialize_rcv_mss(sk);
+ tcp_fast_path_on(tp);
++ /* Send an ACK when establishing a new
++ * MPTCP subflow, i.e. using an MP_JOIN
++ * subtype.
++ */
++ if (mptcp(tp) && !is_master_tp(tp))
++ tcp_send_ack(sk);
+ break;
+
+ case TCP_FIN_WAIT1: {
+@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tmo = tcp_fin_time(sk);
+ if (tmo > TCP_TIMEWAIT_LEN) {
+ inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
+- } else if (th->fin || sock_owned_by_user(sk)) {
++ } else if (th->fin || mptcp_is_data_fin(skb) ||
++ sock_owned_by_user(sk)) {
+ /* Bad case. We could lose such FIN otherwise.
+ * It is not a big problem, but it looks confusing
+ * and not so rare event. We still can lose it now,
+@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ inet_csk_reset_keepalive_timer(sk, tmo);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto discard;
+ }
+ break;
+@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_CLOSING:
+ if (tp->snd_una == tp->write_seq) {
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ goto discard;
+ }
+ break;
+@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ goto discard;
+ }
+ break;
++ case TCP_CLOSE:
++ if (tp->mp_killed)
++ goto discard;
+ }
+
+ /* step 6: check the URG bit */
+@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ if (sk->sk_shutdown & RCV_SHUTDOWN) {
+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
+- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp(tp)) {
++ /* In case of mptcp, the reset is handled by
++ * mptcp_rcv_state_process
++ */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+ tcp_reset(sk);
+ return 1;
+@@ -5877,3 +6041,154 @@ discard:
+ return 0;
+ }
+ EXPORT_SYMBOL(tcp_rcv_state_process);
++
++static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ if (family == AF_INET)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
++ &ireq->ir_rmt_addr, port);
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (family == AF_INET6)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
++ &ireq->ir_v6_rmt_addr, port);
++#endif
++}
++
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_options_received tmp_opt;
++ struct request_sock *req;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct dst_entry *dst = NULL;
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false, fastopen;
++ struct flowi fl;
++ struct tcp_fastopen_cookie foc = { .len = -1 };
++ int err;
++
++
++ /* TW buckets are converted to open requests without
++ * limitations, they conserve resources and peer is
++ * evidently real one.
++ */
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++
++ /* Accept backlog is full. If we have already queued enough
++ * of warm entries in syn queue, drop request. It is better than
++ * clogging syn queue with openreqs with exponentially increasing
++ * timeout.
++ */
++ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
++ goto drop;
++ }
++
++ req = inet_reqsk_alloc(rsk_ops);
++ if (!req)
++ goto drop;
++
++ tcp_rsk(req)->af_specific = af_ops;
++
++ tcp_clear_options(&tmp_opt);
++ tmp_opt.mss_clamp = af_ops->mss_clamp;
++ tmp_opt.user_mss = tp->rx_opt.user_mss;
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
++
++ if (want_cookie && !tmp_opt.saw_tstamp)
++ tcp_clear_options(&tmp_opt);
++
++ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
++ tcp_openreq_init(req, &tmp_opt, skb);
++
++ if (af_ops->init_req(req, sk, skb))
++ goto drop_and_free;
++
++ if (security_inet_conn_request(sk, skb, req))
++ goto drop_and_free;
++
++ if (!want_cookie || tmp_opt.tstamp_ok)
++ TCP_ECN_create_request(req, skb, sock_net(sk));
++
++ if (want_cookie) {
++ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
++ req->cookie_ts = tmp_opt.tstamp_ok;
++ } else if (!isn) {
++ /* VJ's idea. We save last timestamp seen
++ * from the destination in peer table, when entering
++ * state TIME-WAIT, and check against it before
++ * accepting new connection request.
++ *
++ * If "isn" is not zero, this request hit alive
++ * timewait bucket, so that all the necessary checks
++ * are made in the function processing timewait state.
++ */
++ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
++ bool strict;
++
++ dst = af_ops->route_req(sk, &fl, req, &strict);
++ if (dst && strict &&
++ !tcp_peer_is_proven(req, dst, true)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
++ goto drop_and_release;
++ }
++ }
++ /* Kill the following clause, if you dislike this way. */
++ else if (!sysctl_tcp_syncookies &&
++ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
++ (sysctl_max_syn_backlog >> 2)) &&
++ !tcp_peer_is_proven(req, dst, false)) {
++ /* Without syncookies last quarter of
++ * backlog is filled with destinations,
++ * proven to be alive.
++ * It means that we continue to communicate
++ * to destinations, already remembered
++ * to the moment of synflood.
++ */
++ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
++ rsk_ops->family);
++ goto drop_and_release;
++ }
++
++ isn = af_ops->init_seq(skb);
++ }
++ if (!dst) {
++ dst = af_ops->route_req(sk, &fl, req, NULL);
++ if (!dst)
++ goto drop_and_free;
++ }
++
++ tcp_rsk(req)->snt_isn = isn;
++ tcp_openreq_init_rwin(req, sk, dst);
++ fastopen = !want_cookie &&
++ tcp_try_fastopen(sk, skb, req, &foc, dst);
++ err = af_ops->send_synack(sk, dst, &fl, req,
++ skb_get_queue_mapping(skb), &foc);
++ if (!fastopen) {
++ if (err || want_cookie)
++ goto drop_and_free;
++
++ tcp_rsk(req)->listener = NULL;
++ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
++ }
++
++ return 0;
++
++drop_and_release:
++ dst_release(dst);
++drop_and_free:
++ reqsk_free(req);
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++EXPORT_SYMBOL(tcp_conn_request);
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..c77017f600f1 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -67,6 +67,8 @@
+ #include <net/icmp.h>
+ #include <net/inet_hashtables.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/transp_v6.h>
+ #include <net/ipv6.h>
+ #include <net/inet_common.h>
+@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
+ struct inet_hashinfo tcp_hashinfo;
+ EXPORT_SYMBOL(tcp_hashinfo);
+
+-static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr,
+@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ struct inet_sock *inet;
+ const int type = icmp_hdr(icmp_skb)->type;
+ const int code = icmp_hdr(icmp_skb)->code;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ struct sk_buff *skb;
+ struct request_sock *fastopen;
+ __u32 seq, snd_una;
+@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ return;
+ }
+
+- bh_lock_sock(sk);
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
+ /* If too many ICMPs get dropped on busy
+ * servers this needs to be solved differently.
+ * We do take care of PMTU discovery (RFC1191) special case :
+ * we can receive locally generated ICMP messages while socket is held.
+ */
+- if (sock_owned_by_user(sk)) {
++ if (sock_owned_by_user(meta_sk)) {
+ if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+ }
+@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ icsk = inet_csk(sk);
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ goto out;
+
+ tp->mtu_info = info;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_v4_mtu_reduced(sk);
+ } else {
+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+ goto out;
+ }
+@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ !icsk->icsk_backoff || fastopen)
+ break;
+
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ break;
+
+ icsk->icsk_backoff--;
+@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet_csk_search_req(sk, &prev, th->dest,
+@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+
+ sk->sk_error_report(sk);
+@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ */
+
+ inet = inet_sk(sk);
+- if (!sock_owned_by_user(sk) && inet->recverr) {
++ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else { /* Only an error on timeout */
+@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
+ * Exception: precedence violation. We do not implement it in any case.
+ */
+
+-static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -702,10 +711,10 @@ release_sk1:
+ outside socket context is ugly, certainly. What can I do?
+ */
+
+-static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key,
+- int reply_flags, u8 tos)
++ int reply_flags, u8 tos, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ #ifdef CONFIG_TCP_MD5SIG
+ + (TCPOLEN_MD5SIG_ALIGNED >> 2)
+ #endif
++#ifdef CONFIG_MPTCP
++ + ((MPTCP_SUB_LEN_DSS >> 2) +
++ (MPTCP_SUB_LEN_ACK >> 2))
++#endif
+ ];
+ } rep;
+ struct ip_reply_arg arg;
+@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ ip_hdr(skb)->daddr, &rep.th);
+ }
+ #endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ int offset = (tsecr) ? 3 : 0;
++ /* Construction of 32-bit data_ack */
++ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ rep.opt[offset] = htonl(data_ack);
++
++ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++ rep.th.doff = arg.iov[0].iov_len / 4;
++ }
++#endif /* CONFIG_MPTCP */
++
+ arg.flags = reply_flags;
+ arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr, /* XXX */
+@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
++
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+
+ tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent,
+ tw->tw_bound_dev_if,
+ tcp_twsk_md5_key(tcptw),
+ tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- tw->tw_tos
++ tw->tw_tos, mptcp
+ );
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
++ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
+ tcp_time_stamp,
+ req->ts_recent,
+ 0,
+ tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
+ AF_INET),
+ inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- ip_hdr(skb)->tos);
++ ip_hdr(skb)->tos, 0);
+ }
+
+ /*
+@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+ * This still operates on a request_sock only, not on a big
+ * socket.
+ */
+-static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ const struct inet_request_sock *ireq = inet_rsk(req);
+ struct flowi4 fl4;
+@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+ return err;
+ }
+
+-static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
+-{
+- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
+-
+- if (!res) {
+- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+- }
+- return res;
+-}
+-
+ /*
+ * IPv4 request_sock destructor.
+ */
+-static void tcp_v4_reqsk_destructor(struct request_sock *req)
++void tcp_v4_reqsk_destructor(struct request_sock *req)
+ {
+ kfree(inet_rsk(req)->opt);
+ }
+@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
+ /*
+ * Save and compile IPv4 options into the request_sock if needed.
+ */
+-static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
+ {
+ const struct ip_options *opt = &(IPCB(skb)->opt);
+ struct ip_options_rcu *dopt = NULL;
+@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+
+ #endif
+
++static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
++ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
++ ireq->no_srccheck = inet_sk(sk)->transparent;
++ ireq->opt = tcp_v4_save_options(skb);
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
++
++ if (strict) {
++ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
++ *strict = true;
++ else
++ *strict = false;
++ }
++
++ return dst;
++}
++
+ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
+ .family = PF_INET,
+ .obj_size = sizeof(struct tcp_request_sock),
+- .rtx_syn_ack = tcp_v4_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v4_reqsk_send_ack,
+ .destructor = tcp_v4_reqsk_destructor,
+ .send_reset = tcp_v4_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
++ .mss_clamp = TCP_MSS_DEFAULT,
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+ .md5_lookup = tcp_v4_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v4_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v4_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v4_init_sequence,
++#endif
++ .route_req = tcp_v4_route_req,
++ .init_seq = tcp_v4_init_sequence,
++ .send_synack = tcp_v4_send_synack,
++ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
++};
+
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct tcp_sock *tp = tcp_sk(sk);
+- struct dst_entry *dst = NULL;
+- __be32 saddr = ip_hdr(skb)->saddr;
+- __be32 daddr = ip_hdr(skb)->daddr;
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- bool want_cookie = false, fastopen;
+- struct flowi4 fl4;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- int err;
+-
+ /* Never answer to SYNs send to broadcast or multicast */
+ if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+ goto drop;
+
+- /* TW buckets are converted to open requests without
+- * limitations, they conserve resources and peer is
+- * evidently real one.
+- */
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- /* Accept backlog is full. If we have already queued enough
+- * of warm entries in syn queue, drop request. It is better than
+- * clogging syn queue with openreqs with exponentially increasing
+- * timeout.
+- */
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet_reqsk_alloc(&tcp_request_sock_ops);
+- if (!req)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
+-
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
++ return tcp_conn_request(&tcp_request_sock_ops,
++ &tcp_request_sock_ipv4_ops, sk, skb);
+
+- ireq = inet_rsk(req);
+- ireq->ir_loc_addr = daddr;
+- ireq->ir_rmt_addr = saddr;
+- ireq->no_srccheck = inet_sk(sk)->transparent;
+- ireq->opt = tcp_v4_save_options(skb);
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_free;
+-
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- if (want_cookie) {
+- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- } else if (!isn) {
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
+- fl4.daddr == saddr) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
+- &saddr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v4_init_sequence(skb);
+- }
+- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v4_send_synack(sk, dst, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_rsk(req)->listener = NULL;
+- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+-
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0;
+@@ -1497,7 +1433,7 @@ put_and_exit:
+ }
+ EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
+
+-static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcphdr *th = tcp_hdr(skb);
+ const struct iphdr *iph = ip_hdr(skb);
+@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v4_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
++
+ }
+ inet_twsk_put(inet_twsk(nsk));
+ return NULL;
+@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v4_do_rcv(sk, skb);
++
+ if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
+ struct dst_entry *dst = sk->sk_rx_dst;
+
+@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
+ } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
+ wake_up_interruptible_sync_poll(sk_sleep(sk),
+ POLLIN | POLLRDNORM | POLLRDBAND);
+- if (!inet_csk_ack_scheduled(sk))
++ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
+ (3 * tcp_rto_min(sk)) / 4,
+ TCP_RTO_MAX);
+@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ {
+ const struct iphdr *iph;
+ const struct tcphdr *th;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff * 4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1759,11 +1729,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1771,16 +1751,16 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v4_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+
+@@ -1835,6 +1815,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
+@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
+
+ tcp_cleanup_congestion_control(sk);
+
++ if (mptcp(tp))
++ mptcp_destroy_sock(sk);
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++
+ /* Cleanup up the write buffer. */
+ tcp_write_queue_purge(sk);
+
+@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
+ }
+ #endif /* CONFIG_PROC_FS */
+
++#ifdef CONFIG_MPTCP
++static void tcp_v4_clear_sk(struct sock *sk, int size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* we do not want to clear tk_table field, because of RCU lookups */
++ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
++
++ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
++}
++#endif
++
+ struct proto tcp_prot = {
+ .name = "TCP",
+ .owner = THIS_MODULE,
+@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
+ .destroy_cgroup = tcp_destroy_cgroup,
+ .proto_cgroup = tcp_proto_cgroup,
+ #endif
++#ifdef CONFIG_MPTCP
++ .clear_sk = tcp_v4_clear_sk,
++#endif
+ };
+ EXPORT_SYMBOL(tcp_prot);
+
+diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
+index e68e0d4af6c9..ae6946857dff 100644
+--- a/net/ipv4/tcp_minisocks.c
++++ b/net/ipv4/tcp_minisocks.c
+@@ -18,11 +18,13 @@
+ * Jorge Cwik, <jorge@laser.satlink.net>
+ */
+
++#include <linux/kconfig.h>
+ #include <linux/mm.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/sysctl.h>
+ #include <linux/workqueue.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/inet_common.h>
+ #include <net/xfrm.h>
+@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ struct tcp_options_received tmp_opt;
+ struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
+ bool paws_reject = false;
++ struct mptcp_options_received mopt;
+
+ tmp_opt.saw_tstamp = 0;
+ if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ mptcp_init_mp_opt(&mopt);
++
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
+@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
+ paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
+ }
++
++ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
++ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
++ goto kill_with_rst;
++ }
+ }
+
+ if (tw->tw_substate == TCP_FIN_WAIT2) {
+@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ if (!th->ack ||
+ !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
+ TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
++ /* If mptcp_is_data_fin() returns true, we are sure that
++ * mopt has been initialized - otherwise it would not
++ * be a DATA_FIN.
++ */
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
++ mptcp_is_data_fin(skb) &&
++ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
++ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
++ return TCP_TW_ACK;
++
+ inet_twsk_put(tw);
+ return TCP_TW_SUCCESS;
+ }
+@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+ tcptw->tw_ts_offset = tp->tsoffset;
+
++ if (mptcp(tp)) {
++ if (mptcp_init_tw_sock(sk, tcptw)) {
++ inet_twsk_free(tw);
++ goto exit;
++ }
++ } else {
++ tcptw->mptcp_tw = NULL;
++ }
++
+ #if IS_ENABLED(CONFIG_IPV6)
+ if (tw->tw_family == PF_INET6) {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
+ }
+
++exit:
+ tcp_update_metrics(sk);
+ tcp_done(sk);
+ }
+
+ void tcp_twsk_destructor(struct sock *sk)
+ {
+-#ifdef CONFIG_TCP_MD5SIG
+ struct tcp_timewait_sock *twsk = tcp_twsk(sk);
+
++ if (twsk->mptcp_tw)
++ mptcp_twsk_destructor(twsk);
++#ifdef CONFIG_TCP_MD5SIG
+ if (twsk->tw_md5_key)
+ kfree_rcu(twsk->tw_md5_key, rcu);
+ #endif
+@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
+ req->window_clamp = tcp_full_space(sk);
+
+ /* tcp_full_space because it is guaranteed to be the first packet */
+- tcp_select_initial_window(tcp_full_space(sk),
+- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
++ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
+ &req->rcv_wnd,
+ &req->window_clamp,
+ ireq->wscale_ok,
+ &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ dst_metric(dst, RTAX_INITRWND), sk);
+ ireq->rcv_wscale = rcv_wscale;
+ }
+ EXPORT_SYMBOL(tcp_openreq_init_rwin);
+@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
+ newtp->rx_opt.ts_recent_stamp = 0;
+ newtp->tcp_header_len = sizeof(struct tcphdr);
+ }
++ if (ireq->saw_mpc)
++ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
+ newtp->tsoffset = 0;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->md5sig_info = NULL; /*XXX*/
+@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ bool fastopen)
+ {
+ struct tcp_options_received tmp_opt;
++ struct mptcp_options_received mopt;
+ struct sock *child;
+ const struct tcphdr *th = tcp_hdr(skb);
+ __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
+ bool paws_reject = false;
+
+- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
++ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
+
+ tmp_opt.saw_tstamp = 0;
++
++ mptcp_init_mp_opt(&mopt);
++
+ if (th->doff > (sizeof(struct tcphdr)>>2)) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.ts_recent = req->ts_recent;
+@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ *
+ * Reset timer after retransmitting SYNACK, similar to
+ * the idea of fast retransmit in recovery.
++ *
++ * Fall back to TCP if MP_CAPABLE is not set.
+ */
++
++ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
++ inet_rsk(req)->saw_mpc = false;
++
++
+ if (!inet_rtx_syn_ack(sk, req))
+ req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
+ TCP_RTO_MAX) + jiffies;
+@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ * socket is created, wait for troubles.
+ */
+ child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
++
+ if (child == NULL)
+ goto listen_overflow;
+
++ if (!is_meta_sk(sk)) {
++ int ret = mptcp_check_req_master(sk, child, req, prev);
++ if (ret < 0)
++ goto listen_overflow;
++
++ /* MPTCP-supported */
++ if (!ret)
++ return tcp_sk(child)->mpcb->master_sk;
++ } else {
++ return mptcp_check_req_child(sk, child, req, prev, &mopt);
++ }
+ inet_csk_reqsk_queue_unlink(sk, req, prev);
+ inet_csk_reqsk_queue_removed(sk, req);
+
+@@ -746,7 +804,17 @@ embryonic_reset:
+ tcp_reset(sk);
+ }
+ if (!fastopen) {
+- inet_csk_reqsk_queue_drop(sk, req, prev);
++ if (is_meta_sk(sk)) {
++ /* We want to avoid stopping the keepalive-timer and so
++ * avoid ending up in inet_csk_reqsk_queue_removed ...
++ */
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
++ mptcp_delete_synack_timer(sk);
++ reqsk_free(req);
++ } else {
++ inet_csk_reqsk_queue_drop(sk, req, prev);
++ }
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ }
+ return NULL;
+@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ {
+ int ret = 0;
+ int state = child->sk_state;
++ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
+
+- if (!sock_owned_by_user(child)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
+ skb->len);
+ /* Wakeup parent, send SIGIO */
+@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ * in main socket hash table and lock on listening
+ * socket does not protect us more.
+ */
+- __sk_add_backlog(child, skb);
++ if (mptcp(tcp_sk(child)))
++ skb->sk = child;
++ __sk_add_backlog(meta_sk, skb);
+ }
+
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ return ret;
+ }
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..efd31b6c5784 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -36,6 +36,12 @@
+
+ #define pr_fmt(fmt) "TCP: " fmt
+
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++#include <net/ipv6.h>
+ #include <net/tcp.h>
+
+ #include <linux/compiler.h>
+@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+ unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+ EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+- int push_one, gfp_t gfp);
+-
+ /* Account for new data that has been sent to the network. */
+-static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
+ void tcp_select_initial_window(int __space, __u32 mss,
+ __u32 *rcv_wnd, __u32 *window_clamp,
+ int wscale_ok, __u8 *rcv_wscale,
+- __u32 init_rcv_wnd)
++ __u32 init_rcv_wnd, const struct sock *sk)
+ {
+ unsigned int space = (__space < 0 ? 0 : __space);
+
+@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
+ * value can be stuffed directly into th->window for an outgoing
+ * frame.
+ */
+-static u16 tcp_select_window(struct sock *sk)
++u16 tcp_select_window(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 old_win = tp->rcv_wnd;
+- u32 cur_win = tcp_receive_window(tp);
+- u32 new_win = __tcp_select_window(sk);
++ /* The window must never shrink at the meta-level. At the subflow level we
++ * have to allow this. Otherwise we may announce a window too large
++ * for the current meta-level sk_rcvbuf.
++ */
++ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
++ u32 new_win = tp->ops->__select_window(sk);
+
+ /* Never shrink the offered window */
+ if (new_win < cur_win) {
+@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ }
++
+ tp->rcv_wnd = new_win;
+ tp->rcv_wup = tp->rcv_nxt;
+
+@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
+ /* Constructs common control bits of non-data skb. If SYN/FIN is present,
+ * auto increment end seqno.
+ */
+-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ TCP_SKB_CB(skb)->end_seq = seq;
+ }
+
+-static inline bool tcp_urg_mode(const struct tcp_sock *tp)
++bool tcp_urg_mode(const struct tcp_sock *tp)
+ {
+ return tp->snd_una != tp->snd_up;
+ }
+@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
+ #define OPTION_MD5 (1 << 2)
+ #define OPTION_WSCALE (1 << 3)
+ #define OPTION_FAST_OPEN_COOKIE (1 << 8)
+-
+-struct tcp_out_options {
+- u16 options; /* bit field of OPTION_* */
+- u16 mss; /* 0 to disable */
+- u8 ws; /* window scale, 0 to disable */
+- u8 num_sack_blocks; /* number of SACK blocks to include */
+- u8 hash_size; /* bytes in hash_location */
+- __u8 *hash_location; /* temporary pointer, overloaded */
+- __u32 tsval, tsecr; /* need to include OPTION_TS */
+- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
+-};
++/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
+
+ /* Write previously computed TCP options to the packet.
+ *
+@@ -430,7 +428,7 @@ struct tcp_out_options {
+ * (but it may well be that other scenarios fail similarly).
+ */
+ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+- struct tcp_out_options *opts)
++ struct tcp_out_options *opts, struct sk_buff *skb)
+ {
+ u16 options = opts->options; /* mungable copy */
+
+@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+ }
+ ptr += (foc->len + 3) >> 2;
+ }
++
++ if (unlikely(OPTION_MPTCP & opts->options))
++ mptcp_options_write(ptr, tp, opts, skb);
+ }
+
+ /* Compute TCP options for SYN packets. This is not the final
+@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
+ if (unlikely(!(OPTION_TS & opts->options)))
+ remaining -= TCPOLEN_SACKPERM_ALIGNED;
+ }
++ if (tp->request_mptcp || mptcp(tp))
++ mptcp_syn_options(sk, opts, &remaining);
+
+ if (fastopen && fastopen->cookie.len >= 0) {
+ u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
+ }
+ }
+
++ if (ireq->saw_mpc)
++ mptcp_synack_options(req, opts, &remaining);
++
+ return MAX_TCP_OPTION_SPACE - remaining;
+ }
+
+@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
+ opts->tsecr = tp->rx_opt.ts_recent;
+ size += TCPOLEN_TSTAMP_ALIGNED;
+ }
++ if (mptcp(tp))
++ mptcp_established_options(sk, skb, opts, &size);
+
+ eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
+ if (unlikely(eff_sacks)) {
+- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+- opts->num_sack_blocks =
+- min_t(unsigned int, eff_sacks,
+- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+- TCPOLEN_SACK_PERBLOCK);
+- size += TCPOLEN_SACK_BASE_ALIGNED +
+- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
++ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
++ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
++ opts->num_sack_blocks = 0;
++ else
++ opts->num_sack_blocks =
++ min_t(unsigned int, eff_sacks,
++ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
++ TCPOLEN_SACK_PERBLOCK);
++ if (opts->num_sack_blocks)
++ size += TCPOLEN_SACK_BASE_ALIGNED +
++ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ }
+
+ return size;
+@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
+ if ((1 << sk->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
+ TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
+- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+- 0, GFP_ATOMIC);
++ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
++ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
+ }
+ /*
+ * One tasklet per cpu tries to send more skbs.
+@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
+ unsigned long flags;
+ struct list_head *q, *n;
+ struct tcp_sock *tp;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+
+ local_irq_save(flags);
+ list_splice_init(&tsq->head, &list);
+@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
+ list_del(&tp->tsq_node);
+
+ sk = (struct sock *)tp;
+- bh_lock_sock(sk);
++ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ bh_lock_sock(meta_sk);
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_tsq_handler(sk);
++ if (mptcp(tp))
++ tcp_tsq_handler(meta_sk);
+ } else {
++ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
++ goto exit;
++
+ /* defer the work to tcp_release_cb() */
+ set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
++
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++exit:
++ bh_unlock_sock(meta_sk);
+
+ clear_bit(TSQ_QUEUED, &tp->tsq_flags);
+ sk_free(sk);
+@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
+ #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
+ (1UL << TCP_WRITE_TIMER_DEFERRED) | \
+ (1UL << TCP_DELACK_TIMER_DEFERRED) | \
+- (1UL << TCP_MTU_REDUCED_DEFERRED))
++ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
++ (1UL << MPTCP_PATH_MANAGER) | \
++ (1UL << MPTCP_SUB_DEFERRED))
++
+ /**
+ * tcp_release_cb - tcp release_sock() callback
+ * @sk: socket
+@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
+ sk->sk_prot->mtu_reduced(sk);
+ __sock_put(sk);
+ }
++ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
++ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
++ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
++ __sock_put(sk);
++ }
++ if (flags & (1UL << MPTCP_SUB_DEFERRED))
++ mptcp_tsq_sub_deferred(sk);
+ }
+ EXPORT_SYMBOL(tcp_release_cb);
+
+@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
+ * We are working here with either a clone of the original
+ * SKB, or a fresh unique copy made by the retransmit engine.
+ */
+-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+- gfp_t gfp_mask)
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask)
+ {
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+ struct inet_sock *inet;
+@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ */
+ th->window = htons(min(tp->rcv_wnd, 65535U));
+ } else {
+- th->window = htons(tcp_select_window(sk));
++ th->window = htons(tp->ops->select_window(sk));
+ }
+ th->check = 0;
+ th->urg_ptr = 0;
+@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ }
+ }
+
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
+ TCP_ECN_send(sk, skb, tcp_header_size);
+
+@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
+ * otherwise socket can stall.
+ */
+-static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ }
+
+ /* Initialize TSO segments for a packet. */
+-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ /* Make sure we own this skb before messing gso_size/gso_segs */
+ WARN_ON_ONCE(skb_cloned(skb));
+
+- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
++ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
++ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
+ /* Avoid the costly divide in the normal
+ * non-TSO case.
+ */
+@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
+ /* Pcount in the middle of the write queue got changed, we need to do various
+ * tweaks to fix counters
+ */
+-static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
+ * eventually). The difference is that pulled data not copied, but
+ * immediately discarded.
+ */
+-static void __pskb_trim_head(struct sk_buff *skb, int len)
++void __pskb_trim_head(struct sk_buff *skb, int len)
+ {
+ struct skb_shared_info *shinfo;
+ int i, k, eat;
+@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
+ /* Remove acked data from a packet in the transmit queue. */
+ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ {
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
++ return mptcp_trim_head(sk, skb, len);
++
+ if (skb_unclone(skb, GFP_ATOMIC))
+ return -ENOMEM;
+
+@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ if (tcp_skb_pcount(skb) > 1)
+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
+
++#ifdef CONFIG_MPTCP
++ /* Some data got acked - we assume that the seq-number reached the dest.
++ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
++ * Only remove the SEQ if the call does not come from a meta retransmit.
++ */
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
++#endif
++
+ return 0;
+ }
+
+@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
+
+ return mss_now;
+ }
++EXPORT_SYMBOL(tcp_current_mss);
+
+ /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
+ * As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
+ * But we can avoid doing the divide again given we already have
+ * skb_pcount = skb->len / mss_now
+ */
+-static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+- const struct sk_buff *skb)
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb)
+ {
+ if (skb->len < tcp_skb_pcount(skb) * mss_now)
+ tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
+ (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
+ }
+ /* Returns the portion of skb which can be sent right away */
+-static unsigned int tcp_mss_split_point(const struct sock *sk,
+- const struct sk_buff *skb,
+- unsigned int mss_now,
+- unsigned int max_segs,
+- int nonagle)
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u32 partial, needed, window, max_len;
+@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
+ /* Can at least one segment of SKB be sent right now, according to the
+ * congestion window rules? If so, return how many segments are allowed.
+ */
+-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb)
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
++ const struct sk_buff *skb)
+ {
+ u32 in_flight, cwnd;
+
+ /* Don't be strict about the congestion window for the final FIN. */
+- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
++ if (skb &&
++ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
+ tcp_skb_pcount(skb) == 1)
+ return 1;
+
+@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+ * This must be invoked the first time we consider transmitting
+ * SKB onto the wire.
+ */
+-static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ int tso_segs = tcp_skb_pcount(skb);
+
+@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ /* Return true if the Nagle test allows this packet to be
+ * sent now.
+ */
+-static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
+- unsigned int cur_mss, int nonagle)
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle)
+ {
+ /* Nagle rule does not apply to frames, which sit in the middle of the
+ * write_queue (they have no chances to get new data).
+@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ return true;
+
+ /* Don't use the nagle rule for urgent data (or for the final FIN). */
+- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
++ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
++ mptcp_is_data_fin(skb))
+ return true;
+
+ if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
+@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ }
+
+ /* Does at least the first segment of SKB fit into the send window? */
+-static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb,
+- unsigned int cur_mss)
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss)
+ {
+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
+
+@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
+ u32 send_win, cong_win, limit, in_flight;
+ int win_divisor;
+
+- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
++ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
+ goto send_now;
+
+ if (icsk->icsk_ca_state != TCP_CA_Open)
+@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
+ * Returns true, if no segments are in flight and we have queued segments,
+ * but cannot send anything now because of SWS or another problem.
+ */
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ int push_one, gfp_t gfp)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+
+ sent_pkts = 0;
+
+- if (!push_one) {
++ /* pmtu not yet supported with MPTCP. Should be possible by early
++ * exiting the loop inside tcp_mtu_probe, making sure that only one
++ * single DSS-mapping gets probed.
++ */
++ if (!push_one && !mptcp(tp)) {
+ /* Do MTU probing. */
+ result = tcp_mtu_probe(sk);
+ if (!result) {
+@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+ int err = -1;
+
+ if (tcp_send_head(sk) != NULL) {
+- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
++ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
++ GFP_ATOMIC);
+ goto rearm_timer;
+ }
+
+@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+ if (unlikely(sk->sk_state == TCP_CLOSE))
+ return;
+
+- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
+- sk_gfp_atomic(sk, GFP_ATOMIC)))
++ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
++ sk_gfp_atomic(sk, GFP_ATOMIC)))
+ tcp_check_probe_timer(sk);
+ }
+
+@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+
+ BUG_ON(!skb || skb->len < mss_now);
+
+- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
++ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
++ sk->sk_allocation);
+ }
+
+ /* This function returns the amount that we can raise the
+@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
+ return;
+
++ /* Currently not supported for MPTCP - but it should be possible */
++ if (mptcp(tp))
++ return;
++
+ tcp_for_write_queue_from_safe(skb, tmp, sk) {
+ if (!tcp_can_collapse(sk, skb))
+ break;
+@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+
+ /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
+ th->window = htons(min(req->rcv_wnd, 65535U));
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ th->doff = (tcp_header_size >> 2);
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
+
+@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+ (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
+ tp->window_clamp = tcp_full_space(sk);
+
+- tcp_select_initial_window(tcp_full_space(sk),
+- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
+- &tp->rcv_wnd,
+- &tp->window_clamp,
+- sysctl_tcp_window_scaling,
+- &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
++ &tp->rcv_wnd,
++ &tp->window_clamp,
++ sysctl_tcp_window_scaling,
++ &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ tp->rx_opt.rcv_wscale = rcv_wscale;
+ tp->rcv_ssthresh = tp->rcv_wnd;
+@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+ inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_retransmits = 0;
+ tcp_clear_retrans(tp);
++
++#ifdef CONFIG_MPTCP
++ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
++ if (is_master_tp(tp)) {
++ tp->request_mptcp = 1;
++ mptcp_connect_init(sk);
++ } else if (tp->mptcp) {
++ struct inet_sock *inet = inet_sk(sk);
++
++ tp->mptcp->snt_isn = tp->write_seq;
++ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
++
++ /* Set nonce for new subflows */
++ if (sk->sk_family == AF_INET)
++ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
++ inet->inet_saddr,
++ inet->inet_daddr,
++ inet->inet_sport,
++ inet->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
++ inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ inet->inet_sport,
++ inet->inet_dport);
++#endif
++ }
++ }
++#endif
+ }
+
+ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+ TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
+ }
++EXPORT_SYMBOL(tcp_send_ack);
+
+ /* This routine sends a packet with an out of date sequence
+ * number. It assumes the other end will try to ack it.
+@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+ * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
+ * out-of-date with SND.UNA-1 to probe window.
+ */
+-static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
++int tcp_xmit_probe_skb(struct sock *sk, int urgent)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
+@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+ struct tcp_sock *tp = tcp_sk(sk);
+ int err;
+
+- err = tcp_write_wakeup(sk);
++ err = tp->ops->write_wakeup(sk);
+
+ if (tp->packets_out || !tcp_send_head(sk)) {
+ /* Cancel probe timer, if it is not required. */
+@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+ TCP_RTO_MAX);
+ }
+ }
++
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
++{
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
++ int res;
++
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
++ if (!res) {
++ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
++ }
++ return res;
++}
++EXPORT_SYMBOL(tcp_rtx_synack);
+diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
+index 286227abed10..966b873cbf3e 100644
+--- a/net/ipv4/tcp_timer.c
++++ b/net/ipv4/tcp_timer.c
+@@ -20,6 +20,7 @@
+
+ #include <linux/module.h>
+ #include <linux/gfp.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+
+ int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
+@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+ int sysctl_tcp_orphan_retries __read_mostly;
+ int sysctl_tcp_thin_linear_timeouts __read_mostly;
+
+-static void tcp_write_err(struct sock *sk)
++void tcp_write_err(struct sock *sk)
+ {
+ sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
+ sk->sk_error_report(sk);
+@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
+ (!tp->snd_wnd && !tp->packets_out))
+ do_reset = 1;
+ if (do_reset)
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_done(sk);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
+ return 1;
+@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+ * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
+ * syn_set flag is set.
+ */
+-static bool retransmits_timed_out(struct sock *sk,
+- unsigned int boundary,
+- unsigned int timeout,
+- bool syn_set)
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set)
+ {
+ unsigned int linear_backoff_thresh, start_ts;
+ unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
+ }
+
+ /* A write timeout has occurred. Process the after effects. */
+-static int tcp_write_timeout(struct sock *sk)
++int tcp_write_timeout(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
+ }
+ retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+ syn_set = true;
++ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
++ if (tcp_sk(sk)->request_mptcp &&
++ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
++ tcp_sk(sk)->request_mptcp = 0;
+ } else {
+ if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
+ /* Black hole detection */
+@@ -251,18 +254,22 @@ out:
+ static void tcp_delack_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_delack_timer_handler(sk);
+ } else {
+ inet_csk(sk)->icsk_ack.blocked = 1;
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -479,6 +486,10 @@ out_reset_timer:
+ __sk_dst_reset(sk);
+
+ out:;
++ if (mptcp(tp)) {
++ mptcp_reinject_data(sk, 1);
++ mptcp_set_rto(sk);
++ }
+ }
+
+ void tcp_write_timer_handler(struct sock *sk)
+@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
+ break;
+ case ICSK_TIME_RETRANS:
+ icsk->icsk_pending = 0;
+- tcp_retransmit_timer(sk);
++ tcp_sk(sk)->ops->retransmit_timer(sk);
+ break;
+ case ICSK_TIME_PROBE0:
+ icsk->icsk_pending = 0;
+@@ -520,16 +531,19 @@ out:
+ static void tcp_write_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_write_timer_handler(sk);
+ } else {
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
+ struct sock *sk = (struct sock *) data;
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+ u32 elapsed;
+
+ /* Only process if socket is not in use. */
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
+ /* Try again later. */
+ inet_csk_reset_keepalive_timer (sk, HZ/20);
+ goto out;
+@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
+ goto out;
+ }
+
++ if (tp->send_mp_fclose) {
++ /* MUST do this before tcp_write_timeout, because retrans_stamp
++ * may have been set to 0 in another part while we are
++ * retransmitting MP_FASTCLOSE. Then, we would crash, because
++ * retransmits_timed_out accesses the meta-write-queue.
++ *
++ * We make sure that the timestamp is != 0.
++ */
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk))
++ goto out;
++
++ tcp_send_ack(sk);
++ icsk->icsk_retransmits++;
++
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ elapsed = icsk->icsk_rto;
++ goto resched;
++ }
++
+ if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
+ if (tp->linger2 >= 0) {
+ const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
+
+ if (tmo > 0) {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto out;
+ }
+ }
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ goto death;
+ }
+
+@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
+ icsk->icsk_probes_out > 0) ||
+ (icsk->icsk_user_timeout == 0 &&
+ icsk->icsk_probes_out >= keepalive_probes(tp))) {
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_write_err(sk);
+ goto out;
+ }
+- if (tcp_write_wakeup(sk) <= 0) {
++ if (tp->ops->write_wakeup(sk) <= 0) {
+ icsk->icsk_probes_out++;
+ elapsed = keepalive_intvl_when(tp);
+ } else {
+@@ -642,7 +679,7 @@ death:
+ tcp_done(sk);
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..7139c2973fd2 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
+
+ kfree_rcu(ifp, rcu);
+ }
++EXPORT_SYMBOL(inet6_ifa_finish_destroy);
+
+ static void
+ ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
+index 7cb4392690dd..7057afbca4df 100644
+--- a/net/ipv6/af_inet6.c
++++ b/net/ipv6/af_inet6.c
+@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
+ return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
+ }
+
+-static int inet6_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct inet_sock *inet;
+ struct ipv6_pinfo *np;
+diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
+index a245e5ddffbd..99c892b8992d 100644
+--- a/net/ipv6/inet6_connection_sock.c
++++ b/net/ipv6/inet6_connection_sock.c
+@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
+ /*
+ * request_sock (formerly open request) hash tables.
+ */
+-static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize)
+ {
+ u32 c;
+
+diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
+index edb58aff4ae7..ea4d9fda0927 100644
+--- a/net/ipv6/ipv6_sockglue.c
++++ b/net/ipv6/ipv6_sockglue.c
+@@ -48,6 +48,8 @@
+ #include <net/addrconf.h>
+ #include <net/inet_common.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/xfrm.h>
+@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+ sock_prot_inuse_add(net, &tcp_prot, 1);
+ local_bh_enable();
+ sk->sk_prot = &tcp_prot;
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+ sk->sk_socket->ops = &inet_stream_ops;
+ sk->sk_family = PF_INET;
+ tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
+diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
+index a822b880689b..b2b38869d795 100644
+--- a/net/ipv6/syncookies.c
++++ b/net/ipv6/syncookies.c
+@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+
+ ret = NULL;
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
++ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
+ if (!req)
+ goto out;
+
+@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+ }
+
+ req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..fda94d71666e 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -63,6 +63,8 @@
+ #include <net/inet_common.h>
+ #include <net/secure_seq.h>
+ #include <net/tcp_memcontrol.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
+ #include <net/busy_poll.h>
+
+ #include <linux/proc_fs.h>
+@@ -71,12 +73,6 @@
+ #include <linux/crypto.h>
+ #include <linux/scatterlist.h>
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req);
+-
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
+-
+ static const struct inet_connection_sock_af_ops ipv6_mapped;
+ static const struct inet_connection_sock_af_ops ipv6_specific;
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+ }
+ #endif
+
+-static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct dst_entry *dst = skb_dst(skb);
+ const struct rt6_info *rt = (const struct rt6_info *)dst;
+@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+ }
+
+-static void tcp_v6_hash(struct sock *sk)
++void tcp_v6_hash(struct sock *sk)
+ {
+ if (sk->sk_state != TCP_CLOSE) {
+- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
++ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
++ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
+ tcp_prot.hash(sk);
+ return;
+ }
+@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
+ }
+ }
+
+-static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
+ ipv6_hdr(skb)->saddr.s6_addr32,
+@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ tcp_hdr(skb)->source);
+ }
+
+-static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ int addr_len)
+ {
+ struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
+@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ sin.sin_port = usin->sin6_port;
+ sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
+
+- icsk->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_mapped;
+ sk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+
+ if (err) {
+ icsk->icsk_ext_hdr_len = exthdrlen;
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+ sk->sk_backlog_rcv = tcp_v6_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_specific;
+@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+ const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
+ struct ipv6_pinfo *np;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ int err;
+ struct tcp_sock *tp;
+ struct request_sock *fastopen;
+@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ return;
+ }
+
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+
+ if (sk->sk_state == TCP_CLOSE)
+@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+
+ tp->mtu_info = ntohl(info);
+- if (!sock_owned_by_user(sk))
++ if (!sock_owned_by_user(meta_sk))
+ tcp_v6_mtu_reduced(sk);
+- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
++ else {
++ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
+ &tp->tsq_flags))
+- sock_hold(sk);
++ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
++ }
+ goto out;
+ }
+
+@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
+@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
+
+@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- if (!sock_owned_by_user(sk) && np->recverr) {
++ if (!sock_owned_by_user(meta_sk) && np->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else
+ sk->sk_err_soft = err;
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+
+-static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct flowi6 *fl6,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ struct inet_request_sock *ireq = inet_rsk(req);
+ struct ipv6_pinfo *np = inet6_sk(sk);
++ struct flowi6 *fl6 = &fl->u.ip6;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+
+@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+ skb_set_queue_mapping(skb, queue_mapping);
+ err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
+ err = net_xmit_eval(err);
++ if (!tcp_rsk(req)->snt_synack && !err)
++ tcp_rsk(req)->snt_synack = tcp_time_stamp;
+ }
+
+ done:
+ return err;
+ }
+
+-static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ {
+- struct flowi6 fl6;
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
+ int res;
+
+- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
+ if (!res) {
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ return res;
+ }
+
+-static void tcp_v6_reqsk_destructor(struct request_sock *req)
++void tcp_v6_reqsk_destructor(struct request_sock *req)
+ {
+ kfree_skb(inet_rsk(req)->pktopts);
+ }
+@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+ }
+ #endif
+
++static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++ struct ipv6_pinfo *np = inet6_sk(sk);
++
++ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
++ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
++
++ ireq->ir_iif = sk->sk_bound_dev_if;
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ /* So that link locals have meaning */
++ if (!sk->sk_bound_dev_if &&
++ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
++ ireq->ir_iif = inet6_iif(skb);
++
++ if (!TCP_SKB_CB(skb)->when &&
++ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
++ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
++ np->rxopt.bits.rxohlim || np->repflow)) {
++ atomic_inc(&skb->users);
++ ireq->pktopts = skb;
++ }
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ if (strict)
++ *strict = true;
++ return inet6_csk_route_req(sk, &fl->u.ip6, req);
++}
++
+ struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
+ .family = AF_INET6,
+ .obj_size = sizeof(struct tcp6_request_sock),
+- .rtx_syn_ack = tcp_v6_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v6_reqsk_send_ack,
+ .destructor = tcp_v6_reqsk_destructor,
+ .send_reset = tcp_v6_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
++ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
++ sizeof(struct ipv6hdr),
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
+ .md5_lookup = tcp_v6_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v6_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v6_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v6_init_sequence,
++#endif
++ .route_req = tcp_v6_route_req,
++ .init_seq = tcp_v6_init_sequence,
++ .send_synack = tcp_v6_send_synack,
++ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
++};
+
+-static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+- u32 tsval, u32 tsecr, int oif,
+- struct tcp_md5sig_key *key, int rst, u8 tclass,
+- u32 label)
++static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
++ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
++ int oif, struct tcp_md5sig_key *key, int rst,
++ u8 tclass, u32 label, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct tcphdr *t1;
+@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ if (key)
+ tot_len += TCPOLEN_MD5SIG_ALIGNED;
+ #endif
+-
++#ifdef CONFIG_MPTCP
++ if (mptcp)
++ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++#endif
+ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
+ GFP_ATOMIC);
+ if (buff == NULL)
+@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ tcp_v6_md5_hash_hdr((__u8 *)topt, key,
+ &ipv6_hdr(skb)->saddr,
+ &ipv6_hdr(skb)->daddr, t1);
++ topt += 4;
++ }
++#endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ /* Construction of 32-bit data_ack */
++ *topt++ = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ *topt++ = htonl(data_ack);
+ }
+ #endif
+
+@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ kfree_skb(buff);
+ }
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ u32 seq = 0, ack_seq = 0;
+@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ (th->doff << 2);
+
+ oif = sk ? sk->sk_bound_dev_if : 0;
+- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
++ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
+
+ #ifdef CONFIG_TCP_MD5SIG
+ release_sk1:
+@@ -902,45 +983,52 @@ release_sk1:
+ #endif
+ }
+
+-static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key, u8 tclass,
+- u32 label)
++ u32 label, int mptcp)
+ {
+- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
+- label);
++ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
++ key, 0, tclass, label, mptcp);
+ }
+
+ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
+
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+ tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
+- tw->tw_tclass, (tw->tw_flowlabel << 12));
++ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt,
++ tcp_rsk(req)->rcv_nxt, 0,
+ req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
+ tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
+- 0, 0);
++ 0, 0, 0);
+ }
+
+
+-static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct request_sock *req, **prev;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v6_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
+ }
+ inet_twsk_put(inet_twsk(nsk));
+@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ return sk;
+ }
+
+-/* FIXME: this is substantially similar to the ipv4 code.
+- * Can some kind of merge be done? -- erics
+- */
+-static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct ipv6_pinfo *np = inet6_sk(sk);
+- struct tcp_sock *tp = tcp_sk(sk);
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- struct dst_entry *dst = NULL;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- bool want_cookie = false, fastopen;
+- struct flowi6 fl6;
+- int err;
+-
+ if (skb->protocol == htons(ETH_P_IP))
+ return tcp_v4_conn_request(sk, skb);
+
+ if (!ipv6_unicast_destination(skb))
+ goto drop;
+
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
+- if (req == NULL)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
++ return tcp_conn_request(&tcp6_request_sock_ops,
++ &tcp_request_sock_ipv6_ops, sk, skb);
+
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
+-
+- ireq = inet_rsk(req);
+- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
+- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- ireq->ir_iif = sk->sk_bound_dev_if;
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- /* So that link locals have meaning */
+- if (!sk->sk_bound_dev_if &&
+- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
+- ireq->ir_iif = inet6_iif(skb);
+-
+- if (!isn) {
+- if (ipv6_opt_accepted(sk, skb) ||
+- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
+- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
+- np->repflow) {
+- atomic_inc(&skb->users);
+- ireq->pktopts = skb;
+- }
+-
+- if (want_cookie) {
+- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- goto have_isn;
+- }
+-
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
+- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v6_init_sequence(skb);
+- }
+-have_isn:
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_release;
+-
+- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v6_send_synack(sk, dst, &fl6, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->listener = NULL;
+- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0; /* don't send reset */
+ }
+
+-static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req,
+- struct dst_entry *dst)
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst)
+ {
+ struct inet_request_sock *ireq;
+ struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
+@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+
+ newsk->sk_v6_rcv_saddr = newnp->saddr;
+
+- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(newsk))
++ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
+ newsk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -1329,7 +1292,7 @@ out:
+ * This is because we cannot sleep with the original spinlock
+ * held.
+ */
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct tcp_sock *tp;
+@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v6_do_rcv(sk, skb);
++
+ if (sk_filter(sk, skb))
+ goto discard;
+
+@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ {
+ const struct tcphdr *th;
+ const struct ipv6hdr *hdr;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff*4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1529,11 +1520,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1541,16 +1542,17 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v6_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+ return ret ? -1 : 0;
+@@ -1607,6 +1609,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
+ }
+ }
+
+-static struct timewait_sock_ops tcp6_timewait_sock_ops = {
++struct timewait_sock_ops tcp6_timewait_sock_ops = {
+ .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
+ .twsk_unique = tcp_twsk_unique,
+ .twsk_destructor = tcp_twsk_destructor,
+@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
+@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
+ return 0;
+ }
+
+-static void tcp_v6_destroy_sock(struct sock *sk)
++void tcp_v6_destroy_sock(struct sock *sk)
+ {
+ tcp_v4_destroy_sock(sk);
+ inet6_destroy_sock(sk);
+@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
+ static void tcp_v6_clear_sk(struct sock *sk, int size)
+ {
+ struct inet_sock *inet = inet_sk(sk);
++#ifdef CONFIG_MPTCP
++ struct tcp_sock *tp = tcp_sk(sk);
++ /* size_tk_table goes from the end of tk_table to the end of sk */
++ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
++ sizeof(tp->tk_table);
++#endif
+
+ /* we do not want to clear pinet6 field, because of RCU lookups */
+ sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
+
+ size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
++
++#ifdef CONFIG_MPTCP
++ /* We zero out only from pinet6 to tk_table */
++ size -= size_tk_table + sizeof(tp->tk_table);
++#endif
+ memset(&inet->pinet6 + 1, 0, size);
++
++#ifdef CONFIG_MPTCP
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
++#endif
++
+ }
+
+ struct proto tcpv6_prot = {
+diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
+new file mode 100644
+index 000000000000..cdfc03adabf8
+--- /dev/null
++++ b/net/mptcp/Kconfig
+@@ -0,0 +1,115 @@
++#
++# MPTCP configuration
++#
++config MPTCP
++ bool "MPTCP protocol"
++ depends on (IPV6=y || IPV6=n)
++ ---help---
++ This replaces the normal TCP stack with a Multipath TCP stack,
++ able to use several paths at once.
++
++menuconfig MPTCP_PM_ADVANCED
++ bool "MPTCP: advanced path-manager control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different path-managers. You should choose 'Y' here,
++ because otherwise you will not actively create new MPTCP-subflows.
++
++if MPTCP_PM_ADVANCED
++
++config MPTCP_FULLMESH
++ tristate "MPTCP Full-Mesh Path-Manager"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create a full-mesh among all IP-addresses.
++
++config MPTCP_NDIFFPORTS
++ tristate "MPTCP ndiff-ports"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create multiple subflows between the same
++ pair of IP-addresses, modifying the source-port. You can set the number
++ of subflows via the mptcp_ndiffports-sysctl.
++
++config MPTCP_BINDER
++ tristate "MPTCP Binder"
++ depends on (MPTCP=y)
++ ---help---
++ This path-management module works like ndiffports, and adds the sysctl
++ option to set the gateway (and/or path to) per each additional subflow
++ via Loose Source Routing (IPv4 only).
++
++choice
++ prompt "Default MPTCP Path-Manager"
++ default DEFAULT
++ help
++ Select the Path-Manager of your choice
++
++ config DEFAULT_FULLMESH
++ bool "Full mesh" if MPTCP_FULLMESH=y
++
++ config DEFAULT_NDIFFPORTS
++ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
++
++ config DEFAULT_BINDER
++ bool "binder" if MPTCP_BINDER=y
++
++ config DEFAULT_DUMMY
++ bool "Default"
++
++endchoice
++
++endif
++
++config DEFAULT_MPTCP_PM
++ string
++ default "default" if DEFAULT_DUMMY
++ default "fullmesh" if DEFAULT_FULLMESH
++ default "ndiffports" if DEFAULT_NDIFFPORTS
++ default "binder" if DEFAULT_BINDER
++ default "default"
++
++menuconfig MPTCP_SCHED_ADVANCED
++ bool "MPTCP: advanced scheduler control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different schedulers. You should choose 'Y' here,
++ if you want to choose a different scheduler than the default one.
++
++if MPTCP_SCHED_ADVANCED
++
++config MPTCP_ROUNDROBIN
++ tristate "MPTCP Round-Robin"
++ depends on (MPTCP=y)
++ ---help---
++	  This is a very simple round-robin scheduler. It probably has bad
++	  performance, but it might be interesting for researchers.
++
++choice
++ prompt "Default MPTCP Scheduler"
++ default DEFAULT
++ help
++ Select the Scheduler of your choice
++
++ config DEFAULT_SCHEDULER
++ bool "Default"
++ ---help---
++ This is the default scheduler, sending first on the subflow
++ with the lowest RTT.
++
++ config DEFAULT_ROUNDROBIN
++ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
++ ---help---
++	  This is the round-robin scheduler, sending in a round-robin
++	  fashion.
++
++endchoice
++endif
++
++config DEFAULT_MPTCP_SCHED
++ string
++ depends on (MPTCP=y)
++ default "default" if DEFAULT_SCHEDULER
++ default "roundrobin" if DEFAULT_ROUNDROBIN
++ default "default"
++
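As an aside on the `DEFAULT_MPTCP_PM` entry above: Kconfig string defaults resolve top-down, so the first `default` line whose condition holds determines the value. A minimal C sketch of that resolution order (the function and parameter names are illustrative, not part of the patch):

```c
#include <stdbool.h>

/* Illustrative only: mirrors the top-down resolution of the
 * DEFAULT_MPTCP_PM Kconfig entry, where the first matching
 * "default" line determines the resulting string.
 */
static const char *default_mptcp_pm(bool dummy, bool fullmesh,
				    bool ndiffports, bool binder)
{
	if (dummy)
		return "default";
	if (fullmesh)
		return "fullmesh";
	if (ndiffports)
		return "ndiffports";
	if (binder)
		return "binder";
	return "default";	/* final fallback */
}
```

Note that `DEFAULT_DUMMY` is checked first, so selecting it wins even if other path-managers are also built in.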
+diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
+new file mode 100644
+index 000000000000..35561a7012e3
+--- /dev/null
++++ b/net/mptcp/Makefile
+@@ -0,0 +1,20 @@
++#
++## Makefile for MultiPath TCP support code.
++#
++#
++
++obj-$(CONFIG_MPTCP) += mptcp.o
++
++mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
++ mptcp_output.o mptcp_input.o mptcp_sched.o
++
++obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
++obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
++obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
++obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
++obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
++
++mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
++
+diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
+new file mode 100644
+index 000000000000..95d8da560715
+--- /dev/null
++++ b/net/mptcp/mptcp_binder.c
+@@ -0,0 +1,487 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#include <linux/route.h>
++#include <linux/inet.h>
++#include <linux/mroute.h>
++#include <linux/spinlock_types.h>
++#include <net/inet_ecn.h>
++#include <net/route.h>
++#include <net/xfrm.h>
++#include <net/compat.h>
++#include <linux/slab.h>
++
++#define MPTCP_GW_MAX_LISTS 10
++#define MPTCP_GW_LIST_MAX_LEN 6
++#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
++ MPTCP_GW_MAX_LISTS)
++
++struct mptcp_gw_list {
++ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
++ u8 len[MPTCP_GW_MAX_LISTS];
++};
++
++struct binder_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++
++ /* Prevent multiple sub-sockets concurrently iterating over sockets */
++ spinlock_t *flow_lock;
++};
++
++static struct mptcp_gw_list *mptcp_gws;
++static rwlock_t mptcp_gws_lock;
++
++static int mptcp_binder_ndiffports __read_mostly = 1;
++
++static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
++
++static int mptcp_get_avail_list_ipv4(struct sock *sk)
++{
++ int i, j, list_taken, opt_ret, opt_len;
++ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
++
++ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
++ if (mptcp_gws->len[i] == 0)
++ goto error;
++
++ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
++ list_taken = 0;
++
++ /* Loop through all sub-sockets in this connection */
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
++
++ /* Reset length and options buffer, then retrieve
++ * from socket
++ */
++ opt_len = MAX_IPOPTLEN;
++ memset(opt, 0, MAX_IPOPTLEN);
++ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
++ IP_OPTIONS, opt, &opt_len);
++ if (opt_ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, opt_ret);
++ goto error;
++ }
++
++ /* If socket has no options, it has no stake in this list */
++ if (opt_len <= 0)
++ continue;
++
++ /* Iterate options buffer */
++ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
++ if (*opt_ptr == IPOPT_LSRR) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
++ goto sock_lsrr;
++ }
++ }
++ continue;
++
++sock_lsrr:
++ /* Pointer to the 2nd to last address */
++ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
++
++ /* Addresses start 3 bytes after type offset */
++ opt_ptr += 3;
++ j = 0;
++
++ /* Different length lists cannot be the same */
++ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
++ continue;
++
++ /* Iterate if we are still inside options list
++ * and sysctl list
++ */
++ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
++ /* If there is a different address, this list must
++ * not be set on this socket
++ */
++ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
++ break;
++
++ /* Jump 4 bytes to next address */
++ opt_ptr += 4;
++ j++;
++ }
++
++ /* Reached the end without a differing address, lists
++ * are therefore identical.
++ */
++ if (j == mptcp_gws->len[i]) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
++ list_taken = 1;
++ break;
++ }
++ }
++
++ /* Free list found if not taken by a socket */
++ if (!list_taken) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
++ break;
++ }
++ }
++
++ if (i >= MPTCP_GW_MAX_LISTS)
++ goto error;
++
++ return i;
++error:
++ return -1;
++}
++
++/* The list of addresses is parsed each time a new connection is opened,
++ * to make sure it's up to date. In case of error, all the lists are
++ * marked as unavailable and the subflow's fingerprint is set to 0.
++ */
++static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
++{
++ int i, j, ret;
++ unsigned char opt[MAX_IPOPTLEN] = {0};
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
++
++ /* Read lock: multiple sockets can read LSRR addresses at the same
++ * time, but writes are done in mutual exclusion.
++ * Spin lock: must search for free list for one socket at a time, or
++ * multiple sockets could take the same list.
++ */
++ read_lock(&mptcp_gws_lock);
++ spin_lock(fmp->flow_lock);
++
++ i = mptcp_get_avail_list_ipv4(sk);
++
++ /* Execution enters here only if a free path is found.
++ */
++ if (i >= 0) {
++ opt[0] = IPOPT_NOP;
++ opt[1] = IPOPT_LSRR;
++ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
++ (mptcp_gws->len[i] + 1) + 3;
++ opt[3] = IPOPT_MINOFF;
++ for (j = 0; j < mptcp_gws->len[i]; ++j)
++ memcpy(opt + 4 +
++ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
++ &mptcp_gws->list[i][j].s_addr,
++ sizeof(mptcp_gws->list[i][0].s_addr));
++ /* Final destination must be part of IP_OPTIONS parameter. */
++ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
++ sizeof(addr.s_addr));
++
++ /* setsockopt must be inside the lock, otherwise another
++ * subflow could fail to see that we have taken a list.
++ */
++ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
++ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
++ * (mptcp_gws->len[i] + 1));
++
++ if (ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, ret);
++ }
++ }
++
++ spin_unlock(fmp->flow_lock);
++ read_unlock(&mptcp_gws_lock);
++
++ return;
++}
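The option buffer assembled above has a fixed layout: a NOP pad byte, the LSRR type, a length byte covering the gateway addresses plus the final destination, and the minimum pointer offset. A standalone userspace sketch of the same layout (constants re-defined locally so it compiles outside the kernel; the helper name is invented):

```c
#include <stdint.h>
#include <string.h>

#define IPOPT_NOP	1
#define IPOPT_LSRR	131
#define IPOPT_MINOFF	4

/* Illustrative sketch: build an LSRR IP option holding n_gw gateway
 * addresses followed by the final destination, matching the layout
 * used by mptcp_v4_add_lsrr(). Returns the total option length.
 */
static int build_lsrr(uint8_t *opt, const uint32_t *gws, int n_gw,
		      uint32_t final_dst)
{
	int j;

	opt[0] = IPOPT_NOP;
	opt[1] = IPOPT_LSRR;
	opt[2] = 4 * (n_gw + 1) + 3;	/* type + len + ptr + addresses */
	opt[3] = IPOPT_MINOFF;
	for (j = 0; j < n_gw; j++)
		memcpy(opt + 4 + 4 * j, &gws[j], 4);
	/* The final destination must be part of the LSRR list too. */
	memcpy(opt + 4 + 4 * j, &final_dst, 4);
	return 4 + 4 * (n_gw + 1);
}
```

With two gateways the option is 16 bytes long and the length byte reads 15, exactly what the `opt[2]` computation in the patch produces.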
++
++/* Parses the gateways string for a list of paths to different
++ * gateways, and stores them for use with the Loose Source Routing (LSRR)
++ * socket option. Each list must have "," separated addresses, and the lists
++ * themselves must be separated by "-". Returns -ENOMEM on allocation failure
++ * and -1 in case one or more of the addresses is not a valid IPv4 address.
++ */
++static int mptcp_parse_gateway_ipv4(char *gateways)
++{
++ int i, j, k, ret;
++ char *tmp_string = NULL;
++ struct in_addr tmp_addr;
++
++ tmp_string = kzalloc(16, GFP_KERNEL);
++ if (tmp_string == NULL)
++ return -ENOMEM;
++
++ write_lock(&mptcp_gws_lock);
++
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++
++	/* A temporary string is used since in4_pton needs a null-terminated string
++ * but we do not want to modify the sysctl for obvious reasons.
++ * i will iterate over the SYSCTL string, j will iterate over the
++ * temporary string where each IP is copied into, k will iterate over
++ * the IPs in each list.
++ */
++ for (i = j = k = 0;
++ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
++ ++i) {
++ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
++ /* If the temp IP is empty and the current list is
++ * empty, we are done.
++ */
++ if (j == 0 && mptcp_gws->len[k] == 0)
++ break;
++
++ /* Terminate the temp IP string, then if it is
++ * non-empty parse the IP and copy it.
++ */
++ tmp_string[j] = '\0';
++ if (j > 0) {
++ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
++
++ ret = in4_pton(tmp_string, strlen(tmp_string),
++ (u8 *)&tmp_addr.s_addr, '\0',
++ NULL);
++
++ if (ret) {
++ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
++ ret,
++ &tmp_addr.s_addr);
++ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
++ &tmp_addr.s_addr,
++ sizeof(tmp_addr.s_addr));
++ mptcp_gws->len[k]++;
++ j = 0;
++ tmp_string[j] = '\0';
++					/* Since we can't impose a limit on
++ * what the user can input, make sure
++ * there are not too many IPs in the
++ * SYSCTL string.
++ */
++ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
++ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
++ k,
++ MPTCP_GW_LIST_MAX_LEN);
++ goto error;
++ }
++ } else {
++ goto error;
++ }
++ }
++
++ if (gateways[i] == '-' || gateways[i] == '\0')
++ ++k;
++ } else {
++ tmp_string[j] = gateways[i];
++ ++j;
++ }
++ }
++
++ /* Number of flows is number of gateway lists plus master flow */
++ mptcp_binder_ndiffports = k+1;
++
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++
++ return 0;
++
++error:
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++ return -1;
++}
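In other words, the sysctl accepts strings such as `10.0.0.1,10.0.0.2-192.168.1.1`: "," separates the hops inside one list, "-" separates the lists themselves. A simplified userspace sketch of the same grammar, using inet_pton instead of the kernel's in4_pton (names and bounds here are illustrative):

```c
#include <arpa/inet.h>
#include <string.h>

#define MAX_LISTS	10
#define MAX_LEN		6

struct gw_list {
	struct in_addr list[MAX_LISTS][MAX_LEN];
	int len[MAX_LISTS];
};

/* Illustrative parser for the mptcp_binder_gateways format:
 * "," separates addresses inside one list, "-" separates lists.
 * Returns the number of lists, or -1 on an invalid address.
 */
static int parse_gateways(const char *s, struct gw_list *gws)
{
	char tmp[16];
	int i, j = 0, k = 0;

	memset(gws, 0, sizeof(*gws));
	for (i = 0; ; i++) {
		char c = s[i];

		if (c == ',' || c == '-' || c == '\0') {
			/* Empty token and empty current list: end of input. */
			if (j == 0 && gws->len[k] == 0)
				break;
			tmp[j] = '\0';
			if (j > 0) {
				if (gws->len[k] >= MAX_LEN ||
				    inet_pton(AF_INET, tmp,
					      &gws->list[k][gws->len[k]]) != 1)
					return -1;
				gws->len[k]++;
				j = 0;
			}
			if (c == '-' || c == '\0') {
				k++;
				if (k >= MAX_LISTS)
					break;
			}
			if (c == '\0')
				break;
		} else if (j < (int)sizeof(tmp) - 1) {
			tmp[j++] = c;
		} else {
			return -1;	/* token too long for an IPv4 address */
		}
	}
	return k;
}
```

The example string above yields two lists: the first with two hops, the second with one, and the number of subflows becomes the list count plus the master flow, matching `mptcp_binder_ndiffports = k+1` in the patch.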
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets
++ *
++ * This function uses a goto to next_subflow to allow releasing the lock
++ * between the creation of new subflows, giving other processes a chance to
++ * do some work on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct binder_priv *pm_priv = container_of(work,
++ struct binder_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (mptcp_binder_ndiffports > iter &&
++ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
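The worker above follows a common kernel pattern: take the lock, do one unit of work, drop the lock and reschedule so other contexts can touch the socket, then re-acquire the lock and re-check the state before continuing. A hedged pthread analogue of that loop (all names invented, no relation to the actual socket locking API):

```c
#include <pthread.h>
#include <sched.h>

/* Illustrative analogue of create_subflow_worker(): add subflows up
 * to a target count, dropping the lock between iterations so other
 * threads can make progress on the shared connection state.
 */
struct conn {
	pthread_mutex_t lock;
	int subflows;
	int target;
	int dead;	/* analogue of SOCK_DEAD */
};

static void subflow_worker(struct conn *c)
{
	for (;;) {
		pthread_mutex_lock(&c->lock);
		/* Re-check state after every re-acquisition: it may
		 * have changed while the lock was dropped.
		 */
		if (c->dead || c->subflows >= c->target) {
			pthread_mutex_unlock(&c->lock);
			return;
		}
		c->subflows++;		/* "open one more subflow" */
		pthread_mutex_unlock(&c->lock);
		sched_yield();		/* analogue of cond_resched() */
	}
}
```

The re-check on every iteration is the important part: just as the kernel worker tests `SOCK_DEAD` and `fully_established` after re-locking, the sketch must not trust any state read before the lock was released.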
++
++static void binder_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
++ static DEFINE_SPINLOCK(flow_lock);
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(meta_sk)) {
++ mptcp_fallback_default(mpcb);
++ return;
++ }
++#endif
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ fmp->flow_lock = &flow_lock;
++}
++
++static void binder_create_subflows(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++/* Callback function, executed when the sysctl net.mptcp.mptcp_binder_gateways
++ * is updated. Inspired by proc_tcp_congestion_control().
++ */
++static int proc_mptcp_gateways(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ int ret;
++ ctl_table tbl = {
++ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
++ };
++
++ if (write) {
++ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
++ if (tbl.data == NULL)
++			return -ENOMEM;
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (ret == 0) {
++ ret = mptcp_parse_gateway_ipv4(tbl.data);
++ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
++ }
++ kfree(tbl.data);
++ } else {
++ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
++ }
++
++ return ret;
++}
++
++static struct mptcp_pm_ops binder __read_mostly = {
++ .new_session = binder_new_session,
++ .fully_established = binder_create_subflows,
++ .get_local_id = binder_get_local_id,
++ .init_subsocket_v4 = mptcp_v4_add_lsrr,
++ .name = "binder",
++ .owner = THIS_MODULE,
++};
++
++static struct ctl_table binder_table[] = {
++ {
++ .procname = "mptcp_binder_gateways",
++ .data = &sysctl_mptcp_binder_gateways,
++ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_mptcp_gateways
++ },
++ { }
++};
++
++struct ctl_table_header *mptcp_sysctl_binder;
++
++/* General initialization of MPTCP_PM */
++static int __init binder_register(void)
++{
++ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
++ if (!mptcp_gws)
++ return -ENOMEM;
++
++ rwlock_init(&mptcp_gws_lock);
++
++ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
++
++ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
++ binder_table);
++ if (!mptcp_sysctl_binder)
++ goto sysctl_fail;
++
++ if (mptcp_register_path_manager(&binder))
++ goto pm_failed;
++
++ return 0;
++
++pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++sysctl_fail:
++ kfree(mptcp_gws);
++
++ return -1;
++}
++
++static void binder_unregister(void)
++{
++ mptcp_unregister_path_manager(&binder);
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++ kfree(mptcp_gws);
++}
++
++module_init(binder_register);
++module_exit(binder_unregister);
++
++MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("BINDER MPTCP");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
+new file mode 100644
+index 000000000000..5d761164eb85
+--- /dev/null
++++ b/net/mptcp/mptcp_coupled.c
+@@ -0,0 +1,270 @@
++/*
++ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* Scaling is done in the numerator with alpha_scale_num and in the denominator
++ * with alpha_scale_den.
++ *
++ * To downscale, we just need to use alpha_scale.
++ *
++ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
++ */
++static int alpha_scale_den = 10;
++static int alpha_scale_num = 32;
++static int alpha_scale = 12;
++
++struct mptcp_ccc {
++ u64 alpha;
++ bool forced_update;
++};
++
++static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
++}
++
++static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
++}
++
++static inline u64 mptcp_ccc_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_ccc_recalc_alpha(const struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ const struct sock *sub_sk;
++ int best_cwnd = 0, best_rtt = 0, can_send = 0;
++ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior
++ * (set alpha to 1)
++ */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Do regular alpha-calculation for multiple subflows */
++
++ /* Find the max numerator of the alpha-calculation */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++		/* We need to look for the path that provides the max value.
++ * Integer-overflow is not possible here, because
++ * tmp will be in u64.
++ */
++ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
++
++ if (tmp >= max_numerator) {
++ max_numerator = tmp;
++ best_cwnd = sub_tp->snd_cwnd;
++ best_rtt = sub_tp->srtt_us;
++ }
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ /* Calculate the denominator */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ sum_denominator += div_u64(
++ mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_den) * best_rtt,
++ sub_tp->srtt_us);
++ }
++ sum_denominator *= sum_denominator;
++ if (unlikely(!sum_denominator)) {
++ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
++ __func__, mpcb->cnt_established);
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++			pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
++ __func__, sub_tp->mptcp->path_index,
++ sub_sk->sk_state, sub_tp->srtt_us,
++ sub_tp->snd_cwnd);
++ }
++ }
++
++ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
++
++ if (unlikely(!alpha))
++ alpha = 1;
++
++exit:
++ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
++}
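With alpha_scale_num = 32 and alpha_scale_den = 10, the recalculation above is plain 64-bit fixed-point arithmetic. A self-contained sketch for a static set of subflows (the cwnd/RTT inputs are assumed example values, not kernel state):

```c
#include <stdint.h>

#define ALPHA_SCALE_NUM	32
#define ALPHA_SCALE_DEN	10

/* Illustrative: recompute the LIA alpha for n subflows given their
 * congestion windows and smoothed RTTs (us), mirroring
 * mptcp_ccc_recalc_alpha(): alpha = (best_cwnd << num) / denom^2,
 * where denom sums (cwnd << den) * best_rtt / rtt over subflows.
 */
static uint64_t lia_alpha(const uint32_t *cwnd, const uint32_t *rtt, int n)
{
	uint64_t max_num = 0, denom = 0, tmp;
	uint32_t best_cwnd = 0, best_rtt = 0;
	int i;

	if (n <= 1)
		return 1;	/* single subflow: plain Reno behaviour */

	/* Find the subflow maximising cwnd / rtt^2 (scaled). */
	for (i = 0; i < n; i++) {
		tmp = ((uint64_t)cwnd[i] << ALPHA_SCALE_NUM) /
		      ((uint64_t)rtt[i] * rtt[i]);
		if (tmp >= max_num) {
			max_num = tmp;
			best_cwnd = cwnd[i];
			best_rtt = rtt[i];
		}
	}
	/* Sum the denominator, then square it. */
	for (i = 0; i < n; i++)
		denom += ((uint64_t)cwnd[i] << ALPHA_SCALE_DEN) *
			 best_rtt / rtt[i];
	denom *= denom;

	tmp = ((uint64_t)best_cwnd << ALPHA_SCALE_NUM) / denom;
	return tmp ? tmp : 1;
}
```

For two identical subflows (cwnd 10, RTT 100 us each) this yields alpha = 102, i.e. roughly 102/4096 of a Reno increase once downscaled by alpha_scale = 12, which is how the coupled algorithm throttles each subflow relative to plain TCP.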
++
++static void mptcp_ccc_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
++ }
++	/* If we are not doing MPTCP, behave like Reno: just return */
++}
++
++static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_LOSS)
++ mptcp_ccc_recalc_alpha(sk);
++}
++
++static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(mptcp_meta_sk(sk), 1);
++}
++
++static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_ccc_recalc_alpha(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_ccc_recalc_alpha(sk);
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ }
++
++ if (mpcb->cnt_established > 1) {
++ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
++
++		/* This may happen if, at initialization, the mpcb
++ * was not yet attached to the sock, and thus
++ * initializing alpha failed.
++ */
++ if (unlikely(!alpha))
++ alpha = 1;
++
++		snd_cwnd = (int)div_u64((u64)mptcp_ccc_scale(1, alpha_scale),
++					alpha);
++
++ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
++ * Thus, we select here the max value.
++ */
++ if (snd_cwnd < tp->snd_cwnd)
++ snd_cwnd = tp->snd_cwnd;
++ } else {
++ snd_cwnd = tp->snd_cwnd;
++ }
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_ccc_recalc_alpha(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_ccc = {
++ .init = mptcp_ccc_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_ccc_cong_avoid,
++ .cwnd_event = mptcp_ccc_cwnd_event,
++ .set_state = mptcp_ccc_set_state,
++ .owner = THIS_MODULE,
++ .name = "lia",
++};
++
++static int __init mptcp_ccc_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_ccc);
++}
++
++static void __exit mptcp_ccc_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_ccc);
++}
++
++module_init(mptcp_ccc_register);
++module_exit(mptcp_ccc_unregister);
++
++MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
+new file mode 100644
+index 000000000000..28dfa0479f5e
+--- /dev/null
++++ b/net/mptcp/mptcp_ctrl.c
+@@ -0,0 +1,2401 @@
++/*
++ * MPTCP implementation - MPTCP-control
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/inet_common.h>
++#include <net/inet6_hashtables.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/ip6_route.h>
++#include <net/mptcp_v6.h>
++#endif
++#include <net/sock.h>
++#include <net/tcp.h>
++#include <net/tcp_states.h>
++#include <net/transp_v6.h>
++#include <net/xfrm.h>
++
++#include <linux/cryptohash.h>
++#include <linux/kconfig.h>
++#include <linux/module.h>
++#include <linux/netpoll.h>
++#include <linux/list.h>
++#include <linux/jhash.h>
++#include <linux/tcp.h>
++#include <linux/net.h>
++#include <linux/in.h>
++#include <linux/random.h>
++#include <linux/inetdevice.h>
++#include <linux/workqueue.h>
++#include <linux/atomic.h>
++#include <linux/sysctl.h>
++
++static struct kmem_cache *mptcp_sock_cache __read_mostly;
++static struct kmem_cache *mptcp_cb_cache __read_mostly;
++static struct kmem_cache *mptcp_tw_cache __read_mostly;
++
++int sysctl_mptcp_enabled __read_mostly = 1;
++int sysctl_mptcp_checksum __read_mostly = 1;
++int sysctl_mptcp_debug __read_mostly;
++EXPORT_SYMBOL(sysctl_mptcp_debug);
++int sysctl_mptcp_syn_retries __read_mostly = 3;
++
++bool mptcp_init_failed __read_mostly;
++
++struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
++EXPORT_SYMBOL(mptcp_static_key);
++
++static int proc_mptcp_path_manager(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_PM_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_path_manager(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_path_manager(val);
++ return ret;
++}
++
++static int proc_mptcp_scheduler(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_SCHED_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_scheduler(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_scheduler(val);
++ return ret;
++}
++
++static struct ctl_table mptcp_table[] = {
++ {
++ .procname = "mptcp_enabled",
++ .data = &sysctl_mptcp_enabled,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_checksum",
++ .data = &sysctl_mptcp_checksum,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_debug",
++ .data = &sysctl_mptcp_debug,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_syn_retries",
++ .data = &sysctl_mptcp_syn_retries,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_path_manager",
++ .mode = 0644,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ .proc_handler = proc_mptcp_path_manager,
++ },
++ {
++ .procname = "mptcp_scheduler",
++ .mode = 0644,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ .proc_handler = proc_mptcp_scheduler,
++ },
++ { }
++};
++
++static inline u32 mptcp_hash_tk(u32 token)
++{
++ return token % MPTCP_HASH_SIZE;
++}
++
++struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++EXPORT_SYMBOL(tk_hashtable);
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* The following hash table is used to avoid collision of token */
++static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++static bool mptcp_reqsk_find_tk(const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct mptcp_request_sock *mtreqsk;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
++ &mptcp_reqsk_tk_htb[hash], hash_entry) {
++ if (token == mtreqsk->mptcp_loc_token)
++ return true;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++
++ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
++ &mptcp_reqsk_tk_htb[hash]);
++}
++
++static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++void mptcp_reqsk_destructor(struct request_sock *req)
++{
++ if (!mptcp_rsk(req)->is_sub) {
++ if (in_softirq()) {
++ mptcp_reqsk_remove_tk(req);
++ } else {
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++ }
++ } else {
++ mptcp_hash_request_remove(req);
++ }
++}
++
++static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
++ meta_tp->inside_tk_table = 1;
++}
++
++static bool mptcp_find_token(u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
++ if (token == meta_tp->mptcp_loc_token)
++ return true;
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_set_key_reqsk(struct request_sock *req,
++ const struct sk_buff *skb)
++{
++ const struct inet_request_sock *ireq = inet_rsk(req);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#endif
++ }
++
++ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
++}
++
++/* New MPTCP-connection request, prepare a new token for the meta-socket that
++ * will be created in mptcp_check_req_master(), and store the received token.
++ */
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ inet_rsk(req)->saw_mpc = 1;
++
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_reqsk(req, skb);
++ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
++ mptcp_find_token(mtreq->mptcp_loc_token));
++
++ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ mtreq->mptcp_rem_key = mopt->mptcp_key;
++}
++
++static void mptcp_set_key_sk(const struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_sock *isk = inet_sk(sk);
++
++ if (sk->sk_family == AF_INET)
++ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
++ isk->inet_daddr,
++ isk->inet_sport,
++ isk->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ isk->inet_sport,
++ isk->inet_dport);
++#endif
++
++ mptcp_key_sha1(tp->mptcp_loc_key,
++ &tp->mptcp_loc_token, NULL);
++}
++
++void mptcp_connect_init(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_sk(sk);
++ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
++ mptcp_find_token(tp->mptcp_loc_token));
++
++ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++/**
++ * This function increments the refcount of the mpcb struct.
++ * It is the responsibility of the caller to decrement when releasing
++ * the structure.
++ */
++struct sock *mptcp_hash_find(const struct net *net, const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
++ tk_table) {
++ meta_sk = (struct sock *)meta_tp;
++ if (token == meta_tp->mptcp_loc_token &&
++ net_eq(net, sock_net(meta_sk))) {
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ goto out;
++ if (unlikely(token != meta_tp->mptcp_loc_token ||
++ !net_eq(net, sock_net(meta_sk)))) {
++ sock_gen_put(meta_sk);
++ goto begin;
++ }
++ goto found;
++ }
++ }
++ /* A TCP-socket is destroyed by RCU, so it might have been recycled
++ * and put onto another hash-table list. After the lookup we may thus
++ * end up in a different list and may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++out:
++ meta_sk = NULL;
++found:
++ rcu_read_unlock();
++ return meta_sk;
++}
++
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
++{
++ /* remove from the token hashtable */
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++void mptcp_hash_remove(struct tcp_sock *meta_tp)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
++ u32 min_time = 0, last_active = 0;
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u32 elapsed;
++
++ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
++ continue;
++
++ elapsed = keepalive_time_elapsed(tp);
++
++ /* We take the one with the lowest RTT within a reasonable
++ * (meta-RTO)-timeframe
++ */
++ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
++ if (!min_time || tp->srtt_us < min_time) {
++ min_time = tp->srtt_us;
++ rttsk = sk;
++ }
++ continue;
++ }
++
++ /* Otherwise, we just take the most recently active one */
++ if (!rttsk && (!last_active || elapsed < last_active)) {
++ last_active = elapsed;
++ lastsk = sk;
++ }
++ }
++
++ if (rttsk)
++ return rttsk;
++
++ return lastsk;
++}
++EXPORT_SYMBOL(mptcp_select_ack_sock);
++
++static void mptcp_sock_def_error_report(struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (!sock_flag(sk, SOCK_DEAD))
++ mptcp_sub_close(sk, 0);
++
++ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping) {
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ meta_sk->sk_err = sk->sk_err;
++ meta_sk->sk_err_soft = sk->sk_err_soft;
++
++ if (!sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_error_report(meta_sk);
++
++ tcp_done(meta_sk);
++ }
++
++ sk->sk_err = 0;
++ return;
++}
++
++static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
++{
++ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
++ mptcp_cleanup_path_manager(mpcb);
++ mptcp_cleanup_scheduler(mpcb);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ }
++}
++
++static void mptcp_sock_destruct(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ inet_sock_destruct(sk);
++
++ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
++ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
++
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ tp->mptcp = NULL;
++
++ /* Taken when mpcb pointer was set */
++ sock_put(mptcp_meta_sk(sk));
++ mptcp_mpcb_put(tp->mpcb);
++ } else {
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct mptcp_tw *mptw;
++
++ /* The mpcb is disappearing - we can make the final
++ * update to the rcv_nxt of the time-wait-sock and remove
++ * its reference to the mpcb.
++ */
++ spin_lock_bh(&mpcb->tw_lock);
++ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
++ list_del_rcu(&mptw->list);
++ mptw->in_list = 0;
++ mptcp_mpcb_put(mpcb);
++ rcu_assign_pointer(mptw->mpcb, NULL);
++ }
++ spin_unlock_bh(&mpcb->tw_lock);
++
++ mptcp_mpcb_put(mpcb);
++
++ mptcp_debug("%s destroying meta-sk\n", __func__);
++ }
++
++ WARN_ON(!static_key_false(&mptcp_static_key));
++ /* Must be the last call, because is_meta_sk() above still needs the
++ * static key
++ */
++ static_key_slow_dec(&mptcp_static_key);
++}
++
++void mptcp_destroy_sock(struct sock *sk)
++{
++ if (is_meta_sk(sk)) {
++ struct sock *sk_it, *tmpsk;
++
++ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
++ mptcp_purge_ofo_queue(tcp_sk(sk));
++
++ /* We have to close all remaining subflows. Normally, they
++ * should all be about to get closed. But, if the kernel is
++ * forcing a closure (e.g., tcp_write_err), the subflows might
++ * not have been closed properly (as we are waiting for the
++ * DATA_ACK of the DATA_FIN).
++ */
++ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
++ /* tcp_close has already been called - we are waiting for
++ * graceful closure, or we are retransmitting fast-close on
++ * the subflow. The reset (or timeout) will kill the
++ * subflow.
++ */
++ if (tcp_sk(sk_it)->closing ||
++ tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ /* Let the delayed work run first, to prevent the time-wait state */
++ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
++ continue;
++
++ mptcp_sub_close(sk_it, 0);
++ }
++
++ mptcp_delete_synack_timer(sk);
++ } else {
++ mptcp_del_sock(sk);
++ }
++}
++
++static void mptcp_set_state(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* Meta is not yet established - wake up the application */
++ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
++ sk->sk_state == TCP_ESTABLISHED) {
++ tcp_set_state(meta_sk, TCP_ESTABLISHED);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
++ }
++ }
++
++ if (sk->sk_state == TCP_ESTABLISHED) {
++ tcp_sk(sk)->mptcp->establish_increased = 1;
++ tcp_sk(sk)->mpcb->cnt_established++;
++ }
++}
++
++void mptcp_init_congestion_control(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
++ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
++
++ /* The application didn't set a congestion control to use -
++ * fall back to the default one.
++ */
++ if (ca == &tcp_init_congestion_ops)
++ goto use_default;
++
++ /* Use the same congestion control as set by the user. If the
++ * module is not available, fall back to the default one.
++ */
++ if (!try_module_get(ca->owner)) {
++ pr_warn("%s: fallback to the system default CC\n", __func__);
++ goto use_default;
++ }
++
++ icsk->icsk_ca_ops = ca;
++ if (icsk->icsk_ca_ops->init)
++ icsk->icsk_ca_ops->init(sk);
++
++ return;
++
++use_default:
++ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
++ tcp_init_congestion_control(sk);
++}
++
++u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
++u32 mptcp_seed = 0;
++
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
++ u8 input[64];
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Initialize input with appropriate padding */
++ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
++ * is explicitly set too
++ */
++ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
++ input[8] = 0x80; /* Padding: First bit after message = 1 */
++ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
++
++ sha_init(mptcp_hashed_key);
++ sha_transform(mptcp_hashed_key, input, workspace);
++
++ for (i = 0; i < 5; i++)
++ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
++
++ if (token)
++ *token = mptcp_hashed_key[0];
++ if (idsn)
++ *idsn = *((u64 *)&mptcp_hashed_key[3]);
++}
++
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u8 input[128]; /* 2 512-bit blocks */
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Generate key xored with ipad */
++ memset(input, 0x36, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], rand_1, 4);
++ memcpy(&input[68], rand_2, 4);
++ input[72] = 0x80; /* Padding: First bit after message = 1 */
++ memset(&input[73], 0, 53);
++
++ /* Padding: Length of the message = 512 + 64 bits */
++ input[126] = 0x02;
++ input[127] = 0x40;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++
++ /* Prepare second part of hmac */
++ memset(input, 0x5C, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], hash_out, 20);
++ input[84] = 0x80;
++ memset(&input[85], 0, 41);
++
++ /* Padding: Length of the message = 512 + 160 bits */
++ input[126] = 0x02;
++ input[127] = 0xA0;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++}
++
++static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
++{
++ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
++ * ======
++ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
++ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
++ * TCP_NODELAY, TCP_CORK
++ *
++ * Socket-options handled in this function
++ * ======
++ * TCP_DEFER_ACCEPT
++ * SO_KEEPALIVE
++ *
++ * Socket-options on the todo-list
++ * ======
++ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
++ * across other devices. - what about the api-draft?
++ * SO_DEBUG
++ * SO_REUSEADDR - probably we don't care about this
++ * SO_DONTROUTE, SO_BROADCAST
++ * SO_OOBINLINE
++ * SO_LINGER
++ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
++ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
++ * SO_RXQ_OVFL
++ * TCP_COOKIE_TRANSACTIONS
++ * TCP_MAXSEG
++ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
++ * in mptcp_retransmit_timer. And we need to check what
++ * happens with the subsockets.
++ * TCP_LINGER2
++ * TCP_WINDOW_CLAMP
++ * TCP_USER_TIMEOUT
++ * TCP_MD5SIG
++ *
++ * Socket-options of no concern for the meta-socket (but for the subsocket)
++ * ======
++ * SO_PRIORITY
++ * SO_MARK
++ * TCP_CONGESTION
++ * TCP_SYNCNT
++ * TCP_QUICKACK
++ */
++
++ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
++ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ keepalive_time_when(tcp_sk(meta_sk)));
++ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(master_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(master_sk)->recverr = 0;
++}
++
++static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
++{
++ /* IP_TOS also goes to the subflow. */
++ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
++ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
++ sub_sk->sk_priority = meta_sk->sk_priority;
++ sk_dst_reset(sub_sk);
++ }
++
++ /* Inherit SO_REUSEADDR */
++ sub_sk->sk_reuse = meta_sk->sk_reuse;
++
++ /* Inherit snd/rcv-buffer locks */
++ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
++
++ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
++ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
++ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(sub_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(sub_sk)->recverr = 0;
++}
++
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ /* skb->sk may be NULL if we receive a packet immediately after the
++ * SYN/ACK + MP_CAPABLE.
++ */
++ struct sock *sk = skb->sk ? skb->sk : meta_sk;
++ int ret = 0;
++
++ skb->sk = NULL;
++
++ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
++ kfree_skb(skb);
++ return 0;
++ }
++
++ if (sk->sk_family == AF_INET)
++ ret = tcp_v4_do_rcv(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ ret = tcp_v6_do_rcv(sk, skb);
++#endif
++
++ sock_put(sk);
++ return ret;
++}
++
++struct lock_class_key meta_key;
++struct lock_class_key meta_slock_key;
++
++static void mptcp_synack_timer_handler(unsigned long data)
++{
++ struct sock *meta_sk = (struct sock *) data;
++ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
++
++ /* Only process if socket is not in use. */
++ bh_lock_sock(meta_sk);
++
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later. */
++ mptcp_reset_synack_timer(meta_sk, HZ/20);
++ goto out;
++ }
++
++ /* May happen if the queue got destroyed in mptcp_close */
++ if (!lopt)
++ goto out;
++
++ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
++ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
++
++ if (lopt->qlen)
++ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
++
++out:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++}
++
++static const struct tcp_sock_ops mptcp_meta_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = mptcp_send_fin,
++ .write_xmit = mptcp_write_xmit,
++ .send_active_reset = mptcp_send_active_reset,
++ .write_wakeup = mptcp_write_wakeup,
++ .prune_ofo_queue = mptcp_prune_ofo_queue,
++ .retransmit_timer = mptcp_retransmit_timer,
++ .time_wait = mptcp_time_wait,
++ .cleanup_rbuf = mptcp_cleanup_rbuf,
++};
++
++static const struct tcp_sock_ops mptcp_sub_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
++static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct mptcp_cb *mpcb;
++ struct sock *master_sk;
++ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
++ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
++ u64 idsn;
++
++ dst_release(meta_sk->sk_rx_dst);
++ meta_sk->sk_rx_dst = NULL;
++ /* This flag tells sock_lock_init to
++ * reclassify the lock-class of the master socket.
++ */
++ meta_tp->is_master_sk = 1;
++ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
++ meta_tp->is_master_sk = 0;
++ if (!master_sk)
++ return -ENOBUFS;
++
++ master_tp = tcp_sk(master_sk);
++ master_icsk = inet_csk(master_sk);
++
++ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
++ if (!mpcb) {
++ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
++ * All the rest is set to 0 thanks to __GFP_ZERO above.
++ */
++ atomic_set(&master_sk->sk_wmem_alloc, 1);
++ sk_free(master_sk);
++ return -ENOBUFS;
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->ipv6_mc_list = NULL;
++ newnp->ipv6_ac_list = NULL;
++ newnp->ipv6_fl_list = NULL;
++ newnp->opt = NULL;
++ newnp->pktoptions = NULL;
++ (void)xchg(&newnp->rxpmtu, NULL);
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->hop_limit = -1;
++ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
++ newnp->mc_loop = 1;
++ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
++ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
++ }
++#endif
++
++ meta_tp->mptcp = NULL;
++
++ /* Store the keys and generate the peer's token */
++ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
++ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
++
++ /* Generate Initial data-sequence-numbers */
++ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->snd_high_order[0] = idsn >> 32;
++ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
++
++ meta_tp->write_seq = (u32)idsn;
++ meta_tp->snd_sml = meta_tp->write_seq;
++ meta_tp->snd_una = meta_tp->write_seq;
++ meta_tp->snd_nxt = meta_tp->write_seq;
++ meta_tp->pushed_seq = meta_tp->write_seq;
++ meta_tp->snd_up = meta_tp->write_seq;
++
++ mpcb->mptcp_rem_key = remote_key;
++ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->rcv_high_order[0] = idsn >> 32;
++ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
++ meta_tp->copied_seq = (u32) idsn;
++ meta_tp->rcv_nxt = (u32) idsn;
++ meta_tp->rcv_wup = (u32) idsn;
++
++ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
++ meta_tp->snd_wnd = window;
++ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
++
++ meta_tp->packets_out = 0;
++ meta_icsk->icsk_probes_out = 0;
++
++ /* Set mptcp-pointers */
++ master_tp->mpcb = mpcb;
++ master_tp->meta_sk = meta_sk;
++ meta_tp->mpcb = mpcb;
++ meta_tp->meta_sk = meta_sk;
++ mpcb->meta_sk = meta_sk;
++ mpcb->master_sk = master_sk;
++
++ meta_tp->was_meta_sk = 0;
++
++ /* Initialize the queues */
++ skb_queue_head_init(&mpcb->reinject_queue);
++ skb_queue_head_init(&master_tp->out_of_order_queue);
++ tcp_prequeue_init(master_tp);
++ INIT_LIST_HEAD(&master_tp->tsq_node);
++
++ master_tp->tsq_flags = 0;
++
++ mutex_init(&mpcb->mpcb_mutex);
++
++ /* Init the accept_queue structure; we support a queue of 32 pending
++ * connections. It does not need to be huge, since we only store
++ * pending subflow creations here.
++ */
++ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
++ inet_put_port(master_sk);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ sk_free(master_sk);
++ return -ENOMEM;
++ }
++
++ /* Redefine function-pointers as the meta-sk is now fully ready */
++ static_key_slow_inc(&mptcp_static_key);
++ meta_tp->mpc = 1;
++ meta_tp->ops = &mptcp_meta_specific;
++
++ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
++ meta_sk->sk_destruct = mptcp_sock_destruct;
++
++ /* Meta-level retransmit timer */
++ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
++
++ tcp_init_xmit_timers(master_sk);
++ /* Has been set for sending out the SYN */
++ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
++
++ if (!meta_tp->inside_tk_table) {
++ /* Adding the meta_tp in the token hashtable - coming from server-side */
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++
++ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
++
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ }
++ master_tp->inside_tk_table = 0;
++
++ /* Init time-wait stuff */
++ INIT_LIST_HEAD(&mpcb->tw_list);
++ spin_lock_init(&mpcb->tw_lock);
++
++ INIT_HLIST_HEAD(&mpcb->callback_list);
++
++ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
++
++ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
++ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
++ mpcb->orig_window_clamp = meta_tp->window_clamp;
++
++ /* The meta is directly linked - set refcnt to 1 */
++ atomic_set(&mpcb->mpcb_refcnt, 1);
++
++ mptcp_init_path_manager(mpcb);
++ mptcp_init_scheduler(mpcb);
++
++ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
++ (unsigned long)meta_sk);
++
++ mptcp_debug("%s: created mpcb with token %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ return 0;
++}
++
++void mptcp_fallback_meta_sk(struct sock *meta_sk)
++{
++ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
++ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
++}
++
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
++ if (!tp->mptcp)
++ return -ENOMEM;
++
++ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
++ /* No more space for more subflows? */
++ if (!tp->mptcp->path_index) {
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ return -EPERM;
++ }
++
++ INIT_HLIST_NODE(&tp->mptcp->cb_list);
++
++ tp->mptcp->tp = tp;
++ tp->mpcb = mpcb;
++ tp->meta_sk = meta_sk;
++
++ static_key_slow_inc(&mptcp_static_key);
++ tp->mpc = 1;
++ tp->ops = &mptcp_sub_specific;
++
++ tp->mptcp->loc_id = loc_id;
++ tp->mptcp->rem_id = rem_id;
++ if (mpcb->sched_ops->init)
++ mpcb->sched_ops->init(sk);
++
++ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
++ * included in mptcp_del_sock(), because the mpcb must remain alive
++ * until the last subsocket is completely destroyed.
++ */
++ sock_hold(meta_sk);
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tp->mptcp->next = mpcb->connection_list;
++ mpcb->connection_list = tp;
++ tp->mptcp->attached = 1;
++
++ mpcb->cnt_subflows++;
++ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
++ &meta_sk->sk_rmem_alloc);
++
++ mptcp_sub_inherit_sockopts(meta_sk, sk);
++ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
++
++ /* As we successfully allocated the mptcp_tcp_sock, we have to
++ * change the function-pointers here (for sk_destruct to work correctly)
++ */
++ sk->sk_error_report = mptcp_sock_def_error_report;
++ sk->sk_data_ready = mptcp_data_ready;
++ sk->sk_write_space = mptcp_write_space;
++ sk->sk_state_change = mptcp_set_state;
++ sk->sk_destruct = mptcp_sock_destruct;
++
++ if (sk->sk_family == AF_INET)
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index,
++ &((struct inet_sock *)tp)->inet_saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &((struct inet_sock *)tp)->inet_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &sk->sk_v6_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#endif
++
++ return 0;
++}
++
++void mptcp_del_sock(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
++ struct mptcp_cb *mpcb;
++
++ if (!tp->mptcp || !tp->mptcp->attached)
++ return;
++
++ mpcb = tp->mpcb;
++ tp_prev = mpcb->connection_list;
++
++ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
++ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ sk->sk_state, is_meta_sk(sk));
++
++ if (tp_prev == tp) {
++ mpcb->connection_list = tp->mptcp->next;
++ } else {
++ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
++ if (tp_prev->mptcp->next == tp) {
++ tp_prev->mptcp->next = tp->mptcp->next;
++ break;
++ }
++ }
++ }
++ mpcb->cnt_subflows--;
++ if (tp->mptcp->establish_increased)
++ mpcb->cnt_established--;
++
++ tp->mptcp->next = NULL;
++ tp->mptcp->attached = 0;
++ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
++
++ if (!skb_queue_empty(&sk->sk_write_queue))
++ mptcp_reinject_data(sk, 0);
++
++ if (is_master_tp(tp))
++ mpcb->master_sk = NULL;
++ else if (tp->mptcp->pre_established)
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++
++ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
++}
++
++/* Updates the metasocket ULID/port data, based on the given sock.
++ * The argument sock must be the sock accessible to the application.
++ * In this function, we update the meta socket info, based on the changes
++ * in the application socket (bind, address allocation, ...)
++ */
++void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
++{
++ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
++ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
++
++ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
++}
++
++/* Clean up the receive buffer for full frames taken by the user,
++ * then send an ACK if necessary. COPIED is the number of bytes
++ * tcp_recvmsg has given to the user so far, it speeds up the
++ * calculation of whether or not we must ACK for the sake of
++ * a window update.
++ */
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk;
++ __u32 rcv_window_now = 0;
++
++ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
++ rcv_window_now = tcp_receive_window(meta_tp);
++
++ if (2 * rcv_window_now > meta_tp->window_clamp)
++ rcv_window_now = 0;
++ }
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (!mptcp_sk_can_send_ack(sk))
++ continue;
++
++ if (!inet_csk_ack_scheduled(sk))
++ goto second_part;
++ /* Delayed ACKs frequently hit locked sockets during bulk
++ * receive.
++ */
++ if (icsk->icsk_ack.blocked ||
++ /* Once-per-two-segments ACK was not sent by tcp_input.c */
++ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
++ /* If this read emptied read buffer, we send ACK, if
++ * connection is not bidirectional, user drained
++ * receive buffer and there was a small segment
++ * in queue.
++ */
++ (copied > 0 &&
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
++ !icsk->icsk_ack.pingpong)) &&
++ !atomic_read(&meta_sk->sk_rmem_alloc))) {
++ tcp_send_ack(sk);
++ continue;
++ }
++
++second_part:
++ /* This is the second part of tcp_cleanup_rbuf */
++ if (rcv_window_now) {
++ __u32 new_window = tp->ops->__select_window(sk);
++
++ /* Send an ACK now, if this read freed lots of space
++ * in our buffer. new_window is the new window; we can
++ * advertise it now, if it is not less than the
++ * current one.
++ * "Lots" means "at least twice" here.
++ */
++ if (new_window && new_window >= 2 * rcv_window_now)
++ tcp_send_ack(sk);
++ }
++ }
++}
++
++static int mptcp_sub_send_fin(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *skb = tcp_write_queue_tail(sk);
++ int mss_now;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = tcp_current_mss(sk);
++
++ if (tcp_send_head(sk) != NULL) {
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ tp->write_seq++;
++ } else {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (!skb)
++ return 1;
++
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
++ tcp_init_nondata_skb(skb, tp->write_seq,
++ TCPHDR_ACK | TCPHDR_FIN);
++ tcp_queue_skb(sk, skb);
++ }
++ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
++
++ return 0;
++}
++
++void mptcp_sub_close_wq(struct work_struct *work)
++{
++ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
++ struct sock *sk = (struct sock *)tp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ mutex_lock(&tp->mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ if (sock_flag(sk, SOCK_DEAD))
++ goto exit;
++
++ /* We come from tcp_disconnect. We are sure that meta_sk is set */
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ goto exit;
++ }
++
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&tp->mpcb->mpcb_mutex);
++ sock_put(sk);
++}
++
++void mptcp_sub_close(struct sock *sk, unsigned long delay)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
++
++ /* We are already closing - e.g., called from sock_def_error_report upon
++ * tcp_disconnect in tcp_close.
++ */
++ if (tp->closing)
++ return;
++
++ /* Work already scheduled? */
++ if (work_pending(&work->work)) {
++ /* Work present - which will run first? */
++ if (jiffies + delay > work->timer.expires)
++ return;
++
++ /* Try canceling - if it fails, work will be executed soon */
++ if (!cancel_delayed_work(work))
++ return;
++ sock_put(sk);
++ }
++
++ if (!delay) {
++ unsigned char old_state = sk->sk_state;
++
++ /* If we are in user-context we can directly do the closing
++ * procedure. No need to schedule a work-queue.
++ */
++ if (!in_softirq()) {
++ if (sock_flag(sk, SOCK_DEAD))
++ return;
++
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ return;
++ }
++
++ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
++ sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++ return;
++ }
++
++ /* We send the FIN directly, because it may take a long time
++ * until the work-queue gets scheduled...
++ *
++ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
++ * to the old state so that tcp_close will finally send the FIN
++ * in user-context.
++ */
++ if (!sk->sk_err && old_state != TCP_CLOSE &&
++ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
++ if (old_state == TCP_ESTABLISHED)
++ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
++ sk->sk_state = old_state;
++ }
++ }
++
++ sock_hold(sk);
++ queue_delayed_work(mptcp_wq, work, delay);
++}
++
++void mptcp_sub_force_close(struct sock *sk)
++{
++ /* The below tcp_done may have freed the socket, if it is already dead.
++ * Thus, we are not allowed to access it afterwards. That's why
++ * we have to store the dead-state in this local variable.
++ */
++ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
++
++ tcp_sk(sk)->mp_killed = 1;
++
++ if (sk->sk_state != TCP_CLOSE)
++ tcp_done(sk);
++
++ if (!sock_is_dead)
++ mptcp_sub_close(sk, 0);
++}
++EXPORT_SYMBOL(mptcp_sub_force_close);
++
++/* Update the mpcb send buffer, based on the contributions
++ * of each subflow
++ */
++void mptcp_update_sndbuf(const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk, *sk;
++ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ new_sndbuf += sk->sk_sndbuf;
++
++ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
++ new_sndbuf = sysctl_tcp_wmem[2];
++ break;
++ }
++ }
++ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
++
++ /* The subflow's call to sk_write_space in tcp_new_space ends up in
++ * mptcp_write_space.
++ * It has nothing to do with waking up the application.
++ * So, we do it here.
++ */
++ if (old_sndbuf != meta_sk->sk_sndbuf)
++ meta_sk->sk_write_space(meta_sk);
++}
++
++void mptcp_close(struct sock *meta_sk, long timeout)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk_it, *tmpsk;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ int data_was_unread = 0;
++ int state;
++
++ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock(meta_sk);
++
++ if (meta_tp->inside_tk_table) {
++ /* Detach the mpcb from the token hashtable */
++ mptcp_hash_remove_bh(meta_tp);
++ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
++ }
++
++ meta_sk->sk_shutdown = SHUTDOWN_MASK;
++ /* We need to flush the recv. buffs. We do this only on the
++ * descriptor close, not protocol-sourced closes, because the
++ * reader process may not have drained the data yet!
++ */
++ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
++ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
++ tcp_hdr(skb)->fin;
++ data_was_unread += len;
++ __kfree_skb(skb);
++ }
++
++ sk_mem_reclaim(meta_sk);
++
++ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
++ if (meta_sk->sk_state == TCP_CLOSE) {
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++ mptcp_sub_close(sk_it, 0);
++ }
++ goto adjudge_to_death;
++ }
++
++ if (data_was_unread) {
++ /* Unread data was tossed, zap the connection. */
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
++ meta_sk->sk_allocation);
++ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
++ /* Check zero linger _after_ checking for unread data. */
++ meta_sk->sk_prot->disconnect(meta_sk, 0);
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ } else if (tcp_close_state(meta_sk)) {
++ mptcp_send_fin(meta_sk);
++ } else if (meta_tp->snd_una == meta_tp->write_seq) {
++ /* The DATA_FIN has been sent and acknowledged
++ * (e.g., by sk_shutdown). Close all the other subflows
++ */
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ unsigned long delay = 0;
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++
++ sk_stream_wait_close(meta_sk, timeout);
++
++adjudge_to_death:
++ state = meta_sk->sk_state;
++ sock_hold(meta_sk);
++ sock_orphan(meta_sk);
++
++ /* socket will be freed after mptcp_close - we have to prevent
++ * access from the subflows.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ /* Similar to sock_orphan, but we don't set it DEAD, because
++ * the callbacks are still set and must be called.
++ */
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_set_socket(sk_it, NULL);
++ sk_it->sk_wq = NULL;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++
++ /* It is the last release_sock in its life. It will remove backlog. */
++ release_sock(meta_sk);
++
++ /* Now socket is owned by kernel and we acquire BH lock
++ * to finish close. No need to check for user refs.
++ */
++ local_bh_disable();
++ bh_lock_sock(meta_sk);
++ WARN_ON(sock_owned_by_user(meta_sk));
++
++ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
++
++ /* Have we already been destroyed by a softirq or backlog? */
++ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
++ goto out;
++
++ /* This is a (useful) BSD violation of the RFC. There is a
++ * problem with TCP as specified in that the other end could
++ * keep a socket open forever with no application left this end.
++ * We use a 3 minute timeout (about the same as BSD) then kill
++ * our end. If they send after that then tough - BUT: long enough
++ * that we won't make the old 4*rto = almost no time - whoops
++ * reset mistake.
++ *
++ * Nope, it was not mistake. It is really desired behaviour
++ * f.e. on http servers, when such sockets are useless, but
++ * consume significant resources. Let's do it with special
++ * linger2 option. --ANK
++ */
++
++ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
++ if (meta_tp->linger2 < 0) {
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONLINGER);
++ } else {
++ const int tmo = tcp_fin_time(meta_sk);
++
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ tmo - TCP_TIMEWAIT_LEN);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
++ tmo);
++ goto out;
++ }
++ }
++ }
++ if (meta_sk->sk_state != TCP_CLOSE) {
++ sk_mem_reclaim(meta_sk);
++ if (tcp_too_many_orphans(meta_sk, 0)) {
++ if (net_ratelimit())
++ pr_info("MPTCP: too many orphaned sockets\n");
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONMEMORY);
++ }
++ }
++
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ inet_csk_destroy_sock(meta_sk);
++ /* Otherwise, socket is reprieved until protocol close. */
++
++out:
++ bh_unlock_sock(meta_sk);
++ local_bh_enable();
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk); /* Taken by sock_hold */
++}
++
++void mptcp_disconnect(struct sock *sk)
++{
++ struct sock *subsk, *tmpsk;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ mptcp_delete_synack_timer(sk);
++
++ __skb_queue_purge(&tp->mpcb->reinject_queue);
++
++ if (tp->inside_tk_table) {
++ mptcp_hash_remove_bh(tp);
++ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
++ }
++
++ local_bh_disable();
++ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
++ /* The socket will get removed from the subsocket-list
++ * and made non-mptcp by setting mpc to 0.
++ *
++ * This is necessary, because tcp_disconnect assumes
++ * that the connection is completely dead afterwards.
++ * Thus we need to do a mptcp_del_sock. Due to this call
++ * we have to make it non-mptcp.
++ *
++ * We have to lock the socket, because we set mpc to 0.
++ * An incoming packet would take the subsocket's lock
++ * and go on into the receive-path.
++ * This would be a race.
++ */
++
++ bh_lock_sock(subsk);
++ mptcp_del_sock(subsk);
++ tcp_sk(subsk)->mpc = 0;
++ tcp_sk(subsk)->ops = &tcp_specific;
++ mptcp_sub_force_close(subsk);
++ bh_unlock_sock(subsk);
++ }
++ local_bh_enable();
++
++ tp->was_meta_sk = 1;
++ tp->mpc = 0;
++ tp->ops = &tcp_specific;
++}
++
++
++/* Returns 1 if we should enable MPTCP for that socket. */
++int mptcp_doit(struct sock *sk)
++{
++ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return 0;
++
++ /* Socket may already be established (e.g., called from tcp_recvmsg) */
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
++ return 1;
++
++ /* Don't do mptcp over loopback */
++ if (sk->sk_family == AF_INET &&
++ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
++ return 0;
++#if IS_ENABLED(CONFIG_IPV6)
++ if (sk->sk_family == AF_INET6 &&
++ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
++ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
++ return 0;
++#endif
++ if (mptcp_v6_is_v4_mapped(sk) &&
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
++ return 0;
++
++#ifdef CONFIG_TCP_MD5SIG
++ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
++ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
++ return 0;
++#endif
++
++ return 1;
++}
++
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct tcp_sock *master_tp;
++ struct sock *master_sk;
++
++ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
++ goto err_alloc_mpcb;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++ master_tp = tcp_sk(master_sk);
++
++ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
++ goto err_add_sock;
++
++ if (__inet_inherit_port(meta_sk, master_sk) < 0)
++ goto err_add_sock;
++
++ meta_sk->sk_prot->unhash(meta_sk);
++
++ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
++ __inet_hash_nolisten(master_sk, NULL);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ __inet6_hash(master_sk, NULL);
++#endif
++
++ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
++
++ return 0;
++
++err_add_sock:
++ mptcp_fallback_meta_sk(meta_sk);
++
++ inet_csk_prepare_forced_close(master_sk);
++ tcp_done(master_sk);
++ inet_csk_prepare_forced_close(meta_sk);
++ tcp_done(meta_sk);
++
++err_alloc_mpcb:
++ return -ENOBUFS;
++}
++
++static int __mptcp_check_req_master(struct sock *child,
++ struct request_sock *req)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct sock *meta_sk = child;
++ struct mptcp_cb *mpcb;
++ struct mptcp_request_sock *mtreq;
++
++ /* Never contained an MP_CAPABLE */
++ if (!inet_rsk(req)->mptcp_rqsk)
++ return 1;
++
++ if (!inet_rsk(req)->saw_mpc) {
++ /* Fall back to regular TCP, because we saw one SYN without
++ * MP_CAPABLE. In tcp_check_req we continue the regular path.
++ * But, the socket has been added to the reqsk_tk_htb, so we
++ * must still remove it.
++ */
++ mptcp_reqsk_remove_tk(req);
++ return 1;
++ }
++
++ /* Just set these values to pass them to mptcp_alloc_mpcb */
++ mtreq = mptcp_rsk(req);
++ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
++ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
++
++ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
++ child_tp->snd_wnd))
++ return -ENOBUFS;
++
++ child = tcp_sk(child)->mpcb->master_sk;
++ child_tp = tcp_sk(child);
++ mpcb = child_tp->mpcb;
++
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++
++ mpcb->dss_csum = mtreq->dss_csum;
++ mpcb->server_side = 1;
++
++ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
++ mptcp_update_metasocket(child, meta_sk);
++
++ /* Needs to be done here additionally, because when accepting a
++ * new connection we pass by __reqsk_free and not reqsk_free.
++ */
++ mptcp_reqsk_remove_tk(req);
++
++ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
++ sock_put(meta_sk);
++
++ return 0;
++}
++
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
++{
++ struct sock *meta_sk = child, *master_sk;
++ struct sk_buff *skb;
++ u32 new_mapping;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++
++ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
++ * pre-MPTCP data in the receive queue.
++ */
++ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
++ tcp_rsk(req)->rcv_isn - 1;
++
++ /* Map subflow sequence number to data sequence numbers. We need to map
++ * these data to [IDSN - len - 1, IDSN).
++ */
++ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
++
++ /* There should be only one skb: the SYN + data. */
++ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* With fastopen we change the semantics of the relative subflow
++ * sequence numbers to deal with middleboxes that could add/remove
++ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
++ * instead of the regular TCP ISN.
++ */
++ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
++
++ /* We need to update copied_seq of the master_sk to account for the
++ * already moved data to the meta receive queue.
++ */
++ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
++
++ /* Handled by the master_sk */
++ tcp_sk(meta_sk)->fastopen_rsk = NULL;
++
++ return 0;
++}
++
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ struct sock *meta_sk = child;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ inet_csk_reqsk_queue_removed(sk, req);
++ inet_csk_reqsk_queue_add(sk, req, meta_sk);
++
++ return 0;
++}
++
++struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ u8 hash_mac_check[20];
++
++ child_tp->inside_tk_table = 0;
++
++ if (!mopt->join_ack)
++ goto teardown;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mtreq->mptcp_rem_nonce,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++
++ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
++ goto teardown;
++
++ /* Point it to the same struct socket and wq as the meta_sk */
++ sk_set_socket(child, meta_sk->sk_socket);
++ child->sk_wq = meta_sk->sk_wq;
++
++ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
++ /* Has been inherited, but now child_tp->mptcp is NULL */
++ child_tp->mpc = 0;
++ child_tp->ops = &tcp_specific;
++
++ /* TODO when we support acking the third ack for new subflows,
++ * we should silently discard this third ack, by returning NULL.
++ *
++ * Maybe, at the retransmission we will have enough memory to
++ * fully add the socket to the meta-sk.
++ */
++ goto teardown;
++ }
++
++ /* The child is a clone of the meta socket, we must now reset
++ * some of the fields
++ */
++ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
++
++ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
++ * use the original values instead of the bloated up ones from the
++ * clone.
++ */
++ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
++ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
++
++ child_tp->mptcp->slave_sk = 1;
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
++
++ child_tp->tsq_flags = 0;
++
++ /* Subflows do not use the accept queue, as they
++ * are attached immediately to the mpcb.
++ */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ return child;
++
++teardown:
++ /* Drop this request - sock creation failed. */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ inet_csk_prepare_forced_close(child);
++ tcp_done(child);
++ return meta_sk;
++}
++
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
++{
++ struct mptcp_tw *mptw;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ /* A subsocket in tw can only receive data. So, if we are in
++ * infinite-receive, then we should not reply with a data-ack or act
++ * upon general MPTCP-signaling. We prevent this by simply not creating
++ * the mptcp_tw_sock.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tw->mptcp_tw = NULL;
++ return 0;
++ }
++
++ /* Alloc MPTCP-tw-sock */
++ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
++ if (!mptw)
++ return -ENOBUFS;
++
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tw->mptcp_tw = mptw;
++ mptw->loc_key = mpcb->mptcp_loc_key;
++ mptw->meta_tw = mpcb->in_time_wait;
++ if (mptw->meta_tw) {
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
++ if (mpcb->mptw_state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_assign_pointer(mptw->mpcb, mpcb);
++
++ spin_lock(&mpcb->tw_lock);
++ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
++ mptw->in_list = 1;
++ spin_unlock(&mpcb->tw_lock);
++
++ return 0;
++}
++
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
++{
++ struct mptcp_cb *mpcb;
++
++ rcu_read_lock();
++ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
++
++ /* If we are still holding a ref to the mpcb, we have to remove ourselves
++ * from the list and drop the ref properly.
++ */
++ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
++ spin_lock(&mpcb->tw_lock);
++ if (tw->mptcp_tw->in_list) {
++ list_del_rcu(&tw->mptcp_tw->list);
++ tw->mptcp_tw->in_list = 0;
++ }
++ spin_unlock(&mpcb->tw_lock);
++
++ /* Twice, because we increased it above */
++ mptcp_mpcb_put(mpcb);
++ mptcp_mpcb_put(mpcb);
++ }
++
++ rcu_read_unlock();
++
++ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
++}
++
++/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
++ * data-fin.
++ */
++void mptcp_time_wait(struct sock *sk, int state, int timeo)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_tw *mptw;
++
++ /* Used for sockets that go into tw after the meta
++ * (see mptcp_init_tw_sock())
++ */
++ tp->mpcb->in_time_wait = 1;
++ tp->mpcb->mptw_state = state;
++
++ /* Update the time-wait-sock's information */
++ rcu_read_lock_bh();
++ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
++ mptw->meta_tw = 1;
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
++
++ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
++ * pretend the DATA_FIN has already reached us, so that the
++ * checks in tcp_timewait_state_process succeed when the
++ * DATA_FIN comes in.
++ */
++ if (state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_read_unlock_bh();
++
++ tcp_done(sk);
++}
++
++void mptcp_tsq_flags(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* It will be handled as a regular deferred-call */
++ if (is_meta_sk(sk))
++ return;
++
++ if (hlist_unhashed(&tp->mptcp->cb_list)) {
++ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
++ /* We need to hold it here, as the sock_hold is not assured
++ * by the release_sock as it is done in regular TCP.
++ *
++ * The subsocket may get inet_csk_destroy'd while it is inside
++ * the callback_list.
++ */
++ sock_hold(sk);
++ }
++
++ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
++ sock_hold(meta_sk);
++}
++
++void mptcp_tsq_sub_deferred(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_tcp_sock *mptcp;
++ struct hlist_node *tmp;
++
++ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
++
++ __sock_put(meta_sk);
++ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
++ struct tcp_sock *tp = mptcp->tp;
++ struct sock *sk = (struct sock *)tp;
++
++ hlist_del_init(&mptcp->cb_list);
++ sk->sk_prot->release_cb(sk);
++ /* Final sock_put (cf. mptcp_tsq_flags) */
++ sock_put(sk);
++ }
++}
++
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_options_received mopt;
++ u8 mptcp_hash_mac[20];
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mtreq = mptcp_rsk(req);
++ mtreq->mptcp_mpcb = mpcb;
++ mtreq->is_sub = 1;
++ inet_rsk(req)->mptcp_rqsk = 1;
++
++ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
++ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
++
++ mtreq->rem_id = mopt.rem_id;
++ mtreq->rcv_low_prio = mopt.low_prio;
++ inet_rsk(req)->saw_mpc = 1;
++}
++
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ struct mptcp_request_sock *mreq = mptcp_rsk(req);
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mreq->is_sub = 0;
++ inet_rsk(req)->mptcp_rqsk = 1;
++ mreq->dss_csum = mopt.dss_csum;
++ mreq->hash_entry.pprev = NULL;
++
++ mptcp_reqsk_new_mptcp(req, &mopt, skb);
++}
++
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false;
++
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb,
++ mptcp_request_sock_ops.slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ if (mopt.is_mp_join)
++ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
++ if (mopt.drop_me)
++ goto drop;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
++ mopt.saw_mpc = 0;
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (skb_rtable(skb)->rt_flags &
++ (RTCF_BROADCAST | RTCF_MULTICAST))
++ goto drop;
++
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_request_sock_ipv4_ops,
++ sk, skb);
++ }
++
++ return tcp_v4_conn_request(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (!ipv6_unicast_destination(skb))
++ goto drop;
++
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_request_sock_ipv6_ops,
++ sk, skb);
++ }
++
++ return tcp_v6_conn_request(sk, skb);
++#endif
++ }
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++
++struct workqueue_struct *mptcp_wq;
++EXPORT_SYMBOL(mptcp_wq);
++
++/* Output /proc/net/mptcp */
++static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
++{
++ struct tcp_sock *meta_tp;
++ const struct net *net = seq->private;
++ int i, n = 0;
++
++ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
++ seq_putc(seq, '\n');
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ struct hlist_nulls_node *node;
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node,
++ &tk_hashtable[i], tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp;
++ struct inet_sock *isk = inet_sk(meta_sk);
++
++ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
++ continue;
++
++ if (capable(CAP_NET_ADMIN)) {
++ seq_printf(seq, "%4d: %04X %04X ", n++,
++ mpcb->mptcp_loc_token,
++ mpcb->mptcp_rem_token);
++ } else {
++ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
++ }
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
++ isk->inet_rcv_saddr,
++ ntohs(isk->inet_sport),
++ isk->inet_daddr,
++ ntohs(isk->inet_dport));
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
++ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
++ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
++ src->s6_addr32[0], src->s6_addr32[1],
++ src->s6_addr32[2], src->s6_addr32[3],
++ ntohs(isk->inet_sport),
++ dst->s6_addr32[0], dst->s6_addr32[1],
++ dst->s6_addr32[2], dst->s6_addr32[3],
++ ntohs(isk->inet_dport));
++#endif
++ }
++ seq_printf(seq, " %02X %02X %08X:%08X %lu",
++ meta_sk->sk_state, mpcb->cnt_subflows,
++ meta_tp->write_seq - meta_tp->snd_una,
++ max_t(int, meta_tp->rcv_nxt -
++ meta_tp->copied_seq, 0),
++ sock_i_ino(meta_sk));
++ seq_putc(seq, '\n');
++ }
++
++ rcu_read_unlock_bh();
++ }
++
++ return 0;
++}
++
++static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_pm_seq_show);
++}
++
++static const struct file_operations mptcp_pm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_pm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_pm_init_net(struct net *net)
++{
++ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
++ return -ENOMEM;
++
++ return 0;
++}
++
++static void mptcp_pm_exit_net(struct net *net)
++{
++ remove_proc_entry("mptcp", net->proc_net);
++}
++
++static struct pernet_operations mptcp_pm_proc_ops = {
++ .init = mptcp_pm_init_net,
++ .exit = mptcp_pm_exit_net,
++};
++
++/* General initialization of mptcp */
++void __init mptcp_init(void)
++{
++ int i;
++ struct ctl_table_header *mptcp_sysctl;
++
++ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
++ sizeof(struct mptcp_tcp_sock),
++ 0, SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_sock_cache)
++ goto mptcp_sock_cache_failed;
++
++ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_cb_cache)
++ goto mptcp_cb_cache_failed;
++
++ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_tw_cache)
++ goto mptcp_tw_cache_failed;
++
++ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
++
++ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
++ if (!mptcp_wq)
++ goto alloc_workqueue_failed;
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
++ i + MPTCP_REQSK_NULLS_BASE);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
++ }
++
++ spin_lock_init(&mptcp_reqsk_hlock);
++ spin_lock_init(&mptcp_tk_hashlock);
++
++ if (register_pernet_subsys(&mptcp_pm_proc_ops))
++ goto pernet_failed;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (mptcp_pm_v6_init())
++ goto mptcp_pm_v6_failed;
++#endif
++ if (mptcp_pm_v4_init())
++ goto mptcp_pm_v4_failed;
++
++ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
++ if (!mptcp_sysctl)
++ goto register_sysctl_failed;
++
++ if (mptcp_register_path_manager(&mptcp_pm_default))
++ goto register_pm_failed;
++
++ if (mptcp_register_scheduler(&mptcp_sched_default))
++ goto register_sched_failed;
++
++ pr_info("MPTCP: Stable release v0.89.0-rc");
++
++ mptcp_init_failed = false;
++
++ return;
++
++register_sched_failed:
++ mptcp_unregister_path_manager(&mptcp_pm_default);
++register_pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl);
++register_sysctl_failed:
++ mptcp_pm_v4_undo();
++mptcp_pm_v4_failed:
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_pm_v6_undo();
++mptcp_pm_v6_failed:
++#endif
++ unregister_pernet_subsys(&mptcp_pm_proc_ops);
++pernet_failed:
++ destroy_workqueue(mptcp_wq);
++alloc_workqueue_failed:
++ kmem_cache_destroy(mptcp_tw_cache);
++mptcp_tw_cache_failed:
++ kmem_cache_destroy(mptcp_cb_cache);
++mptcp_cb_cache_failed:
++ kmem_cache_destroy(mptcp_sock_cache);
++mptcp_sock_cache_failed:
++ mptcp_init_failed = true;
++}
+diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
+new file mode 100644
+index 000000000000..3a54413ce25b
+--- /dev/null
++++ b/net/mptcp/mptcp_fullmesh.c
+@@ -0,0 +1,1722 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#include <net/addrconf.h>
++#endif
++
++enum {
++ MPTCP_EVENT_ADD = 1,
++ MPTCP_EVENT_DEL,
++ MPTCP_EVENT_MOD,
++};
++
++#define MPTCP_SUBFLOW_RETRY_DELAY 1000
++
++/* Max number of local or remote addresses we can store.
++ * When changing, see the bitfield below in fullmesh_rem4/6.
++ */
++#define MPTCP_MAX_ADDR 8
++
++struct fullmesh_rem4 {
++ u8 rem4_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct fullmesh_rem6 {
++ u8 rem6_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_loc_addr {
++ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
++ u8 loc4_bits;
++ u8 next_v4_index;
++
++ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
++ u8 loc6_bits;
++ u8 next_v6_index;
++};
++
++struct mptcp_addr_event {
++ struct list_head list;
++ unsigned short family;
++ u8 code:7,
++ low_prio:1;
++ union inet_addr addr;
++};
++
++struct fullmesh_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++ /* Delayed worker, when the routing-tables are not yet ready. */
++ struct delayed_work subflow_retry_work;
++
++ /* Remote addresses */
++ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
++ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
++
++ struct mptcp_cb *mpcb;
++
++ u16 remove_addrs; /* Addresses to remove */
++ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
++ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
++
++ u8 add_addr; /* Are we sending an add_addr? */
++
++ u8 rem4_bits;
++ u8 rem6_bits;
++};
++
++struct mptcp_fm_ns {
++ struct mptcp_loc_addr __rcu *local;
++ spinlock_t local_lock; /* Protecting the above pointer */
++ struct list_head events;
++ struct delayed_work address_worker;
++
++ struct net *net;
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly;
++
++static void full_mesh_create_subflows(struct sock *meta_sk);
++
++static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
++{
++ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
++}
++
++static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
++{
++ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
++}
++
++/* Find the first free index in the bitfield */
++static int __mptcp_find_free_index(u8 bitfield, u8 base)
++{
++ int i;
++
++ /* There are no free bits anyway... */
++ if (bitfield == 0xff)
++ goto exit;
++
++ i = ffs(~(bitfield >> base)) - 1;
++ if (i < 0)
++ goto exit;
++
++ /* No free bits when starting at base, try from 0 on */
++ if (i + base >= sizeof(bitfield) * 8)
++ return __mptcp_find_free_index(bitfield, 0);
++
++ return i + base;
++exit:
++ return -1;
++}
++
++static int mptcp_find_free_index(u8 bitfield)
++{
++ return __mptcp_find_free_index(bitfield, 0);
++}
++
++static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
++ const struct in_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem4 *rem4;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is already in the list --- continue */
++ if (rem4->rem4_id == id &&
++ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
++ return;
++
++ /* This may be the case when the peer is behind a NAT. It is
++ * trying to JOIN, thus sending the JOIN with a certain ID.
++ * However, the src_addr of the IP-packet has been changed. We
++ * update the addr in the list, because this is the address as
++ * OUR BOX sees it.
++ */
++ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
++ __func__, &rem4->addr.s_addr,
++ &addr->s_addr, id);
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem4_bits);
++ /* Do we have already the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
++ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
++ return;
++ }
++
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is not known yet, store it */
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ rem4->bitfield = 0;
++ rem4->retry_bitfield = 0;
++ rem4->rem4_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem4_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem6 *rem6;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is already in the list --- nothing to do */
++ if (rem6->rem6_id == id &&
++ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
++ return;
++
++ /* This can happen when the peer is behind a NAT: it sends the JOIN
++ * with a certain ID, but the src_addr of the IP packet has been
++ * rewritten on the way. Update the address in the list, because this
++ * is the address as OUR BOX sees it.
++ */
++ if (rem6->rem6_id == id) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
++ __func__, &rem6->addr, addr, id);
++ rem6->addr = *addr;
++ rem6->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem6_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
++ __func__, MPTCP_MAX_ADDR, addr);
++ return;
++ }
++
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is not known yet, store it */
++ rem6->addr = *addr;
++ rem6->port = port;
++ rem6->bitfield = 0;
++ rem6->retry_bitfield = 0;
++ rem6->rem6_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem6_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].rem4_id == id) {
++ /* remove address from bitfield */
++ fmp->rem4_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (fmp->remaddr6[i].rem6_id == id) {
++ /* remove address from bitfield */
++ fmp->rem6_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
++ const struct in_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
++ fmp->remaddr4[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
++ fmp->remaddr6[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
++ else
++ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
++}
++
++static void retry_subflow_worker(struct work_struct *work)
++{
++ struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct fullmesh_priv *fmp = container_of(delayed_work,
++ struct fullmesh_priv,
++ subflow_retry_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, i;
++
++ /* We need a local (stable) copy of the address list. It is not a big
++ * deal if the address list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
++
++ /* Do we need to retry establishing a subflow? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
++ goto next_subflow;
++ }
++ }
++#endif
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets()
++ *
++ * This function uses a goto to the next_subflow label so the lock can be
++ * released between new subflows, giving other processes a chance to do
++ * some work on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, retry = 0;
++ int i;
++
++ /* We need a local (stable) copy of the address list. It is not a big
++ * deal if the address list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr4[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
++ &rem4) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr6[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
++ &rem6) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++#endif
++
++ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
++ sock_hold(meta_sk);
++ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
++ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
++ }
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct sock *sk = mptcp_select_ack_sock(meta_sk);
++
++ fmp->remove_addrs |= (1 << addr_id);
++ mpcb->addr_signal = 1;
++
++ if (sk)
++ tcp_send_ack(sk);
++}
++
++static void update_addr_bitfields(struct sock *meta_sk,
++ const struct mptcp_loc_addr *mptcp_local)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ int i;
++
++ /* The bits in announced_addrs_* always match loc*_bits, so a simple
++ * & operation unsets the correct bits as they go from announced to
++ * non-announced.
++ */
++ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
++ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
++ }
++
++ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
++ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
++ }
++}
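The `&=` masking used above relies on the announced bitfields being a subset of the local-address bitfields: ANDing with the current `loc*_bits` clears exactly the bits of addresses that have gone away. A minimal illustration (the function name and values are invented for the example):

```c
#include <stdint.h>

/* Illustration of the masking in update_addr_bitfields(): local
 * addresses are tracked as bits, and ANDing an announced-bitfield with
 * the current local-address bitfield drops exactly the addresses that
 * have vanished while leaving the others untouched. */
uint8_t prune_announced(uint8_t announced, uint8_t loc_bits)
{
	return announced & loc_bits;
}
```

For example, if addresses 0, 1 and 3 were announced (`0x0b`) and address 1 has since been removed locally (`loc_bits == 0x09`), only bit 1 is cleared.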
++
++static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
++ sa_family_t family, const union inet_addr *addr)
++{
++ int i;
++ u8 loc_bits;
++ bool found = false;
++
++ if (family == AF_INET)
++ loc_bits = mptcp_local->loc4_bits;
++ else
++ loc_bits = mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(loc_bits, i) {
++ if (family == AF_INET &&
++ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
++ found = true;
++ break;
++ }
++ if (family == AF_INET6 &&
++ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
++ &addr->in6)) {
++ found = true;
++ break;
++ }
++ }
++
++ if (!found)
++ return -1;
++
++ return i;
++}
++
++static void mptcp_address_worker(struct work_struct *work)
++{
++ const struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
++ struct mptcp_fm_ns,
++ address_worker);
++ struct net *net = fm_ns->net;
++ struct mptcp_addr_event *event = NULL;
++ struct mptcp_loc_addr *mptcp_local, *old;
++ int i, id = -1; /* id is used in the socket-code on a delete-event */
++ bool success; /* Used to indicate if we succeeded handling the event */
++
++next_event:
++ success = false;
++ kfree(event);
++
++ /* First, let's dequeue an event from our event-list */
++ rcu_read_lock_bh();
++ spin_lock(&fm_ns->local_lock);
++
++ event = list_first_entry_or_null(&fm_ns->events,
++ struct mptcp_addr_event, list);
++ if (!event) {
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++ return;
++ }
++
++ list_del(&event->list);
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
++
++ /* Not in the list - so we don't care */
++ if (id < 0) {
++ mptcp_debug("%s could not find id\n", __func__);
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET)
++ mptcp_local->loc4_bits &= ~(1 << id);
++ else
++ mptcp_local->loc6_bits &= ~(1 << id);
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ } else {
++ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
++ int j = i;
++
++ if (j < 0) {
++ /* Not in the list, so we have to find an empty slot */
++ if (event->family == AF_INET)
++ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
++ mptcp_local->next_v4_index);
++ if (event->family == AF_INET6)
++ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
++ mptcp_local->next_v6_index);
++
++ if (i < 0) {
++ mptcp_debug("%s no more space\n", __func__);
++ goto duno;
++ }
++
++ /* It might have been a MOD-event. */
++ event->code = MPTCP_EVENT_ADD;
++ } else {
++ /* Let's check if anything changes */
++ if (event->family == AF_INET &&
++ event->low_prio == mptcp_local->locaddr4[i].low_prio)
++ goto duno;
++
++ if (event->family == AF_INET6 &&
++ event->low_prio == mptcp_local->locaddr6[i].low_prio)
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET) {
++ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
++ mptcp_local->locaddr4[i].loc4_id = i + 1;
++ mptcp_local->locaddr4[i].low_prio = event->low_prio;
++ } else {
++ mptcp_local->locaddr6[i].addr = event->addr.in6;
++ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
++ mptcp_local->locaddr6[i].low_prio = event->low_prio;
++ }
++
++ if (j < 0) {
++ if (event->family == AF_INET) {
++ mptcp_local->loc4_bits |= (1 << i);
++ mptcp_local->next_v4_index = i + 1;
++ } else {
++ mptcp_local->loc6_bits |= (1 << i);
++ mptcp_local->next_v6_index = i + 1;
++ }
++ }
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ }
++ success = true;
++
++duno:
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++
++ if (!success)
++ goto next_event;
++
++ /* Now we iterate over the MPTCP-sockets and apply the event. */
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ const struct hlist_nulls_node *node;
++ struct tcp_sock *meta_tp;
++
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
++ tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ if (sock_net(meta_sk) != net)
++ continue;
++
++ if (meta_v4) {
++ /* skip IPv6 events if meta is IPv4 */
++ if (event->family == AF_INET6)
++ continue;
++ }
++ /* skip IPv4 events if IPV6_V6ONLY is set */
++ else if (event->family == AF_INET &&
++ inet6_sk(meta_sk)->ipv6only)
++ continue;
++
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ continue;
++
++ bh_lock_sock(meta_sk);
++
++ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
++ mpcb->infinite_mapping_snd ||
++ mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping)
++ goto next;
++
++ /* The path manager may have changed in the meantime */
++ if (mpcb->pm_ops != &full_mesh)
++ goto next;
++
++ if (sock_owned_by_user(meta_sk)) {
++ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
++ &meta_tp->tsq_flags))
++ sock_hold(meta_sk);
++
++ goto next;
++ }
++
++ if (event->code == MPTCP_EVENT_ADD) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++
++ full_mesh_create_subflows(meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ struct sock *sk, *tmpsk;
++ struct mptcp_loc_addr *mptcp_local;
++ bool found = false;
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ /* In any case, we need to update our bitfields */
++ if (id >= 0)
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ /* Look for the socket and remove it */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ if ((event->family == AF_INET6 &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))) ||
++ (event->family == AF_INET &&
++ (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))))
++ continue;
++
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
++ continue;
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
++ continue;
++
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ /* We announce the removal of this id */
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
++
++ mptcp_sub_force_close(sk);
++ found = true;
++ }
++
++ if (found)
++ goto next;
++
++ /* The id may have been given by the event,
++ * matching on a local address. And it may not
++ * have matched on one of the above sockets,
++ * because the client never created a subflow.
++ * So, we have to finally remove it here.
++ */
++ if (id > 0)
++ announce_remove_addr(id, meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_MOD) {
++ struct sock *sk;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++ }
++ }
++next:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++ }
++ rcu_read_unlock_bh();
++ }
++ goto next_event;
++}
++
++static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
++ const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ list_for_each_entry(eventq, &fm_ns->events, list) {
++ if (eventq->family != event->family)
++ continue;
++ if (event->family == AF_INET) {
++ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
++ return eventq;
++ } else {
++ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
++ return eventq;
++ }
++ }
++ return NULL;
++}
++
++/* We already hold the net-namespace MPTCP-lock */
++static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ if (eventq) {
++ switch (event->code) {
++ case MPTCP_EVENT_DEL:
++ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
++ list_del(&eventq->list);
++ kfree(eventq);
++ break;
++ case MPTCP_EVENT_ADD:
++ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_ADD;
++ return;
++ case MPTCP_EVENT_MOD:
++ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_MOD;
++ return;
++ }
++ }
++
++ /* OK, we have to add the new address to the wait queue */
++ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
++ if (!eventq)
++ return;
++
++ list_add_tail(&eventq->list, &fm_ns->events);
++
++ /* Schedule the address worker if it is not already pending */
++ if (!delayed_work_pending(&fm_ns->address_worker))
++ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
++ msecs_to_jiffies(500));
++}
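`add_pm_event()` coalesces events: at most one pending event exists per address, and a newer event for the same address overwrites or cancels the queued one instead of piling up. A toy userspace model of that behavior (the fixed-size array and all names are inventions for this sketch; the ordering subtlety of the real list, where a DEL is re-queued at the tail, is ignored):

```c
#include <stdint.h>

/* Toy model of the coalescing in add_pm_event(): one slot per address,
 * and posting a new event for an already-queued address simply replaces
 * the stored event code. */
enum ev_code { EV_NONE, EV_ADD, EV_DEL, EV_MOD };

struct ev {
	uint32_t addr;
	enum ev_code code;
};

#define QLEN 8
static struct ev evq[QLEN];	/* zero-initialized: all slots EV_NONE */

/* Return the queued event code for an address, or EV_NONE */
enum ev_code queued_code(uint32_t addr)
{
	for (int i = 0; i < QLEN; i++)
		if (evq[i].code != EV_NONE && evq[i].addr == addr)
			return evq[i].code;
	return EV_NONE;
}

void post_event(uint32_t addr, enum ev_code code)
{
	/* Coalesce with an existing event for the same address */
	for (int i = 0; i < QLEN; i++) {
		if (evq[i].code != EV_NONE && evq[i].addr == addr) {
			evq[i].code = code;
			return;
		}
	}
	/* Otherwise take the first free slot (drop the event if full) */
	for (int i = 0; i < QLEN; i++) {
		if (evq[i].code == EV_NONE) {
			evq[i].addr = addr;
			evq[i].code = code;
			return;
		}
	}
}
```

Posting ADD then DEL for the same address leaves a single queued DEL, which is the property the kernel code needs so the address worker never replays stale state.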
++
++static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->ifa_dev->dev;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->ifa_scope > RT_SCOPE_LINK ||
++ ipv4_is_loopback(ifa->ifa_local))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET;
++ mpevent.addr.in.s_addr = ifa->ifa_local;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
++ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv4-addr add/rem-events */
++static int mptcp_pm_inetaddr_event(struct notifier_block *this,
++ unsigned long event, void *ptr)
++{
++ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
++ struct net *net = dev_net(ifa->ifa_dev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ addr4_event_handler(ifa, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_inetaddr_notifier = {
++ .notifier_call = mptcp_pm_inetaddr_event,
++};
++
++#if IS_ENABLED(CONFIG_IPV6)
++
++/* IPV6-related address/interface watchers */
++struct mptcp_dad_data {
++ struct timer_list timer;
++ struct inet6_ifaddr *ifa;
++};
++
++static void dad_callback(unsigned long arg);
++static int inet6_addr_event(struct notifier_block *this,
++ unsigned long event, void *ptr);
++
++static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++{
++ return (ifa->flags & IFA_F_TENTATIVE) &&
++ ifa->state == INET6_IFADDR_STATE_DAD;
++}
++
++static void dad_init_timer(struct mptcp_dad_data *data,
++ struct inet6_ifaddr *ifa)
++{
++ data->ifa = ifa;
++ data->timer.data = (unsigned long)data;
++ data->timer.function = dad_callback;
++ if (ifa->idev->cnf.rtr_solicit_delay)
++ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
++ else
++ data->timer.expires = jiffies + (HZ/10);
++}
++
++static void dad_callback(unsigned long arg)
++{
++ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
++
++ if (ipv6_is_in_dad_state(data->ifa)) {
++ dad_init_timer(data, data->ifa);
++ add_timer(&data->timer);
++ } else {
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++ in6_ifa_put(data->ifa);
++ kfree(data);
++ }
++}
++
++static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
++{
++ struct mptcp_dad_data *data;
++
++ data = kmalloc(sizeof(*data), GFP_ATOMIC);
++
++ if (!data)
++ return;
++
++ init_timer(&data->timer);
++ dad_init_timer(data, ifa);
++ add_timer(&data->timer);
++ in6_ifa_hold(ifa);
++}
++
++static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->idev->dev;
++ int addr_type = ipv6_addr_type(&ifa->addr);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->scope > RT_SCOPE_LINK ||
++ addr_type == IPV6_ADDR_ANY ||
++ (addr_type & IPV6_ADDR_LOOPBACK) ||
++ (addr_type & IPV6_ADDR_LINKLOCAL))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET6;
++ mpevent.addr.in6 = ifa->addr;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
++ &ifa->addr, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv6-addr add/rem-events */
++static int inet6_addr_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
++ struct net *net = dev_net(ifa6->idev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ if (ipv6_is_in_dad_state(ifa6))
++ dad_setup_timer(ifa6);
++ else
++ addr6_event_handler(ifa6, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block inet6_addr_notifier = {
++ .notifier_call = inet6_addr_event,
++};
++
++#endif
++
++/* React on ifup/down-events */
++static int netdev_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
++ struct in_device *in_dev;
++#if IS_ENABLED(CONFIG_IPV6)
++ struct inet6_dev *in6_dev;
++#endif
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ rcu_read_lock();
++ in_dev = __in_dev_get_rtnl(dev);
++
++ if (in_dev) {
++ for_ifa(in_dev) {
++ mptcp_pm_inetaddr_event(NULL, event, ifa);
++ } endfor_ifa(in_dev);
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ in6_dev = __in6_dev_get(dev);
++
++ if (in6_dev) {
++ struct inet6_ifaddr *ifa6;
++ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
++ inet6_addr_event(NULL, event, ifa6);
++ }
++#endif
++
++ rcu_read_unlock();
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_netdev_notifier = {
++ .notifier_call = netdev_event,
++};
++
++static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
++ else
++ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
++}
++
++static void full_mesh_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int i, index;
++ union inet_addr saddr, daddr;
++ sa_family_t family;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ /* Init local variables necessary for the rest */
++ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
++ saddr.ip = inet_sk(meta_sk)->inet_saddr;
++ daddr.ip = inet_sk(meta_sk)->inet_daddr;
++ family = AF_INET;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ saddr.in6 = inet6_sk(meta_sk)->saddr;
++ daddr.in6 = meta_sk->sk_v6_daddr;
++ family = AF_INET6;
++#endif
++ }
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, &saddr);
++ if (index < 0)
++ goto fallback;
++
++ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
++ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* Look for the address among the local addresses */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET && saddr.ip == ifa_address)
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto skip_ipv6;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv6:
++#endif
++
++ rcu_read_unlock();
++
++ if (family == AF_INET)
++ fmp->announced_addrs_v4 |= (1 << index);
++ else
++ fmp->announced_addrs_v6 |= (1 << index);
++
++ for (i = fmp->add_addr; i && fmp->add_addr; i--)
++ tcp_send_ack(mpcb->master_sk);
++
++ return;
++
++fallback:
++ rcu_read_unlock();
++ mptcp_fallback_default(mpcb);
++ return;
++}
++
++static void full_mesh_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ return;
++
++ if (!work_pending(&fmp->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &fmp->subflow_work);
++ }
++}
++
++/* Called upon release_sock, if the socket was owned by the user during
++ * a path-management event.
++ */
++static void full_mesh_release_sock(struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ struct sock *sk, *tmpsk;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++ int i;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* First, detect modifications or additions */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto removal;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++removal:
++#endif
++
++ /* Now, detect address-removals */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ bool shall_remove = true;
++
++ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
++ shall_remove = false;
++ break;
++ }
++ }
++ } else {
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
++ shall_remove = false;
++ break;
++ }
++ }
++ }
++
++ if (shall_remove) {
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
++ meta_sk);
++
++ mptcp_sub_force_close(sk);
++ }
++ }
++
++ /* Just call it optimistically. It actually cannot do any harm */
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ rcu_read_unlock();
++}
++
++static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int index, id = -1;
++
++ /* Handle the backup-flows */
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, addr);
++
++ if (index != -1) {
++ if (family == AF_INET) {
++ id = mptcp_local->locaddr4[index].loc4_id;
++ *low_prio = mptcp_local->locaddr4[index].low_prio;
++ } else {
++ id = mptcp_local->locaddr6[index].loc6_id;
++ *low_prio = mptcp_local->locaddr6[index].low_prio;
++ }
++ }
++
++
++ rcu_read_unlock();
++
++ return id;
++}
++
++static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
++ int remove_addr_len;
++ u8 unannouncedv4 = 0, unannouncedv6 = 0;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ mpcb->addr_signal = 0;
++
++ if (likely(!fmp->add_addr))
++ goto remove_addr;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* IPv4 */
++ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
++ if (unannouncedv4 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv4);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
++ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
++ opts->add_addr_v4 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v4 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
++ }
++
++ if (meta_v4)
++ goto skip_ipv6;
++
++skip_ipv4:
++ /* IPv6 */
++ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
++ if (unannouncedv6 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv6);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
++ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
++ opts->add_addr_v6 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v6 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
++ }
++
++skip_ipv6:
++ rcu_read_unlock();
++
++ if (!unannouncedv4 && !unannouncedv6 && skb)
++ fmp->add_addr--;
++
++remove_addr:
++ if (likely(!fmp->remove_addrs))
++ goto exit;
++
++ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
++ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
++ goto exit;
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_REMOVE_ADDR;
++ opts->remove_addrs = fmp->remove_addrs;
++ *size += remove_addr_len;
++ if (skb)
++ fmp->remove_addrs = 0;
++
++exit:
++ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
++}
++
++static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
++{
++ mptcp_v4_rem_raddress(mpcb, rem_id);
++ mptcp_v6_rem_raddress(mpcb, rem_id);
++}
++
++/* Output /proc/net/mptcp_fullmesh */
++static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
++{
++ const struct net *net = seq->private;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int i;
++
++ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
++
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
++ loc4->low_prio, &loc4->addr);
++ }
++
++ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
++ loc6->low_prio, &loc6->addr);
++ }
++ rcu_read_unlock_bh();
++
++ return 0;
++}
++
++static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_fm_seq_show);
++}
++
++static const struct file_operations mptcp_fm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_fm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_fm_init_net(struct net *net)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns;
++ int err = 0;
++
++ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
++ if (!fm_ns)
++ return -ENOBUFS;
++
++ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
++ if (!mptcp_local) {
++ err = -ENOBUFS;
++ goto err_mptcp_local;
++ }
++
++ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
++ &mptcp_fm_seq_fops)) {
++ err = -ENOMEM;
++ goto err_seq_fops;
++ }
++
++ mptcp_local->next_v4_index = 1;
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
++ INIT_LIST_HEAD(&fm_ns->events);
++ spin_lock_init(&fm_ns->local_lock);
++ fm_ns->net = net;
++ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
++
++ return 0;
++err_seq_fops:
++ kfree(mptcp_local);
++err_mptcp_local:
++ kfree(fm_ns);
++ return err;
++}
++
++static void mptcp_fm_exit_net(struct net *net)
++{
++ struct mptcp_addr_event *eventq, *tmp;
++ struct mptcp_fm_ns *fm_ns;
++ struct mptcp_loc_addr *mptcp_local;
++
++ fm_ns = fm_get_ns(net);
++ cancel_delayed_work_sync(&fm_ns->address_worker);
++
++ rcu_read_lock_bh();
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ kfree(mptcp_local);
++
++ spin_lock(&fm_ns->local_lock);
++ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
++ list_del(&eventq->list);
++ kfree(eventq);
++ }
++ spin_unlock(&fm_ns->local_lock);
++
++ rcu_read_unlock_bh();
++
++ remove_proc_entry("mptcp_fullmesh", net->proc_net);
++
++ kfree(fm_ns);
++}
++
++static struct pernet_operations full_mesh_net_ops = {
++ .init = mptcp_fm_init_net,
++ .exit = mptcp_fm_exit_net,
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly = {
++ .new_session = full_mesh_new_session,
++ .release_sock = full_mesh_release_sock,
++ .fully_established = full_mesh_create_subflows,
++ .new_remote_address = full_mesh_create_subflows,
++ .get_local_id = full_mesh_get_local_id,
++ .addr_signal = full_mesh_addr_signal,
++ .add_raddr = full_mesh_add_raddr,
++ .rem_raddr = full_mesh_rem_raddr,
++ .name = "fullmesh",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init full_mesh_register(void)
++{
++ int ret;
++
++ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
++
++ ret = register_pernet_subsys(&full_mesh_net_ops);
++ if (ret)
++ goto out;
++
++ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ if (ret)
++ goto err_reg_inetaddr;
++ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ if (ret)
++ goto err_reg_netdev;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ ret = register_inet6addr_notifier(&inet6_addr_notifier);
++ if (ret)
++ goto err_reg_inet6addr;
++#endif
++
++ ret = mptcp_register_path_manager(&full_mesh);
++ if (ret)
++ goto err_reg_pm;
++
++out:
++ return ret;
++
++
++err_reg_pm:
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++err_reg_inet6addr:
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++err_reg_netdev:
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++err_reg_inetaddr:
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ goto out;
++}
++
++static void full_mesh_unregister(void)
++{
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ mptcp_unregister_path_manager(&full_mesh);
++}
++
++module_init(full_mesh_register);
++module_exit(full_mesh_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("Full-Mesh MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
+new file mode 100644
+index 000000000000..43704ccb639e
+--- /dev/null
++++ b/net/mptcp/mptcp_input.c
+@@ -0,0 +1,2405 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <asm/unaligned.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++
++#include <linux/kconfig.h>
++
++/* is seq1 < seq2 ? */
++static inline bool before64(const u64 seq1, const u64 seq2)
++{
++ return (s64)(seq1 - seq2) < 0;
++}
++
++/* is seq1 > seq2 ? */
++#define after64(seq1, seq2) before64(seq2, seq1)
++
++static inline void mptcp_become_fully_estab(struct sock *sk)
++{
++ tcp_sk(sk)->mptcp->fully_established = 1;
++
++ if (is_master_tp(tcp_sk(sk)) &&
++ tcp_sk(sk)->mpcb->pm_ops->fully_established)
++ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
++}
++
++/* Similar to tcp_tso_acked without any memory accounting */
++static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 packets_acked, len;
++
++ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
++
++ packets_acked = tcp_skb_pcount(skb);
++
++ if (skb_unclone(skb, GFP_ATOMIC))
++ return 0;
++
++ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++ skb->truesize -= len;
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
++ packets_acked -= tcp_skb_pcount(skb);
++
++ if (packets_acked) {
++ BUG_ON(tcp_skb_pcount(skb) == 0);
++ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
++ }
++
++ return packets_acked;
++}
++
++/**
++ * Cleans the meta-socket retransmission queue and the reinject-queue.
++ * @meta_sk must be the meta-socket.
++ */
++static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
++{
++ struct sk_buff *skb, *tmp;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ bool acked = false;
++ u32 acked_pcount;
++
++ while ((skb = tcp_write_queue_head(meta_sk)) &&
++ skb != tcp_send_head(meta_sk)) {
++ bool fully_acked = true;
++
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ acked_pcount = tcp_tso_acked(meta_sk, skb);
++ if (!acked_pcount)
++ break;
++
++ fully_acked = false;
++ } else {
++ acked_pcount = tcp_skb_pcount(skb);
++ }
++
++ acked = true;
++ meta_tp->packets_out -= acked_pcount;
++ meta_tp->retrans_stamp = 0;
++
++ if (!fully_acked)
++ break;
++
++ tcp_unlink_write_queue(skb, meta_sk);
++
++ if (mptcp_is_data_fin(skb)) {
++ struct sock *sk_it;
++
++ /* DATA_FIN has been acknowledged - now we can close
++ * the subflows
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ unsigned long delay = 0;
++
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++ sk_wmem_free_skb(meta_sk, skb);
++ }
++ /* Remove acknowledged data from the reinject queue */
++ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ mptcp_tso_acked_reinject(meta_sk, skb);
++ break;
++ }
++
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ }
++
++ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
++ meta_tp->snd_up = meta_tp->snd_una;
++
++ if (acked) {
++ tcp_rearm_rto(meta_sk);
++ /* Normally this is done in tcp_try_undo_loss - but MPTCP
++ * does not call this function.
++ */
++ inet_csk(meta_sk)->icsk_retransmits = 0;
++ }
++}
++
++/* Inspired by tcp_rcv_state_process */
++static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
++ const struct sk_buff *skb, u32 data_seq,
++ u16 data_len)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ const struct tcphdr *th = tcp_hdr(skb);
++
++ /* State-machine handling if FIN has been enqueued and it has
++ * been acked (snd_una == write_seq) - it's important that this
++ * happens after sk_wmem_free_skb, because otherwise
++ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
++ */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1: {
++ struct dst_entry *dst;
++ int tmo;
++
++ if (meta_tp->snd_una != meta_tp->write_seq)
++ break;
++
++ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
++ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
++
++ dst = __sk_dst_get(sk);
++ if (dst)
++ dst_confirm(dst);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ /* Wake up lingering close() */
++ meta_sk->sk_state_change(meta_sk);
++ break;
++ }
++
++ if (meta_tp->linger2 < 0 ||
++ (data_len &&
++ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
++ meta_tp->rcv_nxt))) {
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_done(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ return 1;
++ }
++
++ tmo = tcp_fin_time(meta_sk);
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
++ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
++ /* Bad case. We could lose such FIN otherwise.
++ * It is not a big problem, but it looks confusing
++ * and not so rare event. We still can lose it now,
++ * if it spins in bh_lock_sock(), but it is really
++ * marginal case.
++ */
++ inet_csk_reset_keepalive_timer(meta_sk, tmo);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
++ }
++ break;
++ }
++ case TCP_CLOSING:
++ case TCP_LAST_ACK:
++ if (meta_tp->snd_una == meta_tp->write_seq) {
++ tcp_done(meta_sk);
++ return 1;
++ }
++ break;
++ }
++
++ /* step 7: process the segment text */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1:
++ case TCP_FIN_WAIT2:
++ /* RFC 793 says to queue data in these states,
++ * RFC 1122 says we MUST send a reset.
++ * BSD 4.4 also does reset.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp_is_data_fin2(skb, tp)) {
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_reset(meta_sk);
++ return 1;
++ }
++ }
++ break;
++ }
++
++ return 0;
++}
++
++/**
++ * @return:
++ * i) 1: Everything's fine.
++ * ii) -1: A reset has been sent on the subflow - csum-failure
++ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
++ * The last packet should not be destroyed by the caller, because that
++ * has already been done here.
++ */
++static int mptcp_verif_dss_csum(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1, *last = NULL;
++ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
++ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
++ int iter = 0;
++
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
++ unsigned int csum_len;
++
++ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
++ /* Mapping ends in the middle of the packet -
++ * csum only these bytes
++ */
++ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
++ else
++ csum_len = tmp->len;
++
++ offset = 0;
++ if (overflowed) {
++ char first_word[4];
++ first_word[0] = 0;
++ first_word[1] = 0;
++ first_word[2] = 0;
++ first_word[3] = *(tmp->data);
++ csum_tcp = csum_partial(first_word, 4, csum_tcp);
++ offset = 1;
++ csum_len--;
++ overflowed = 0;
++ }
++
++ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
++
++ /* Was the length odd? Then we have to merge the next byte
++ * correctly (see above)
++ */
++ if (csum_len != (csum_len & (~1)))
++ overflowed = 1;
++
++ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
++ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
++
++ /* If a 64-bit dss is present, we increase the offset
++ * by 4 bytes, as the high-order 64-bits will be added
++ * in the final csum_partial-call.
++ */
++ u32 offset = skb_transport_offset(tmp) +
++ TCP_SKB_CB(tmp)->dss_off;
++ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
++ offset += 4;
++
++ csum_tcp = skb_checksum(tmp, offset,
++ MPTCP_SUB_LEN_SEQ_CSUM,
++ csum_tcp);
++
++ csum_tcp = csum_partial(&data_seq,
++ sizeof(data_seq), csum_tcp);
++
++ dss_csum_added = 1; /* Just do it once */
++ }
++ last = tmp;
++ iter++;
++
++ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
++ !before(TCP_SKB_CB(tmp1)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ /* Now, checksum must be 0 */
++ if (unlikely(csum_fold(csum_tcp))) {
++ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
++ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
++ dss_csum_added, overflowed, iter);
++
++ tp->mptcp->send_mp_fail = 1;
++
++ /* map_data_seq is the data-seq number of the
++ * mapping we are currently checking
++ */
++ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
++
++ if (tp->mpcb->cnt_subflows > 1) {
++ mptcp_send_reset(sk);
++ ans = -1;
++ } else {
++ tp->mpcb->send_infinite_mapping = 1;
++
++ /* Need to purge the rcv-queue as it's no longer valid */
++ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
++ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
++ kfree_skb(tmp);
++ }
++
++ ans = 0;
++ }
++ }
++
++ return ans;
++}
++
++static inline void mptcp_prepare_skb(struct sk_buff *skb,
++ const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 inc = 0;
++
++ /* If skb is the end of this mapping (end is always at mapping-boundary
++ * thanks to the splitting/trimming), then we need to increase
++ * data-end-seq by 1 if this here is a data-fin.
++ *
++ * We need to do -1 because end_seq includes the subflow-FIN.
++ */
++ if (tp->mptcp->map_data_fin &&
++ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
++ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ inc = 1;
++
++ /* We manually set the fin-flag if it is a data-fin. For easy
++ * processing in tcp_recvmsg.
++ */
++ tcp_hdr(skb)->fin = 1;
++ } else {
++ /* We may have a subflow-fin with data but without data-fin */
++ tcp_hdr(skb)->fin = 0;
++ }
++
++ /* Adapt data-seq's to the packet itself. We kinda transform the
++ * dss-mapping to a per-packet granularity. This is necessary to
++ * correctly handle overlapping mappings coming from different
++ * subflows. Otherwise it would be a complete mess.
++ */
++ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
++ tcb->end_seq = tcb->seq + skb->len + inc;
++}
++
++/**
++ * @return: 1 if the segment has been eaten and can be suppressed,
++ * otherwise 0.
++ */
++static inline int mptcp_direct_copy(const struct sk_buff *skb,
++ struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
++ int eaten = 0;
++
++ __set_current_state(TASK_RUNNING);
++
++ local_bh_enable();
++ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
++ meta_tp->ucopy.len -= chunk;
++ meta_tp->copied_seq += chunk;
++ eaten = (chunk == skb->len);
++ tcp_rcv_space_adjust(meta_sk);
++ }
++ local_bh_disable();
++ return eaten;
++}
++
++static inline void mptcp_reset_mapping(struct tcp_sock *tp)
++{
++ tp->mptcp->map_data_len = 0;
++ tp->mptcp->map_data_seq = 0;
++ tp->mptcp->map_subseq = 0;
++ tp->mptcp->map_data_fin = 0;
++ tp->mptcp->mapping_present = 0;
++}
++
++/* The DSS-mapping received on the sk only covers the second half of the skb
++ * (cut at seq). We trim the head from the skb.
++ * Data will be freed upon kfree().
++ *
++ * Inspired by tcp_trim_head().
++ */
++static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ int len = seq - TCP_SKB_CB(skb)->seq;
++ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
++
++ if (len < skb_headlen(skb))
++ __skb_pull(skb, len);
++ else
++ __pskb_trim_head(skb, len - skb_headlen(skb));
++
++ TCP_SKB_CB(skb)->seq = new_seq;
++
++ skb->truesize -= len;
++ atomic_sub(len, &sk->sk_rmem_alloc);
++ sk_mem_uncharge(sk, len);
++}
++
++/* The DSS-mapping received on the sk only covers the first half of the skb
++ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
++ * as further packets may resolve the mapping of the second half of data.
++ *
++ * Inspired by tcp_fragment().
++ */
++static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ struct sk_buff *buff;
++ int nsize;
++ int nlen, len;
++
++ len = seq - TCP_SKB_CB(skb)->seq;
++ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
++ if (nsize < 0)
++ nsize = 0;
++
++ /* Get a new skb... force flag on. */
++ buff = alloc_skb(nsize, GFP_ATOMIC);
++ if (buff == NULL)
++ return -ENOMEM;
++
++ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
++ skb_reset_transport_header(buff);
++
++ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
++ tcp_hdr(skb)->fin = 0;
++
++ /* We absolutely need to call skb_set_owner_r before refreshing the
++ * truesize of buff, otherwise the moved data will be accounted twice.
++ */
++ skb_set_owner_r(buff, sk);
++ nlen = skb->len - len - nsize;
++ buff->truesize += nlen;
++ skb->truesize -= nlen;
++
++ /* Correct the sequence numbers. */
++ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
++ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
++ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
++
++ skb_split(skb, buff, len);
++
++ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken - stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
++ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
++ !tp->mpcb->infinite_mapping_rcv) {
++ /* Remove a pure subflow-fin from the queue and increase
++ * copied_seq.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* If we are not yet fully established and do not know the mapping for
++ * this segment, this path has to fallback to infinite or be torn down.
++ */
++ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
++ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
++ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
++ __func__, tp->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, __builtin_return_address(0),
++ TCP_SKB_CB(skb)->seq);
++
++ if (!is_master_tp(tp)) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ /* We do a seamless fallback and should not send an infinite mapping. */
++ tp->mpcb->send_infinite_mapping = 0;
++ tp->mptcp->fully_established = 1;
++ }
++
++ /* Receiver-side becomes fully established when a whole rcv-window has
++ * been received without the need to fallback due to the previous
++ * condition.
++ */
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->init_rcv_wnd -= skb->len;
++ if (tp->mptcp->init_rcv_wnd < 0)
++ mptcp_become_fully_estab(sk);
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken - stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 *ptr;
++ u32 data_seq, sub_seq, data_len, tcp_end_seq;
++
++ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
++ * in-order at the data-level. Thus data-seq-numbers can be inferred
++ * from what is expected at the data-level.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
++ tp->mptcp->map_subseq = tcb->seq;
++ tp->mptcp->map_data_len = skb->len;
++ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
++ tp->mptcp->mapping_present = 1;
++ return 0;
++ }
++
++ /* No mapping here? Exit - it is either already set or still on its way */
++ if (!mptcp_is_data_seq(skb)) {
++ /* Too many packets without a mapping - this subflow is broken */
++ if (!tp->mptcp->mapping_present &&
++ tp->rcv_nxt - tp->copied_seq > 65536) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ return 0;
++ }
++
++ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
++ ptr++;
++ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
++ ptr++;
++ data_len = get_unaligned_be16(ptr);
++
++ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
++ * The draft sets it to 0, but we really would like to have the
++ * real value, to have an easy handling afterwards here in this
++ * function.
++ */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ sub_seq = TCP_SKB_CB(skb)->seq;
++
++ /* If there is already a mapping - we check if it maps with the current
++ * one. If not - we reset.
++ */
++ if (tp->mptcp->mapping_present &&
++ (data_seq != (u32)tp->mptcp->map_data_seq ||
++ sub_seq != tp->mptcp->map_subseq ||
++ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
++ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
++ /* Mapping in packet is different from what we want */
++ pr_err("%s Mappings do not match!\n", __func__);
++ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
++ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
++ sub_seq, tp->mptcp->map_subseq, data_len,
++ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
++ tp->mptcp->map_data_fin);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* If the previous check was good, the current mapping is valid and we exit. */
++ if (tp->mptcp->mapping_present)
++ return 0;
++
++ /* Mapping not yet set on this subflow - we set it here! */
++
++ if (!data_len) {
++ mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++ /* We need to repeat mp_fail's until the sender fell
++ * back to infinite-mapping - here we stop repeating it.
++ */
++ tp->mptcp->send_mp_fail = 0;
++
++ /* We have to fixup data_len - it must be the same as skb->len */
++ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
++ sub_seq = tcb->seq;
++
++ /* TODO kill all other subflows than this one */
++ /* data_seq and so on are set correctly */
++
++ /* At this point, the meta-ofo-queue has to be emptied,
++ * as the following data is guaranteed to be in-order at
++ * the data and subflow-level
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ }
++
++ /* We are sending mp-fail's and thus are in fallback mode.
++ * Ignore packets which do not announce the fallback and still
++ * want to provide a mapping.
++ */
++ if (tp->mptcp->send_mp_fail) {
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* FIN increased the mapping-length by 1 */
++ if (mptcp_is_data_fin(skb))
++ data_len--;
++
++ /* The subflow-sequences of the packet must be
++ * (at least partially) part of the DSS-mapping's
++ * subflow-sequence-space.
++ *
++ * Basically, the mapping is not valid if any of the
++ * following conditions is true:
++ *
++ * 1. It's not a data_fin and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * The previous two can be merged into:
++ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
++ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
++ *
++ * 3. It's a data_fin and skb->len == 0 and
++ * MPTCP-sub_seq > TCP-end_seq
++ *
++ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
++ *
++ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
++ */
++
++ /* subflow-fin is not part of the mapping - ignore it here! */
++ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
++ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
++ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
++ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
++ before(sub_seq, tp->copied_seq)) {
++ /* The packet's subflow-sequences differ from what is in the
++ * packet's dss-mapping. The peer is misbehaving - reset
++ */
++ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
++ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
++ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
++ skb->len, data_len, tp->copied_seq);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* Did the DSS have 64-bit seqnums? */
++ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
++ /* Wrapped around? */
++ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
++ } else {
++ /* Else, access the default high-order bits */
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
++ }
++ } else {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
++
++ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
++ /* We make sure that the data_seq is invalid.
++ * It will be dropped later.
++ */
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ }
++ }
++
++ tp->mptcp->map_data_len = data_len;
++ tp->mptcp->map_subseq = sub_seq;
++ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
++ tp->mptcp->mapping_present = 1;
++
++ return 0;
++}
++
++/* Similar to tcp_sequence(...) */
++static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
++ u64 data_seq, u64 end_data_seq)
++{
++ const struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u64 rcv_wup64;
++
++ /* Wrap-around? */
++ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
++ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
++ meta_tp->rcv_wup;
++ } else {
++ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_wup);
++ }
++
++ return !before64(end_data_seq, rcv_wup64) &&
++ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1;
++ u32 tcp_end_seq;
++
++ if (!tp->mptcp->mapping_present)
++ return 0;
++
++ /* Either the new skb gave us the mapping and the first segment
++ * in the sub-rcv-queue has to be trimmed ...
++ */
++ tmp = skb_peek(&sk->sk_receive_queue);
++ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
++ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
++ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
++
++ /* ... or the new skb (tail) has to be split at the end. */
++ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
++ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
++ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
++ /* TODO : maybe handle this here better.
++ * We now just force meta-retransmission.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++ }
++
++ /* Now, remove old sk_buff's from the receive-queue.
++ * This may happen if the mapping has been lost for these segments and
++ * the next mapping has already been received.
++ */
++ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
++ break;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++
++ /* We cannot free a needed skb here, because its
++ * mapping is known to be valid from previous checks
++ */
++ __kfree_skb(tmp1);
++ }
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this mapping has been put in the meta-receive-queue
++ * -2 this mapping has been eaten by the application
++ */
++static int mptcp_queue_skb(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sk_buff *tmp, *tmp1;
++ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
++ bool data_queued = false;
++
++ /* Have we not yet received the full mapping? */
++ if (!tp->mptcp->mapping_present ||
++ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ return 0;
++
++ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
++ * OR
++ * This mapping is out of window
++ */
++ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
++ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
++ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ mptcp_reset_mapping(tp);
++
++ return -1;
++ }
++
++ /* Record it, because we want to send our data_fin on the same path */
++ if (tp->mptcp->map_data_fin) {
++ mpcb->dfin_path_index = tp->mptcp->path_index;
++ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
++ }
++
++ /* Verify the checksum */
++ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
++ int ret = mptcp_verif_dss_csum(sk);
++
++ if (ret <= 0) {
++ mptcp_reset_mapping(tp);
++ return 1;
++ }
++ }
++
++ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
++ /* Segments have to go to the meta-ofo-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true later.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
++ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
++ else
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ tcp_enter_quickack_mode(sk);
++ } else {
++ /* Ready for the meta-rcv-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ int eaten = 0;
++ bool copied_early = false;
++ bool fragstolen = false;
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ /* This segment has already been received */
++ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
++ __kfree_skb(tmp1);
++ goto next;
++ }
++
++#ifdef CONFIG_NET_DMA
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ tmp1->len <= meta_tp->ucopy.len &&
++ sock_owned_by_user(meta_sk) &&
++ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
++ copied_early = true;
++ eaten = 1;
++ }
++#endif
++
++ /* Is direct copy possible ? */
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
++ !copied_early)
++ eaten = mptcp_direct_copy(tmp1, meta_sk);
++
++ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
++ eaten = 1;
++
++ if (!eaten)
++ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
++
++ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
++#endif
++
++ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
++ mptcp_fin(meta_sk);
++
++ /* Check if this fills a gap in the ofo queue */
++ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
++ mptcp_ofo_queue(meta_sk);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
++ tmp1);
++ else
++#endif
++ if (eaten)
++ kfree_skb_partial(tmp1, fragstolen);
++
++ data_queued = true;
++next:
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ }
++
++ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
++ mptcp_reset_mapping(tp);
++
++ return data_queued ? -1 : -2;
++}
++
++void mptcp_data_ready(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct sk_buff *skb, *tmp;
++ int queued = 0;
++
++ /* restart before the check, because mptcp_fin might have changed the
++ * state.
++ */
++restart:
++ /* If the meta cannot receive data, there is no point in pushing data.
++ * If we are in time-wait, we may still be waiting for the final FIN.
++ * So, we should proceed with the processing.
++ */
++ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
++ skb_queue_purge(&sk->sk_receive_queue);
++ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
++ goto exit;
++ }
++
++ /* Iterate over all segments, detect their mapping (if we don't have
++ * one yet), validate them and push everything one level higher.
++ */
++ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
++ int ret;
++ /* Pre-validation - e.g., early fallback */
++ ret = mptcp_prevalidate_skb(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Set the current mapping */
++ ret = mptcp_detect_mapping(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Validation */
++ if (mptcp_validate_mapping(sk, skb) < 0)
++ goto restart;
++
++ /* Push a level higher */
++ ret = mptcp_queue_skb(sk);
++ if (ret < 0) {
++ if (ret == -1)
++ queued = ret;
++ goto restart;
++ } else if (ret == 0) {
++ continue;
++ } else { /* ret == 1 */
++ break;
++ }
++ }
++
++exit:
++ if (tcp_sk(sk)->close_it) {
++ tcp_send_ack(sk);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
++ }
++
++ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_data_ready(meta_sk);
++}
++
++
++int mptcp_check_req(struct sk_buff *skb, struct net *net)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct sock *meta_sk = NULL;
++
++ /* MPTCP structures not initialized */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (skb->protocol == htons(ETH_P_IP))
++ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr, net);
++#if IS_ENABLED(CONFIG_IPV6)
++ else /* IPv6 */
++ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, net);
++#endif /* CONFIG_IPV6 */
++
++ if (!meta_sk)
++ return 0;
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_search_req */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
++ return 1;
++}
++
++struct mp_join *mptcp_find_join(const struct sk_buff *skb)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether JOIN is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return NULL;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return NULL;
++ if (opsize > length)
++ return NULL; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
++ return (struct mp_join *)(ptr - 2);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return NULL;
++}
++
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
++{
++ const struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++ struct mp_join *join_opt = mptcp_find_join(skb);
++ if (!join_opt)
++ return 0;
++
++ /* MPTCP structures were not initialized, so return error */
++ if (mptcp_init_failed)
++ return -1;
++
++ token = join_opt->u.syn.token;
++ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ mpcb = tcp_sk(meta_sk)->mpcb;
++ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
++ /* We are in fallback-mode on the reception-side -
++ * no new subflows!
++ */
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ /* Coming from time-wait-sock processing in tcp_v4_rcv.
++ * We have to deschedule it before continuing, because otherwise
++ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
++ */
++ if (tw) {
++ inet_twsk_deschedule(tw, &tcp_death_row);
++ inet_twsk_put(tw);
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 1;
++}
++
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net)
++{
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++
++ token = mopt->mptcp_rem_token;
++ meta_sk = mptcp_hash_find(net, token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock(meta_sk);
++
++ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
++ * call tcp_vX_send_reset, because we hold already two socket-locks.
++ * (the listener and the meta from above)
++ *
++ * And the send-reset will try to take yet another one (ip_send_reply).
++ * Thus, we propagate the reset up to tcp_rcv_state_process.
++ */
++ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
++ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
++ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ else
++ /* Must make sure that upper layers won't free the
++ * skb if it is added to the backlog-queue.
++ */
++ skb_get(skb);
++ } else {
++ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
++ * the skb will finally be freed by tcp_v4_do_rcv (where we are
++ * coming from)
++ */
++ skb_get(skb);
++ if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ }
++
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 0;
++}
++
++/**
++ * Equivalent of tcp_fin() for MPTCP.
++ * May only be called once the FIN is a valid part of the data
++ * sequence number space - not earlier, while there are still holes.
++ */
++void mptcp_fin(struct sock *meta_sk)
++{
++ struct sock *sk = NULL, *sk_it;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
++ sk = sk_it;
++ break;
++ }
++ }
++
++ if (!sk || sk->sk_state == TCP_CLOSE)
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ inet_csk_schedule_ack(sk);
++
++ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
++ sock_set_flag(meta_sk, SOCK_DONE);
++
++ switch (meta_sk->sk_state) {
++ case TCP_SYN_RECV:
++ case TCP_ESTABLISHED:
++ /* Move to CLOSE_WAIT */
++ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
++ inet_csk(sk)->icsk_ack.pingpong = 1;
++ break;
++
++ case TCP_CLOSE_WAIT:
++ case TCP_CLOSING:
++ /* Received a retransmission of the FIN, do
++ * nothing.
++ */
++ break;
++ case TCP_LAST_ACK:
++ /* RFC793: Remain in the LAST-ACK state. */
++ break;
++
++ case TCP_FIN_WAIT1:
++ /* This case occurs when a simultaneous close
++ * happens, we must ack the received FIN and
++ * enter the CLOSING state.
++ */
++ tcp_send_ack(sk);
++ tcp_set_state(meta_sk, TCP_CLOSING);
++ break;
++ case TCP_FIN_WAIT2:
++ /* Received a FIN -- send ACK and enter TIME_WAIT. */
++ tcp_send_ack(sk);
++ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
++ break;
++ default:
++ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
++ * cases we should never reach this piece of code.
++ */
++ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
++ meta_sk->sk_state);
++ break;
++ }
++
++ /* It _is_ possible that we have something out-of-order _after_ the
++ * FIN. Probably, we should reset in this case. For now, drop them.
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ sk_mem_reclaim(meta_sk);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++
++ /* Do not send POLL_HUP for half duplex close. */
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
++ meta_sk->sk_state == TCP_CLOSE)
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
++ else
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
++ }
++
++ return;
++}
++
++static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ if (!meta_tp->packets_out)
++ return;
++
++ tcp_for_write_queue(skb, meta_sk) {
++ if (skb == tcp_send_head(meta_sk))
++ break;
++
++ if (mptcp_retransmit_skb(meta_sk, skb))
++ return;
++
++ if (skb == tcp_write_queue_head(meta_sk))
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ inet_csk(meta_sk)->icsk_rto,
++ TCP_RTO_MAX);
++ }
++}
++
++/* Handle the DATA_ACK */
++static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 prior_snd_una = meta_tp->snd_una;
++ int prior_packets;
++ u32 nwin, data_ack, data_seq;
++ u16 data_len = 0;
++
++ /* A valid packet came in - subflow is operational again */
++ tp->pf = 0;
++
++ /* Even if there is no data-ack, we stop retransmitting.
++ * Except if this is a SYN/ACK. Then it is just a retransmission
++ */
++ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ }
++
++ /* If we are in infinite mapping mode, rx_opt.data_ack has been
++ * set by mptcp_clean_rtx_infinite.
++ */
++ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
++ goto exit;
++
++ data_ack = tp->mptcp->rx_opt.data_ack;
++
++ if (unlikely(!tp->mptcp->fully_established) &&
++ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
++ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
++ * includes a data-ack, we are fully established
++ */
++ mptcp_become_fully_estab(sk);
++
++ /* Get the data_seq */
++ if (mptcp_is_data_seq(skb)) {
++ data_seq = tp->mptcp->rx_opt.data_seq;
++ data_len = tp->mptcp->rx_opt.data_len;
++ } else {
++ data_seq = meta_tp->snd_wl1;
++ }
++
++ /* If the ack is older than previous acks
++ * then we can probably ignore it.
++ */
++ if (before(data_ack, prior_snd_una))
++ goto exit;
++
++ /* If the ack includes data we haven't sent yet, discard
++ * this segment (RFC793 Section 3.9).
++ */
++ if (after(data_ack, meta_tp->snd_nxt))
++ goto exit;
++
++ /*** Now, update the window - inspired by tcp_ack_update_window ***/
++ nwin = ntohs(tcp_hdr(skb)->window);
++
++ if (likely(!tcp_hdr(skb)->syn))
++ nwin <<= tp->rx_opt.snd_wscale;
++
++ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
++ tcp_update_wl(meta_tp, data_seq);
++
++ /* Draft v09, Section 3.3.5:
++ * [...] It should only update its local receive window values
++ * when the largest sequence number allowed (i.e. DATA_ACK +
++ * receive window) increases. [...]
++ */
++ if (meta_tp->snd_wnd != nwin &&
++ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
++ meta_tp->snd_wnd = nwin;
++
++ if (nwin > meta_tp->max_window)
++ meta_tp->max_window = nwin;
++ }
++ }
++ /*** Done, update the window ***/
++
++ /* We passed data and got it acked, remove any soft error
++ * log. Something worked...
++ */
++ sk->sk_err_soft = 0;
++ inet_csk(meta_sk)->icsk_probes_out = 0;
++ meta_tp->rcv_tstamp = tcp_time_stamp;
++ prior_packets = meta_tp->packets_out;
++ if (!prior_packets)
++ goto no_queue;
++
++ meta_tp->snd_una = data_ack;
++
++ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
++
++ /* We are in loss-state, and something got acked, retransmit the whole
++ * queue now!
++ */
++ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
++ after(data_ack, prior_snd_una)) {
++ mptcp_xmit_retransmit_queue(meta_sk);
++ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
++ }
++
++ /* Simplified version of tcp_new_space, because the snd-buffer
++ * is handled by all the subflows.
++ */
++ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
++ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
++ if (meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ meta_sk->sk_write_space(meta_sk);
++ }
++
++ if (meta_sk->sk_state != TCP_ESTABLISHED &&
++ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
++ return;
++
++exit:
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++
++no_queue:
++ if (tcp_send_head(meta_sk))
++ tcp_ack_probe(meta_sk);
++
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++}
++
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
++
++ if (!tp->mpcb->infinite_mapping_snd)
++ return;
++
++ /* The difference between the two write_seq counters represents the
++ * offset between the data-sequence and the subflow-sequence. As the
++ * mapping is infinite, this offset is constant.
++ *
++ * Thus, from this difference we can infer the meta snd_una.
++ */
++ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
++ tp->snd_una;
++
++ mptcp_data_ack(sk, skb);
++}
++
++/**** static functions used by mptcp_parse_options */
++
++static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
++{
++ struct sock *sk_it, *tmpsk;
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
++ mptcp_reinject_data(sk_it, 0);
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
++ GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++}
++
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
++
++ /* If the socket is mp-capable we would have a mopt. */
++ if (!mopt)
++ return;
++
++ switch (mp_opt->sub) {
++ case MPTCP_SUB_CAPABLE:
++ {
++ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
++ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
++ mptcp_debug("%s: mp_capable: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (!sysctl_mptcp_enabled)
++ break;
++
++ /* We only support MPTCP version 0 */
++ if (mpcapable->ver != 0)
++ break;
++
++ /* MPTCP-RFC 6824:
++ * "If receiving a message with the 'B' flag set to 1, and this
++ * is not understood, then this SYN MUST be silently ignored;"
++ */
++ if (mpcapable->b) {
++ mopt->drop_me = 1;
++ break;
++ }
++
++ /* MPTCP-RFC 6824:
++ * "An implementation that only supports this method MUST set
++ * bit "H" to 1, and bits "C" through "G" to 0."
++ */
++ if (!mpcapable->h)
++ break;
++
++ mopt->saw_mpc = 1;
++ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
++
++ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
++ mopt->mptcp_key = mpcapable->sender_key;
++
++ break;
++ }
++ case MPTCP_SUB_JOIN:
++ {
++ const struct mp_join *mpjoin = (struct mp_join *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
++ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
++ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
++ mptcp_debug("%s: mp_join: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* saw_mpc must be set, because in tcp_check_req we assume that
++ * it is set to support falling back to regular TCP if a rexmitted
++ * SYN has no MP_CAPABLE or MP_JOIN
++ */
++ switch (opsize) {
++ case MPTCP_SUB_LEN_JOIN_SYN:
++ mopt->is_mp_join = 1;
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_rem_token = mpjoin->u.syn.token;
++ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_SYNACK:
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
++ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_ACK:
++ mopt->saw_mpc = 1;
++ mopt->join_ack = 1;
++ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
++ break;
++ }
++ break;
++ }
++ case MPTCP_SUB_DSS:
++ {
++ const struct mp_dss *mdss = (struct mp_dss *)ptr;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++
++ /* We check opsize for the csum and non-csum case. We do this,
++ * because the draft says that the csum SHOULD be ignored if
++ * it has not been negotiated in the MP_CAPABLE but still is
++ * present in the data.
++ *
++ * It will get ignored later in mptcp_queue_skb.
++ */
++ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
++ opsize != mptcp_sub_len_dss(mdss, 1)) {
++ mptcp_debug("%s: mp_dss: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ ptr += 4;
++
++ if (mdss->A) {
++ tcb->mptcp_flags |= MPTCPHDR_ACK;
++
++ if (mdss->a) {
++ mopt->data_ack = (u32) get_unaligned_be64(ptr);
++ ptr += MPTCP_SUB_LEN_ACK_64;
++ } else {
++ mopt->data_ack = get_unaligned_be32(ptr);
++ ptr += MPTCP_SUB_LEN_ACK;
++ }
++ }
++
++ tcb->dss_off = (ptr - skb_transport_header(skb));
++
++ if (mdss->M) {
++ if (mdss->m) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
++ mopt->data_seq = (u32) data_seq64;
++
++ ptr += 12; /* 64-bit dseq + subseq */
++ } else {
++ mopt->data_seq = get_unaligned_be32(ptr);
++ ptr += 8; /* 32-bit dseq + subseq */
++ }
++ mopt->data_len = get_unaligned_be16(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ /* Is a check-sum present? */
++ if (opsize == mptcp_sub_len_dss(mdss, 1))
++ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
++
++ /* DATA_FIN only possible with DSS-mapping */
++ if (mdss->F)
++ tcb->mptcp_flags |= MPTCPHDR_FIN;
++ }
++
++ break;
++ }
++ case MPTCP_SUB_ADD_ADDR:
++ {
++#if IS_ENABLED(CONFIG_IPV6)
++ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
++#endif /* CONFIG_IPV6 */
++ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* We have to manually parse the options if we got two of them. */
++ if (mopt->saw_add_addr) {
++ mopt->more_add_addr = 1;
++ break;
++ }
++ mopt->saw_add_addr = 1;
++ mopt->add_addr_ptr = ptr;
++ break;
++ }
++ case MPTCP_SUB_REMOVE_ADDR:
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
++ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (mopt->saw_rem_addr) {
++ mopt->more_rem_addr = 1;
++ break;
++ }
++ mopt->saw_rem_addr = 1;
++ mopt->rem_addr_ptr = ptr;
++ break;
++ case MPTCP_SUB_PRIO:
++ {
++ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_PRIO &&
++ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
++ mptcp_debug("%s: mp_prio: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->saw_low_prio = 1;
++ mopt->low_prio = mpprio->b;
++
++ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
++ mopt->saw_low_prio = 2;
++ mopt->prio_addr_id = mpprio->addr_id;
++ }
++ break;
++ }
++ case MPTCP_SUB_FAIL:
++ if (opsize != MPTCP_SUB_LEN_FAIL) {
++ mptcp_debug("%s: mp_fail: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++ mopt->mp_fail = 1;
++ break;
++ case MPTCP_SUB_FCLOSE:
++ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
++ mptcp_debug("%s: mp_fclose: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->mp_fclose = 1;
++ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
++
++ break;
++ default:
++ mptcp_debug("%s: Received unknown subtype: %d\n",
++ __func__, mp_opt->sub);
++ break;
++ }
++}
++
++/** Parse only MPTCP options */
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++ const unsigned char *ptr = (const unsigned char *)(th + 1);
++
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP)
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++}
++
++int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *sk;
++ u32 rtt_max = 0;
++
++ /* In MPTCP, we take the max delay across all flows,
++ * in order to take into account meta-reordering buffers.
++ */
++ mptcp_for_each_sk(mpcb, sk) {
++ if (!mptcp_sk_can_recv(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
++ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
++ }
++ if (time < (rtt_max >> 3) || !rtt_max)
++ return 1;
++
++ return 0;
++}
++
++static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ __be16 port = 0;
++ union inet_addr addr;
++ sa_family_t family;
++
++ if (mpadd->ipver == 4) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++ port = mpadd->u.v4.port;
++ family = AF_INET;
++ addr.in = mpadd->u.v4.addr;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (mpadd->ipver == 6) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
++ port = mpadd->u.v6.port;
++ family = AF_INET6;
++ addr.in6 = mpadd->u.v6.addr;
++#endif /* CONFIG_IPV6 */
++ } else {
++ return;
++ }
++
++ if (mpcb->pm_ops->add_raddr)
++ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
++}
++
++static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ int i;
++ u8 rem_id;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
++ rem_id = (&mprem->addrs_id)[i];
++
++ if (mpcb->pm_ops->rem_raddr)
++ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
++ mptcp_send_reset_rem_id(mpcb, rem_id);
++ }
++}
++
++static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether ADD_ADDR is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP:
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2)
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++#endif /* CONFIG_IPV6 */
++ goto cont;
++
++ mptcp_handle_add_addr(ptr, sk);
++ }
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
++ goto cont;
++
++ mptcp_handle_rem_addr(ptr, sk);
++ }
++cont:
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return;
++}
++
++static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
++{
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (unlikely(mptcp->rx_opt.mp_fail)) {
++ mptcp->rx_opt.mp_fail = 0;
++
++ if (!th->rst && !mpcb->infinite_mapping_snd) {
++ struct sock *sk_it;
++
++ mpcb->send_infinite_mapping = 1;
++ /* We resend everything that has not been acknowledged */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++
++ /* We artificially restart the whole send-queue. Thus,
++ * it is as if no packets are in flight
++ */
++ tcp_sk(meta_sk)->packets_out = 0;
++
++ /* If the snd_nxt already wrapped around, we have to
++ * undo the wrapping, as we are restarting from snd_una
++ * on.
++ */
++ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ }
++ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
++
++ /* Trigger a sending on the meta. */
++ mptcp_push_pending_frames(meta_sk);
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (sk != sk_it)
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++
++ return 0;
++ }
++
++ if (unlikely(mptcp->rx_opt.mp_fclose)) {
++ struct sock *sk_it, *tmpsk;
++
++ mptcp->rx_opt.mp_fclose = 0;
++ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
++ return 0;
++
++ if (tcp_need_reset(sk->sk_state))
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
++ mptcp_sub_force_close(sk_it);
++
++ tcp_reset(meta_sk);
++
++ return 1;
++ }
++
++ return 0;
++}
++
++static inline void mptcp_path_array_check(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++
++ if (unlikely(mpcb->list_rcvd)) {
++ mpcb->list_rcvd = 0;
++ if (mpcb->pm_ops->new_remote_address)
++ mpcb->pm_ops->new_remote_address(meta_sk);
++ }
++}
++
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
++ return 0;
++
++ if (mptcp_mp_fail_rcvd(sk, th))
++ return 1;
++
++ /* RFC 6824, Section 3.3:
++ * If a checksum is not present when its use has been negotiated, the
++ * receiver MUST close the subflow with a RST as it is considered broken.
++ */
++ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
++ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
++ if (tcp_need_reset(sk->sk_state))
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* We have to acknowledge retransmissions of the third
++ * ack.
++ */
++ if (mopt->join_ack) {
++ tcp_send_delayed_ack(sk);
++ mopt->join_ack = 0;
++ }
++
++ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
++ if (mopt->more_add_addr || mopt->more_rem_addr) {
++ mptcp_parse_addropt(skb, sk);
++ } else {
++ if (mopt->saw_add_addr)
++ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
++ if (mopt->saw_rem_addr)
++ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
++ }
++
++ mopt->more_add_addr = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->saw_rem_addr = 0;
++ }
++ if (mopt->saw_low_prio) {
++ if (mopt->saw_low_prio == 1) {
++ tp->mptcp->rcv_low_prio = mopt->low_prio;
++ } else {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
++ if (mptcp->rem_id == mopt->prio_addr_id)
++ mptcp->rcv_low_prio = mopt->low_prio;
++ }
++ }
++ mopt->saw_low_prio = 0;
++ }
++
++ mptcp_data_ack(sk, skb);
++
++ mptcp_path_array_check(mptcp_meta_sk(sk));
++ /* Socket may have been mp_killed by a REMOVE_ADDR */
++ if (tp->mp_killed)
++ return 1;
++
++ return 0;
++}
++
++/* In case of fastopen, some data can already be in the write queue.
++ * We need to update the sequence number of the segments as they
++ * were initially TCP sequence numbers.
++ */
++static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
++ struct sk_buff *skb;
++ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
++
++ /* There should only be one skb in write queue: the data not
++ * acknowledged in the SYN+ACK. In this case, we need to map
++ * this data to data sequence numbers.
++ */
++ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
++ /* If the server only acknowledges partially the data sent in
++ * the SYN, we need to trim the acknowledged part because
++ * we don't want to retransmit this already received data.
++ * When we reach this point, tcp_ack() has already cleaned up
++ * fully acked segments. However, tcp trims partially acked
++ * segments only when retransmitting. Since MPTCP comes into
++ * play only now, we will fake an initial transmit, and
++ * retransmit_skb() will not be called. The following fragment
++ * comes from __tcp_retransmit_skb().
++ */
++ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
++ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
++ master_tp->snd_una));
++ /* tcp_trim_head can only return ENOMEM if skb is
++ * cloned. It is not the case here (see
++ * tcp_send_syn_data).
++ */
++ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
++ TCP_SKB_CB(skb)->seq));
++ }
++
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* We can advance write_seq by the number of bytes unacknowledged
++ * and that were mapped in the previous loop.
++ */
++ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
++
++ /* The packets from the master_sk will be attached to it later.
++ * Until that time, its write queue is empty, and
++ * write_seq must align with snd_una.
++ */
++ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
++ master_tp->packets_out = 0;
++
++ /* Although this data has already been sent over the subsk,
++ * it has never been sent over the meta_sk, so we rewind
++ * the send_head so that tcp considers it an initial send
++ * (instead of a retransmit).
++ */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++}
++
++/* The skptr is needed, because if we become MPTCP-capable, we have to switch
++ * from meta-socket to master-socket.
++ *
++ * @return: 1 - we want to reset this connection
++ * 2 - we want to discard the received syn/ack
++ * 0 - everything is fine - continue
++ */
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (mptcp(tp)) {
++ u8 hash_mac_check[20];
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++ if (memcmp(hash_mac_check,
++ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* Set this flag in order to postpone data sending
++ * until the 4th ack arrives.
++ */
++ tp->mptcp->pre_established = 1;
++ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u32 *)&tp->mptcp->sender_mac[0]);
++
++ } else if (mopt->saw_mpc) {
++ struct sock *meta_sk = sk;
++
++ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
++ ntohs(tcp_hdr(skb)->window)))
++ return 2;
++
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ *skptr = sk;
++ tp = tcp_sk(sk);
++
++ /* If fastopen was used data might be in the send queue. We
++ * need to update their sequence number to MPTCP-level seqno.
++ * Note that it can happen in rare cases that fastopen_req is
++ * NULL and syn_data is 0 but fastopen indeed occurred and
++ * data has been queued in the write queue (but not sent).
++ * Example of such rare cases: connect is non-blocking and
++ * TFO is configured to work without cookies.
++ */
++ if (!skb_queue_empty(&meta_sk->sk_write_queue))
++ mptcp_rcv_synsent_fastopen(meta_sk);
++
++ /* -1, because the SYN consumed 1 byte. In case of TFO, we
++ * start the subflow-sequence number as if the data of the SYN
++ * is not part of any mapping.
++ */
++ tp->mptcp->snt_isn = tp->snd_una - 1;
++ tp->mpcb->dss_csum = mopt->dss_csum;
++ tp->mptcp->include_mpc = 1;
++
++ /* Ensure that fastopen is handled at the meta-level. */
++ tp->fastopen_req = NULL;
++
++ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
++ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
++
++ /* hold in sk_clone_lock due to initialization to 2 */
++ sock_put(sk);
++ } else {
++ tp->request_mptcp = 0;
++
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++ }
++
++ if (mptcp(tp))
++ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++bool mptcp_should_expand_sndbuf(const struct sock *sk)
++{
++ const struct sock *sk_it;
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int cnt_backups = 0;
++ int backup_available = 0;
++
++ /* We circumvent this check in tcp_check_space, because we want to
++ * always call sk_write_space. So, we reproduce the check here.
++ */
++ if (!meta_sk->sk_socket ||
++ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ return false;
++
++ /* If the user specified a specific send buffer setting, do
++ * not modify it.
++ */
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return false;
++
++ /* If we are under global TCP memory pressure, do not expand. */
++ if (sk_under_memory_pressure(meta_sk))
++ return false;
++
++ /* If we are under soft global TCP memory pressure, do not expand. */
++ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
++ return false;
++
++ /* For MPTCP we look for a subsocket that could send data.
++ * If we found one, then we update the send-buffer.
++ */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ /* Backup-flows have to be counted - if there is no other
++ * subflow we take the backup-flow into account.
++ */
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
++ cnt_backups++;
++
++ if (tp_it->packets_out < tp_it->snd_cwnd) {
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
++ backup_available = 1;
++ continue;
++ }
++ return true;
++ }
++ }
++
++ /* Backup-flow is available for sending - update send-buffer */
++ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
++ return true;
++ return false;
++}
++
++void mptcp_init_buffer_space(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int space;
++
++ tcp_init_buffer_space(sk);
++
++ if (is_master_tp(tp)) {
++ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
++ meta_tp->rcvq_space.time = tcp_time_stamp;
++ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
++
++ /* If there is only one subflow, we just use regular TCP
++ * autotuning. User-locks are handled already by
++ * tcp_init_buffer_space
++ */
++ meta_tp->window_clamp = tp->window_clamp;
++ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
++ meta_sk->sk_sndbuf = sk->sk_sndbuf;
++
++ return;
++ }
++
++ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
++ goto snd_buf;
++
++ /* Adding a new subflow to the rcv-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
++ if (space > meta_sk->sk_rcvbuf) {
++ meta_tp->window_clamp += tp->window_clamp;
++ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = space;
++ }
++
++snd_buf:
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return;
++
++ /* Adding a new subflow to the send-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
++ if (space > meta_sk->sk_sndbuf) {
++ meta_sk->sk_sndbuf = space;
++ meta_sk->sk_write_space(meta_sk);
++ }
++}
++
++void mptcp_tcp_set_rto(struct sock *sk)
++{
++ tcp_set_rto(sk);
++ mptcp_set_rto(sk);
++}
+diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
+new file mode 100644
+index 000000000000..1183d1305d35
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv4.c
+@@ -0,0 +1,483 @@
++/*
++ * MPTCP implementation - IPv4-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/ip.h>
++#include <linux/list.h>
++#include <linux/skbuff.h>
++#include <linux/spinlock.h>
++#include <linux/tcp.h>
++
++#include <net/inet_common.h>
++#include <net/inet_connection_sock.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/request_sock.h>
++#include <net/tcp.h>
++
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v4_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v4_reqsk_destructor(req);
++}
++
++static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.ip = inet_rsk(req)->ir_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp_request_sock_ops */
++struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
++ .family = PF_INET,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_rtx_synack,
++ .send_ack = tcp_v4_reqsk_send_ack,
++ .destructor = mptcp_v4_reqsk_destructor,
++ .send_reset = tcp_v4_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++/* Similar to tcp_v4_conn_request */
++static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_join_request_sock_ipv4_ops,
++ meta_sk, skb);
++}
++
++/* We only process join requests here. (either the SYN or the final ACK) */
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct sock *sk;
++
++ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
++ iph->saddr, th->source, iph->daddr,
++ th->dest, inet_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v4_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v4_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we hold
++ * already the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v4_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet_csk_search_req(meta_sk, &prev, th->source,
++ iph->saddr, iph->daddr);
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v4_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (ireq->ir_rmt_port == rport &&
++ ireq->ir_rmt_addr == raddr &&
++ ireq->ir_loc_addr == laddr &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv4 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin_family = AF_INET;
++ rem_in.sin_family = AF_INET;
++ loc_in.sin_port = 0;
++ if (rem->port)
++ rem_in.sin_port = rem->port;
++ else
++ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin_addr = loc->addr;
++ rem_in.sin_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin_addr,
++ ntohs(loc_in.sin_port), &rem_in.sin_addr,
++ ntohs(rem_in.sin_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init4_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v4_specific = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v4_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ip_setsockopt,
++ .getsockopt = ip_getsockopt,
++ .addr2sockaddr = inet_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in),
++ .bind_conflict = inet_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ip_setsockopt,
++ .compat_getsockopt = compat_ip_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++/* General initialization of IPv4 for MPTCP */
++int mptcp_pm_v4_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp_request_sock_ops;
++
++ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
++
++ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
++ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v4_undo(void)
++{
++ kmem_cache_destroy(mptcp_request_sock_ops.slab);
++ kfree(mptcp_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
+new file mode 100644
+index 000000000000..1036973aa855
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv6.c
+@@ -0,0 +1,518 @@
++/*
++ * MPTCP implementation - IPv6-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/in6.h>
++#include <linux/kernel.h>
++
++#include <net/addrconf.h>
++#include <net/flow.h>
++#include <net/inet6_connection_sock.h>
++#include <net/inet6_hashtables.h>
++#include <net/inet_common.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/ip6_route.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
++#include <net/tcp.h>
++#include <net/transp_v6.h>
++
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v6_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v6_reqsk_destructor(req);
++}
++
++static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible, because if we fail later
++ * (e.g., in get_local_id), reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove, as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp6_request_sock_ops */
++struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
++ .family = AF_INET6,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_v6_rtx_synack,
++ .send_ack = tcp_v6_reqsk_send_ack,
++ .destructor = mptcp_v6_reqsk_destructor,
++ .send_reset = tcp_v6_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyway. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_join_request_sock_ipv6_ops,
++ meta_sk, skb);
++}
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
++ struct sock *sk;
++
++ sk = __inet6_lookup_established(sock_net(meta_sk),
++ &tcp_hashinfo,
++ &ip6h->saddr, th->source,
++ &ip6h->daddr, ntohs(th->dest),
++ inet6_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v6_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v6_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v6_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet6_csk_search_req(meta_sk, &prev, th->source,
++ &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v6_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
++ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
++ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU, so it might have been recycled
++ * and put into another hash-table list. After the lookup we may thus
++ * end up in a different list, in which case we need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv6 subflow.
++ *
++ * We are in user-context and the meta-sock lock is held.
++ */
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in6 loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin6_family = AF_INET6;
++ rem_in.sin6_family = AF_INET6;
++ loc_in.sin6_port = 0;
++ if (rem->port)
++ rem_in.sin6_port = rem->port;
++ else
++ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin6_addr = loc->addr;
++ rem_in.sin6_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin6_addr,
++ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
++ ntohs(rem_in.sin6_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in6), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init6_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v6_specific = {
++ .queue_xmit = inet6_csk_xmit,
++ .send_check = tcp_v6_send_check,
++ .rebuild_header = inet6_sk_rebuild_header,
++ .sk_rx_dst_set = inet6_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct ipv6hdr),
++ .net_frag_header_len = sizeof(struct frag_hdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_pm_v6_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
++
++ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
++
++ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
++ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v6_undo(void)
++{
++ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
++ kfree(mptcp6_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
+new file mode 100644
+index 000000000000..6f5087983175
+--- /dev/null
++++ b/net/mptcp/mptcp_ndiffports.c
+@@ -0,0 +1,161 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++
++struct ndiffports_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++};
++
++static int num_subflows __read_mostly = 2;
++module_param(num_subflows, int, 0644);
++MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets.
++ *
++ * This function uses a 'goto next_subflow' to allow releasing the lock
++ * between new subflows, giving other processes a chance to do some work
++ * on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct ndiffports_priv *pm_priv = container_of(work,
++ struct ndiffports_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++ } else {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mptcp_loc6 loc;
++ struct mptcp_rem6 rem;
++
++ loc.addr = inet6_sk(meta_sk)->saddr;
++ loc.loc6_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr = meta_sk->sk_v6_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem6_id = 0; /* Default 0 */
++
++ mptcp_init6_subsockets(meta_sk, &loc, &rem);
++#endif
++ }
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void ndiffports_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++}
++
++static void ndiffports_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++static struct mptcp_pm_ops ndiffports __read_mostly = {
++ .new_session = ndiffports_new_session,
++ .fully_established = ndiffports_create_subflows,
++ .get_local_id = ndiffports_get_local_id,
++ .name = "ndiffports",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init ndiffports_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
++
++ if (mptcp_register_path_manager(&ndiffports))
++ goto exit;
++
++ return 0;
++
++exit:
++ return -1;
++}
++
++static void ndiffports_unregister(void)
++{
++ mptcp_unregister_path_manager(&ndiffports);
++}
++
++module_init(ndiffports_register);
++module_exit(ndiffports_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
+new file mode 100644
+index 000000000000..ec4e98622637
+--- /dev/null
++++ b/net/mptcp/mptcp_ofo_queue.c
+@@ -0,0 +1,295 @@
++/*
++ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <linux/slab.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp;
++
++ mptcp_for_each_tp(mpcb, tp) {
++ if (tp->mptcp->shortcut_ofoqueue == skb) {
++ tp->mptcp->shortcut_ofoqueue = NULL;
++ return;
++ }
++ }
++}
++
++/* Does 'skb' fit after 'here' in the queue 'head'?
++ * If yes, we queue it and return 1.
++ */
++static int mptcp_ofo_queue_after(struct sk_buff_head *head,
++ struct sk_buff *skb, struct sk_buff *here,
++ const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We want to queue skb after here, thus seq >= end_seq */
++ if (before(seq, TCP_SKB_CB(here)->end_seq))
++ return 0;
++
++ if (seq == TCP_SKB_CB(here)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
++ return 1;
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ return -1;
++ }
++ }
++
++ /* If here is the last one, we can always queue it */
++ if (skb_queue_is_last(head, here)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ } else {
++ struct sk_buff *skb1 = skb_queue_next(head, here);
++ /* It's not the last one, but does it fit between 'here' and
++ * the one after 'here'? That is, is end_seq <= after_here->seq?
++ */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ }
++ }
++
++ return 0;
++}
++
++static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
++ struct sk_buff_head *head, struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb1, *best_shortcut = NULL;
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++ u32 distance = 0xffffffff;
++
++ /* First, check the tp's shortcut */
++ if (!shortcut) {
++ if (skb_queue_empty(head)) {
++ __skb_queue_head(head, skb);
++ goto end;
++ }
++ } else {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++ /* Is the tp's shortcut a hit? If yes, we insert. */
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Check the shortcuts of the other subsockets. */
++ mptcp_for_each_tp(mpcb, tp_it) {
++ shortcut = tp_it->mptcp->shortcut_ofoqueue;
++ /* Can we queue it here? If yes, do so! */
++ if (shortcut) {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Could not queue it, check if we are close.
++ * We are looking for a shortcut, close enough to seq to
++ * set skb1 prematurely and thus improve the subsequent lookup,
++ * which tries to find a skb1 so that skb1->seq <= seq.
++ *
++ * So, here we only take shortcuts whose shortcut->seq > seq,
++ * and minimize the distance between shortcut->seq and seq and
++ * set best_shortcut to this one with the minimal distance.
++ *
++ * That way, the subsequent while-loop is shortest.
++ */
++ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
++ /* Are we closer than the current best shortcut? */
++ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
++ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
++ best_shortcut = shortcut;
++ }
++ }
++ }
++
++ if (best_shortcut)
++ skb1 = best_shortcut;
++ else
++ skb1 = skb_peek_tail(head);
++
++ if (seq == TCP_SKB_CB(skb1)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ skb = NULL;
++ }
++
++ goto end;
++ }
++
++ /* Find the insertion point, starting from best_shortcut if available.
++ *
++ * Inspired from tcp_data_queue_ofo.
++ */
++ while (1) {
++ /* skb1->seq <= seq */
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(head, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(head, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. */
++ __kfree_skb(skb);
++ skb = NULL;
++ goto end;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(head, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(head, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(head, skb);
++ else
++ __skb_queue_after(head, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(head, skb)) {
++ skb1 = skb_queue_next(head, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, head);
++ mptcp_remove_shortcuts(mpcb, skb1);
++ __kfree_skb(skb1);
++ }
++
++end:
++ if (skb) {
++ skb_set_owner_r(skb, meta_sk);
++ tp->mptcp->shortcut_ofoqueue = skb;
++ }
++
++ return;
++}
++
++/**
++ * @sk: the subflow that received this skb.
++ */
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
++ &tcp_sk(meta_sk)->out_of_order_queue, tp);
++}
++
++bool mptcp_prune_ofo_queue(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ bool res = false;
++
++ if (!skb_queue_empty(&tp->out_of_order_queue)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
++ mptcp_purge_ofo_queue(tp);
++
++ /* No sack at the mptcp-level */
++ sk_mem_reclaim(sk);
++ res = true;
++ }
++
++ return res;
++}
++
++void mptcp_ofo_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
++ break;
++
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ __kfree_skb(skb);
++ continue;
++ }
++
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++
++ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
++ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++ if (tcp_hdr(skb)->fin)
++ mptcp_fin(meta_sk);
++ }
++}
++
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
++{
++ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
++ struct sk_buff *skb, *tmp;
++
++ skb_queue_walk_safe(head, skb, tmp) {
++ __skb_unlink(skb, head);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ kfree_skb(skb);
++ }
++}
+diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
+new file mode 100644
+index 000000000000..53f5c43bb488
+--- /dev/null
++++ b/net/mptcp/mptcp_olia.c
+@@ -0,0 +1,311 @@
++/*
++ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
++ *
++ * Algorithm design:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ * Nicolas Gast <nicolas.gast@epfl.ch>
++ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
++ *
++ * Implementation:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++static int scale = 10;
++
++struct mptcp_olia {
++ u32 mptcp_loss1;
++ u32 mptcp_loss2;
++ u32 mptcp_loss3;
++ int epsilon_num;
++ u32 epsilon_den;
++ int mptcp_snd_cwnd_cnt;
++};
++
++static inline int mptcp_olia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_olia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++/* account for the artificial inflation of cwnd (see RFC 5681)
++ * during the fast-retransmit phase
++ */
++static u32 mptcp_get_crt_cwnd(struct sock *sk)
++{
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (icsk->icsk_ca_state == TCP_CA_Recovery)
++ return tcp_sk(sk)->snd_ssthresh;
++ else
++ return tcp_sk(sk)->snd_cwnd;
++}
++
++/* return the denominator of the first term of the increase equation */
++static u64 mptcp_get_rate(const struct mptcp_cb *mpcb, u32 path_rtt)
++{
++ struct sock *sk;
++ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u64 scaled_num;
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
++ rate += div_u64(scaled_num , tp->srtt_us);
++ }
++ rate *= rate;
++ return rate;
++}
++
++/* find the maximum cwnd, used to find set M */
++static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
++{
++ struct sock *sk;
++ u32 best_cwnd = 0;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd > best_cwnd)
++ best_cwnd = tmp_cwnd;
++ }
++ return best_cwnd;
++}
++
++static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
++{
++ struct mptcp_olia *ca;
++ struct tcp_sock *tp;
++ struct sock *sk;
++ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
++ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
++ u8 M = 0, B_not_M = 0;
++
++ /* TODO - integrate this in the following loop - we just want to iterate once */
++
++ max_cwnd = mptcp_get_max_cwnd(mpcb);
++
++ /* find the best path */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ /* TODO - check here and rename variables */
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
++ best_rtt = tmp_rtt;
++ best_int = tmp_int;
++ best_cwnd = tmp_cwnd;
++ }
++ }
++
++ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
++ /* find the size of M and B_not_M */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd == max_cwnd) {
++ M++;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
++ B_not_M++;
++ }
++ }
++
++ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ if (B_not_M == 0) {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++
++ if (tmp_cwnd < max_cwnd &&
++ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
++ ca->epsilon_num = 1;
++ ca->epsilon_den = mpcb->cnt_established * B_not_M;
++ } else if (tmp_cwnd == max_cwnd) {
++ ca->epsilon_num = -1;
++ ca->epsilon_den = mpcb->cnt_established * M;
++ } else {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++ }
++ }
++}
++
++/* setting the initial values */
++static void mptcp_olia_init(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (mptcp(tp)) {
++ ca->mptcp_loss1 = tp->snd_una;
++ ca->mptcp_loss2 = tp->snd_una;
++ ca->mptcp_loss3 = tp->snd_una;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++}
++
++/* updating inter-loss distance and ssthresh */
++static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ if (new_state == TCP_CA_Loss ||
++ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
++ !inet_csk(sk)->icsk_retransmits) {
++ ca->mptcp_loss1 = ca->mptcp_loss2;
++ ca->mptcp_loss2 = ca->mptcp_loss3;
++ }
++ }
++}
++
++/* main algorithm */
++static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ u64 inc_num, inc_den, rate, cwnd_scaled;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ ca->mptcp_loss3 = tp->snd_una;
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ /* slow start if it is in the safe area */
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ return;
++ }
++
++ mptcp_get_epsilon(mpcb);
++ rate = mptcp_get_rate(mpcb, tp->srtt_us);
++ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
++ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
++
++ /* calculate the increasing term, scaling is used to reduce the rounding effect */
++ if (ca->epsilon_num == -1) {
++ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
++ inc_num = rate - ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt -= div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ } else {
++ inc_num = ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled - rate;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++ } else {
++ inc_num = ca->epsilon_num * rate +
++ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++
++
++ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
++ tp->snd_cwnd++;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
++ tp->snd_cwnd = max((int)1, (int)tp->snd_cwnd - 1);
++ ca->mptcp_snd_cwnd_cnt = 0;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_olia = {
++ .init = mptcp_olia_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_olia_cong_avoid,
++ .set_state = mptcp_olia_set_state,
++ .owner = THIS_MODULE,
++ .name = "olia",
++};
++
++static int __init mptcp_olia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_olia);
++}
++
++static void __exit mptcp_olia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_olia);
++}
++
++module_init(mptcp_olia_register);
++module_exit(mptcp_olia_unregister);
++
++MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
+new file mode 100644
+index 000000000000..400ea254c078
+--- /dev/null
++++ b/net/mptcp/mptcp_output.c
+@@ -0,0 +1,1743 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/kconfig.h>
++#include <linux/skbuff.h>
++#include <linux/tcp.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++#include <net/sock.h>
++
++static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
++ MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++
++static inline int mptcp_sub_len_remove_addr(u16 bitfield)
++{
++ unsigned int c;
++ for (c = 0; bitfield; c++)
++ bitfield &= bitfield - 1;
++ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
++}
++
++int mptcp_sub_len_remove_addr_align(u16 bitfield)
++{
++ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
++}
++EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
++
++/* get the data-seq and end-data-seq and store them again in the
++ * tcp_skb_cb
++ */
++static int mptcp_reconstruct_mapping(struct sk_buff *skb)
++{
++ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
++ u32 *p32;
++ u16 *p16;
++
++ if (!mpdss->M)
++ return 1;
++
++ /* Move the pointer to the data-seq */
++ p32 = (u32 *)mpdss;
++ p32++;
++ if (mpdss->A) {
++ p32++;
++ if (mpdss->a)
++ p32++;
++ }
++
++ TCP_SKB_CB(skb)->seq = ntohl(*p32);
++
++ /* Get the data_len to calculate the end_data_seq */
++ p32++;
++ p32++;
++ p16 = (u16 *)p32;
++ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct sk_buff *skb_it;
++
++ skb_it = tcp_write_queue_head(meta_sk);
++
++ tcp_for_write_queue_from(skb_it, meta_sk) {
++ if (skb_it == tcp_send_head(meta_sk))
++ break;
++
++ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
++ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
++ break;
++ }
++ }
++}
++
++/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
++ * coming from the meta-retransmit-timer
++ */
++static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
++ struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb, *skb1;
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u32 seq, end_seq;
++
++ if (clone_it) {
++ /* pskb_copy is necessary here, because the TCP/IP-headers
++ * will be changed when it's going to be reinjected on another
++ * subflow.
++ */
++ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
++ } else {
++ __skb_unlink(orig_skb, &sk->sk_write_queue);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++ sk->sk_wmem_queued -= orig_skb->truesize;
++ sk_mem_uncharge(sk, orig_skb->truesize);
++ skb = orig_skb;
++ }
++ if (unlikely(!skb))
++ return;
++
++ if (sk && mptcp_reconstruct_mapping(skb)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ skb->sk = meta_sk;
++
++ /* If it has already reached the destination, we don't have to reinject it */
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ /* Only reinject segments that are fully covered by the mapping */
++ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
++ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ __kfree_skb(skb);
++
++ /* Ok, now we have to look for the full mapping in the meta
++ * send-queue :S
++ */
++ tcp_for_write_queue(skb, meta_sk) {
++ /* Not yet at the mapping? */
++ if (before(TCP_SKB_CB(skb)->seq, seq))
++ continue;
++ /* We have passed by the mapping */
++ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
++ return;
++
++ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
++ }
++ return;
++ }
++
++ /* Segment goes back to the MPTCP-layer. So, we need to zero the
++ * path_mask/dss.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ /* We need to find out the path-mask from the meta-write-queue
++ * to properly select a subflow.
++ */
++ mptcp_find_and_set_pathmask(meta_sk, skb);
++
++ /* If it's empty, just add */
++ if (skb_queue_empty(&mpcb->reinject_queue)) {
++ skb_queue_head(&mpcb->reinject_queue, skb);
++ return;
++ }
++
++ /* Find place to insert skb - or even we can 'drop' it, as the
++ * data is already covered by other skb's in the reinject-queue.
++ *
++ * This is inspired by code from tcp_data_queue.
++ */
++
++ skb1 = skb_peek_tail(&mpcb->reinject_queue);
++ seq = TCP_SKB_CB(skb)->seq;
++ while (1) {
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++
++ /* Does the skb overlap the previous one? */
++ end_seq = TCP_SKB_CB(skb)->end_seq;
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. Don't reinject */
++ __kfree_skb(skb);
++ return;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(&mpcb->reinject_queue, skb);
++ else
++ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
++ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, &mpcb->reinject_queue);
++ __kfree_skb(skb1);
++ }
++ return;
++}
++
++/* Inserts data into the reinject queue */
++void mptcp_reinject_data(struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb_it, *tmp;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = tp->meta_sk;
++
++ /* It has already been closed - there is really no point in reinjecting */
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return;
++
++ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
++ /* Subflow SYNs and FINs are not reinjected,
++ *
++ * nor are empty subflow-FINs carrying a data-fin;
++ * those are reinjected below (without the subflow-FIN flag).
++ */
++ if (tcb->tcp_flags & TCPHDR_SYN ||
++ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
++ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
++ continue;
++
++ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
++ }
++
++ skb_it = tcp_write_queue_tail(meta_sk);
++ /* If sk has sent the empty data-fin, we have to reinject it too. */
++ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
++ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
++ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
++ }
++
++ mptcp_push_pending_frames(meta_sk);
++
++ tp->pf = 1;
++}
++EXPORT_SYMBOL(mptcp_reinject_data);
++
++static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
++ struct sock *subsk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk_it;
++ int all_empty = 1, all_acked;
++
++ /* In infinite mapping we always try to combine */
++ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ return;
++ }
++
++ /* Don't combine if they didn't combine - otherwise we end up in
++ * TIME_WAIT, even if our app is smart enough to avoid it
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (!mpcb->dfin_combined)
++ return;
++ }
++
++ /* If no other subflow has data to send, we can combine */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ if (!tcp_write_queue_empty(sk_it))
++ all_empty = 0;
++ }
++
++ /* If all data has been DATA_ACKed, we can combine.
++ * -1, because the data_fin consumed one byte
++ */
++ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
++
++ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ }
++}
++
++static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *start = ptr;
++ __u16 data_len;
++
++ *ptr++ = htonl(tcb->seq); /* data_seq */
++
++ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ *ptr++ = 0; /* subseq */
++ else
++ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
++
++ if (tcb->mptcp_flags & MPTCPHDR_INF)
++ data_len = 0;
++ else
++ data_len = tcb->end_seq - tcb->seq;
++
++ if (tp->mpcb->dss_csum && data_len) {
++ __be16 *p16 = (__be16 *)ptr;
++ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
++ __wsum csum;
++
++ *ptr = htonl(((data_len) << 16) |
++ (TCPOPT_EOL << 8) |
++ (TCPOPT_EOL));
++ csum = csum_partial(ptr - 2, 12, skb->csum);
++ p16++;
++ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
++ } else {
++ *ptr++ = htonl(((data_len) << 16) |
++ (TCPOPT_NOP << 8) |
++ (TCPOPT_NOP));
++ }
++
++ return ptr - start;
++}
++
++static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ struct mp_dss *mdss = (struct mp_dss *)ptr;
++ __be32 *start = ptr;
++
++ mdss->kind = TCPOPT_MPTCP;
++ mdss->sub = MPTCP_SUB_DSS;
++ mdss->rsv1 = 0;
++ mdss->rsv2 = 0;
++ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
++ mdss->m = 0;
++ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
++ mdss->a = 0;
++ mdss->A = 1;
++ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
++ ptr++;
++
++ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ return ptr - start;
++}
++
++/* RFC6824 states that once a particular subflow mapping has been sent
++ * out it must never be changed. However, packets may be split while
++ * they are in the retransmission queue (due to SACK or ACKs) and that
++ * arguably means that we would change the mapping (e.g. it splits it,
++ * or sends out a subset of the initial mapping).
++ *
++ * Furthermore, the skb checksum is not always preserved across splits
++ * (e.g. mptcp_fragment) which would mean that we need to recompute
++ * the DSS checksum in this case.
++ *
++ * To avoid this we save the initial DSS mapping which allows us to
++ * send the same DSS mapping even for fragmented retransmits.
++ */
++static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
++{
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *ptr = (__be32 *)tcb->dss;
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
++}
++
++/* Write the saved DSS mapping to the header */
++static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
++ __be32 *ptr)
++{
++ __be32 *start = ptr;
++
++ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
++
++ /* update the data_ack */
++ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ /* dss is in a union with inet_skb_parm and
++ * the IP layer expects zeroed IPCB fields.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ return mptcp_dss_len/sizeof(*ptr);
++}
++
++static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb;
++ struct sk_buff *subskb = NULL;
++
++ if (!reinject)
++ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
++ MPTCPHDR_SEQ64_INDEX : 0);
++
++ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
++ if (!subskb)
++ return false;
++
++ /* At the subflow-level we need to call tcp_init_tso_segs again. We
++ * force this, by setting gso_segs to 0. It has been set to 1 prior to
++ * the call to mptcp_skb_entail.
++ */
++ skb_shinfo(subskb)->gso_segs = 0;
++
++ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
++
++ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
++ skb->ip_summed == CHECKSUM_PARTIAL) {
++ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
++ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
++ }
++
++ tcb = TCP_SKB_CB(subskb);
++
++ if (tp->mpcb->send_infinite_mapping &&
++ !tp->mpcb->infinite_mapping_snd &&
++ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
++ tp->mptcp->fully_established = 1;
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
++ tcb->mptcp_flags |= MPTCPHDR_INF;
++ }
++
++ if (mptcp_is_data_fin(subskb))
++ mptcp_combine_dfin(subskb, meta_sk, sk);
++
++ mptcp_save_dss_data_seq(tp, subskb);
++
++ tcb->seq = tp->write_seq;
++ tcb->sacked = 0; /* reset the sacked field: from the point of view
++ * of this subflow, we are sending a brand new
++ * segment
++ */
++ /* Take into account seg len */
++ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
++ tcb->end_seq = tp->write_seq;
++
++ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
++ * segment is not part of the subflow but on a meta-only-level.
++ */
++ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
++ tcp_add_write_queue_tail(sk, subskb);
++ sk->sk_wmem_queued += subskb->truesize;
++ sk_mem_charge(sk, subskb->truesize);
++ } else {
++ int err;
++
++ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
++ * skb->len = 0 will force tso_segs to 1.
++ */
++ tcp_init_tso_segs(sk, subskb, 1);
++ /* Empty data-fins are sent immediately on the subflow */
++ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
++ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
++
++ /* It has not been queued, we can free it now. */
++ kfree_skb(subskb);
++
++ if (err)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->second_packet = 1;
++ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
++ }
++
++ return true;
++}
++
++/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
++ * might need to undo some operations done by tcp_fragment.
++ */
++static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
++ gfp_t gfp, int reinject)
++{
++ int ret, diff, old_factor;
++ struct sk_buff *buff;
++ u8 flags;
++
++ if (skb_headlen(skb) < len)
++ diff = skb->len - len;
++ else
++ diff = skb->data_len;
++ old_factor = tcp_skb_pcount(skb);
++
++ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
++ * At the MPTCP-level we do not care about the absolute value. All we
++ * care about is that it is set to 1 for accurate packets_out
++ * accounting.
++ */
++ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
++ if (ret)
++ return ret;
++
++ buff = skb->next;
++
++ flags = TCP_SKB_CB(skb)->mptcp_flags;
++ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
++ TCP_SKB_CB(buff)->mptcp_flags = flags;
++ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
++
++ /* If reinject == 1, the buff will be added to the reinject
++ * queue, which is currently not part of memory accounting. So
++ * undo the changes done by tcp_fragment and update the
++ * reinject queue. Also, undo changes to the packet counters.
++ */
++ if (reinject == 1) {
++ int undo = buff->truesize - diff;
++ meta_sk->sk_wmem_queued -= undo;
++ sk_mem_uncharge(meta_sk, undo);
++
++ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
++ meta_sk->sk_write_queue.qlen--;
++
++ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
++ undo = old_factor - tcp_skb_pcount(skb) -
++ tcp_skb_pcount(buff);
++ if (undo)
++ tcp_adjust_pcount(meta_sk, skb, -undo);
++ }
++ }
++
++ return 0;
++}
++
++/* Inspired by tcp_write_wakeup */
++int mptcp_write_wakeup(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++ struct sock *sk_it;
++ int ans = 0;
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return -1;
++
++ skb = tcp_send_head(meta_sk);
++ if (skb &&
++ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
++ unsigned int mss;
++ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
++ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
++ struct tcp_sock *subtp;
++ if (!subsk)
++ goto window_probe;
++ subtp = tcp_sk(subsk);
++ mss = tcp_current_mss(subsk);
++
++ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
++ tcp_wnd_end(subtp) - subtp->write_seq);
++
++ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
++ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We are probing the opening of a window
++ * but the window size is != 0; this must have been
++ * the result of SWS avoidance (sender).
++ */
++ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
++ skb->len > mss) {
++ seg_size = min(seg_size, mss);
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (mptcp_fragment(meta_sk, skb, seg_size,
++ GFP_ATOMIC, 0))
++ return -1;
++ } else if (!tcp_skb_pcount(skb)) {
++ /* see mptcp_write_xmit on why we use UINT_MAX */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++ }
++
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (!mptcp_skb_entail(subsk, skb, 0))
++ return -1;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++
++ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
++
++ return 0;
++ } else {
++window_probe:
++ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
++ meta_tp->snd_una + 0xFFFF)) {
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send_ack(sk_it))
++ tcp_xmit_probe_skb(sk_it, 1);
++ }
++ }
++
++ /* At least one of the tcp_xmit_probe_skb's has to succeed */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ int ret;
++
++ if (!mptcp_sk_can_send_ack(sk_it))
++ continue;
++
++ ret = tcp_xmit_probe_skb(sk_it, 0);
++ if (unlikely(ret > 0))
++ ans = ret;
++ }
++ return ans;
++ }
++}
++
++bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
++ struct sock *subsk = NULL;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ unsigned int sent_pkts;
++ int reinject = 0;
++ unsigned int sublimit;
++
++ sent_pkts = 0;
++
++ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
++ &sublimit))) {
++ unsigned int limit;
++
++ subtp = tcp_sk(subsk);
++ mss_now = tcp_current_mss(subsk);
++
++ if (reinject == 1) {
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ /* Segment already reached the peer, take the next one */
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ continue;
++ }
++ }
++
++ /* If the segment was cloned (e.g. a meta retransmission),
++ * the header must be expanded/copied so that there is no
++ * corruption of TSO information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC))
++ break;
++
++ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
++ break;
++
++ /* Force tso_segs to 1 by using UINT_MAX.
++ * We actually don't care about the exact number of segments
++ * emitted on the subflow. We need just to set tso_segs, because
++ * we still need an accurate packets_out count in
++ * tcp_event_new_data_sent.
++ */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++
++ /* Check for nagle, regardless of tso_segs. If the segment is
++ * actually larger than mss_now (TSO segment), then
++ * tcp_nagle_check will have partial == false and always trigger
++ * the transmission.
++ * tcp_write_xmit has a TSO-level nagle check which is not
++ * subject to the MPTCP-level. It is based on the properties of
++ * the subflow, not the MPTCP-level.
++ */
++ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
++ (tcp_skb_is_last(meta_sk, skb) ?
++ nonagle : TCP_NAGLE_PUSH))))
++ break;
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ /* We limit the size of the skb so that it fits into the
++ * window. Call tcp_mss_split_point to avoid duplicating
++ * code.
++ * We really only care about fitting the skb into the
++ * window. That's why we use UINT_MAX. If the skb does
++ * not fit into the cwnd_quota or the NIC's max-segs
++ * limitation, it will be split by the subflow's
++ * tcp_write_xmit which does the appropriate call to
++ * tcp_mss_split_point.
++ */
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ nonagle);
++
++ if (sublimit)
++ limit = min(limit, sublimit);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
++ break;
++
++ if (!mptcp_skb_entail(subsk, skb, reinject))
++ break;
++ /* Nagle is handled at the MPTCP-layer, so
++ * always push on the subflow
++ */
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ if (!reinject) {
++ mptcp_check_sndseq_wrap(meta_tp,
++ TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++ }
++
++ tcp_minshall_update(meta_tp, mss_now, skb);
++ sent_pkts += tcp_skb_pcount(skb);
++
++ if (reinject > 0) {
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ kfree_skb(skb);
++ }
++
++ if (push_one)
++ break;
++ }
++
++ return !meta_tp->packets_out && tcp_send_head(meta_sk);
++}
++
++void mptcp_write_space(struct sock *sk)
++{
++ mptcp_push_pending_frames(mptcp_meta_sk(sk));
++}
++
++u32 __mptcp_select_window(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ int mss, free_space, full_space, window;
++
++ /* MSS for the peer's data. Previous versions used mss_clamp
++ * here. I don't know if the value based on our guesses
++ * of peer's MSS is better for the performance. It's more correct
++ * but may be worse for the performance because of rcv_mss
++ * fluctuations. --SAW 1998/11/1
++ */
++ mss = icsk->icsk_ack.rcv_mss;
++ free_space = tcp_space(sk);
++ full_space = min_t(int, meta_tp->window_clamp,
++ tcp_full_space(sk));
++
++ if (mss > full_space)
++ mss = full_space;
++
++ if (free_space < (full_space >> 1)) {
++ icsk->icsk_ack.quick = 0;
++
++ if (tcp_memory_pressure)
++ /* TODO this has to be adapted when we support different
++ * MSS's among the subflows.
++ */
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
++ 4U * meta_tp->advmss);
++
++ if (free_space < mss)
++ return 0;
++ }
++
++ if (free_space > meta_tp->rcv_ssthresh)
++ free_space = meta_tp->rcv_ssthresh;
++
++ /* Don't do rounding if we are using window scaling, since the
++ * scaled window will not line up with the MSS boundary anyway.
++ */
++ window = meta_tp->rcv_wnd;
++ if (tp->rx_opt.rcv_wscale) {
++ window = free_space;
++
++ /* Advertise enough space so that it won't get scaled away.
++ * Important case: prevent zero window announcement if
++ * 1<<rcv_wscale > mss.
++ */
++ if (((window >> tp->rx_opt.rcv_wscale) << tp->
++ rx_opt.rcv_wscale) != window)
++ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
++ << tp->rx_opt.rcv_wscale);
++ } else {
++ /* Get the largest window that is a nice multiple of mss.
++ * Window clamp already applied above.
++ * If our current window offering is within 1 mss of the
++ * free space we just keep it. This prevents the divide
++ * and multiply from happening most of the time.
++ * We also don't do any window rounding when the free space
++ * is too small.
++ */
++ if (window <= free_space - mss || window > free_space)
++ window = (free_space / mss) * mss;
++ else if (mss == full_space &&
++ free_space > window + (full_space >> 1))
++ window = free_space;
++ }
++
++ return window;
++}
++
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++
++ opts->options |= OPTION_MPTCP;
++ if (is_master_tp(tp)) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ opts->mp_capable.sender_key = tp->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum;
++ } else {
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
++ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
++ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
++ opts->addr_id = tp->mptcp->loc_id;
++ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
++ }
++}
++
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts, unsigned *remaining)
++{
++ struct mptcp_request_sock *mtreq;
++ mtreq = mptcp_rsk(req);
++
++ opts->options |= OPTION_MPTCP;
++ /* MPCB not yet set - thus it's a new MPTCP-session */
++ if (!mtreq->is_sub) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
++ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ } else {
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
++ opts->mp_join_syns.sender_truncated_mac =
++ mtreq->mptcp_hash_tmac;
++ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
++ opts->mp_join_syns.low_prio = mtreq->low_prio;
++ opts->addr_id = mtreq->loc_id;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
++ }
++}
++
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
++
++ /* We are coming from tcp_current_mss with the meta_sk as an argument.
++ * It does not make sense to check for the options, because when the
++ * segment gets sent, another subflow will be chosen.
++ */
++ if (!skb && is_meta_sk(sk))
++ return;
++
++ /* In fallback mp_fail-mode, we have to repeat it until the fallback
++ * has been done by the sender
++ */
++ if (unlikely(tp->mptcp->send_mp_fail)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FAIL;
++ *size += MPTCP_SUB_LEN_FAIL;
++ return;
++ }
++
++ if (unlikely(tp->send_mp_fclose)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FCLOSE;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
++ return;
++ }
++
++ /* 1. If we are the sender of the infinite-mapping, we need the
++ * MPTCPHDR_INF-flag, because a retransmission of the
++ * infinite-announcement still needs the mptcp-option.
++ *
++ * We need infinite_cutoff_seq, because retransmissions from before
++ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
++ * consistent.
++ *
++ * 2. If we are the receiver of the infinite-mapping, we always skip
++ * mptcp-options, because acknowledgments from before the
++ * infinite-mapping point have already been sent out.
++ *
++ * I know, the whole infinite-mapping stuff is ugly...
++ *
++ * TODO: Handle wrapped data-sequence numbers
++ * (even if it's very unlikely)
++ */
++ if (unlikely(mpcb->infinite_mapping_snd) &&
++ ((mpcb->send_infinite_mapping && tcb &&
++ mptcp_is_data_seq(skb) &&
++ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
++ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
++ !mpcb->send_infinite_mapping))
++ return;
++
++ if (unlikely(tp->mptcp->include_mpc)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_CAPABLE |
++ OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
++ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ opts->dss_csum = mpcb->dss_csum;
++
++ if (skb)
++ tp->mptcp->include_mpc = 0;
++ }
++ if (unlikely(tp->mptcp->pre_established)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
++ }
++
++ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_DATA_ACK;
++ /* If !skb, we come from tcp_current_mss and thus we always
++ * assume that the DSS-option will be set for the data-packet.
++ */
++ if (skb && !mptcp_is_data_seq(skb)) {
++ *size += MPTCP_SUB_LEN_ACK_ALIGN;
++ } else {
++ /* It doesn't matter whether the csum is included or not.
++ * The length will be either 10 or 12, and thus aligned = 12
++ */
++ *size += MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++ }
++
++ *size += MPTCP_SUB_LEN_DSS_ALIGN;
++ }
++
++ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
++ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
++
++ if (unlikely(tp->mptcp->send_mp_prio) &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_PRIO;
++ if (skb)
++ tp->mptcp->send_mp_prio = 0;
++ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
++ }
++
++ return;
++}
++
++u16 mptcp_select_window(struct sock *sk)
++{
++ u16 new_win = tcp_select_window(sk);
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
++
++ meta_tp->rcv_wnd = tp->rcv_wnd;
++ meta_tp->rcv_wup = meta_tp->rcv_nxt;
++
++ return new_win;
++}
++
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
++ struct mp_capable *mpc = (struct mp_capable *)ptr;
++
++ mpc->kind = TCPOPT_MPTCP;
++
++ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
++ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
++ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->receiver_key = opts->mp_capable.receiver_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
++ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
++ }
++
++ mpc->sub = MPTCP_SUB_CAPABLE;
++ mpc->ver = 0;
++ mpc->a = opts->dss_csum;
++ mpc->b = 0;
++ mpc->rsv = 0;
++ mpc->h = 1;
++ }
++
++ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
++ struct mp_join *mpj = (struct mp_join *)ptr;
++
++ mpj->kind = TCPOPT_MPTCP;
++ mpj->sub = MPTCP_SUB_JOIN;
++ mpj->rsv = 0;
++
++ if (OPTION_TYPE_SYN & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
++ mpj->u.syn.token = opts->mp_join_syns.token;
++ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
++ mpj->u.synack.mac =
++ opts->mp_join_syns.sender_truncated_mac;
++ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
++ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
++ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
++ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ mpadd->kind = TCPOPT_MPTCP;
++ if (opts->add_addr_v4) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 4;
++ mpadd->addr_id = opts->add_addr4.addr_id;
++ mpadd->u.v4.addr = opts->add_addr4.addr;
++ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
++ } else if (opts->add_addr_v6) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 6;
++ mpadd->addr_id = opts->add_addr6.addr_id;
++ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
++ sizeof(mpadd->u.v6.addr));
++ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ u8 *addrs_id;
++ int id, len, len_align;
++
++ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
++ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
++
++ mprem->kind = TCPOPT_MPTCP;
++ mprem->len = len;
++ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
++ mprem->rsv = 0;
++ addrs_id = &mprem->addrs_id;
++
++ mptcp_for_each_bit_set(opts->remove_addrs, id)
++ *(addrs_id++) = id;
++
++ /* Fill the rest with NOP's */
++ if (len_align > len) {
++ int i;
++ for (i = 0; i < len_align - len; i++)
++ *(addrs_id++) = TCPOPT_NOP;
++ }
++
++ ptr += len_align >> 2;
++ }
++ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
++ struct mp_fail *mpfail = (struct mp_fail *)ptr;
++
++ mpfail->kind = TCPOPT_MPTCP;
++ mpfail->len = MPTCP_SUB_LEN_FAIL;
++ mpfail->sub = MPTCP_SUB_FAIL;
++ mpfail->rsv1 = 0;
++ mpfail->rsv2 = 0;
++ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
++
++ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
++ }
++ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
++ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
++
++ mpfclose->kind = TCPOPT_MPTCP;
++ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
++ mpfclose->sub = MPTCP_SUB_FCLOSE;
++ mpfclose->rsv1 = 0;
++ mpfclose->rsv2 = 0;
++ mpfclose->key = opts->mp_capable.receiver_key;
++
++ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
++ }
++
++ if (OPTION_DATA_ACK & opts->mptcp_options) {
++ if (!mptcp_is_data_seq(skb))
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ else
++ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
++ }
++ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
++ struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ mpprio->kind = TCPOPT_MPTCP;
++ mpprio->len = MPTCP_SUB_LEN_PRIO;
++ mpprio->sub = MPTCP_SUB_PRIO;
++ mpprio->rsv = 0;
++ mpprio->b = tp->mptcp->low_prio;
++ mpprio->addr_id = TCPOPT_NOP;
++
++ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
++ }
++}
++
++/* Sends the datafin */
++void mptcp_send_fin(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
++ int mss_now;
++
++ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
++ meta_tp->mpcb->passive_close = 1;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = mptcp_current_mss(meta_sk);
++
++ if (tcp_send_head(meta_sk) != NULL) {
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ meta_tp->write_seq++;
++ } else {
++ /* Socket is locked, keep trying until memory is available. */
++ for (;;) {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER,
++ meta_sk->sk_allocation);
++ if (skb)
++ break;
++ yield();
++ }
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++
++ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
++ TCP_SKB_CB(skb)->end_seq++;
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ tcp_queue_skb(meta_sk, skb);
++ }
++ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
++}
++
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
++
++ if (!mpcb->cnt_subflows)
++ return;
++
++ WARN_ON(meta_tp->send_mp_fclose);
++
++ /* First - select a socket */
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ /* May happen if no subflow is in an appropriate state */
++ if (!sk)
++ return;
++
++ /* We are in infinite mode - just send a reset */
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
++ sk->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk->sk_state))
++ tcp_send_active_reset(sk, priority);
++ mptcp_sub_force_close(sk);
++ return;
++ }
++
++
++ tcp_sk(sk)->send_mp_fclose = 1;
++ /* Reset all other subflows */
++
++ /* tcp_done must be handled with bh disabled */
++ if (!in_serving_softirq())
++ local_bh_disable();
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_send_active_reset(sk_it, GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++
++ if (!in_serving_softirq())
++ local_bh_enable();
++
++ tcp_send_ack(sk);
++ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
++
++ meta_tp->send_mp_fclose = 1;
++}
++
++static void mptcp_ack_retransmit_timer(struct sock *sk)
++{
++ struct sk_buff *skb;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
++ goto out; /* Routing failure or similar */
++
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk)) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++ goto out;
++ }
++
++ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (skb == NULL) {
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ /* Reserve space for headers and prepare control bits */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
++
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!icsk->icsk_retransmits)
++ icsk->icsk_retransmits = 1;
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++
++ icsk->icsk_retransmits++;
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
++ __sk_dst_reset(sk);
++
++out:;
++}
++
++void mptcp_ack_handler(unsigned long data)
++{
++ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later */
++ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
++ jiffies + (HZ / 20));
++ goto out_unlock;
++ }
++
++ if (sk->sk_state == TCP_CLOSE)
++ goto out_unlock;
++ if (!tcp_sk(sk)->mptcp->pre_established)
++ goto out_unlock;
++
++ mptcp_ack_retransmit_timer(sk);
++
++ sk_mem_reclaim(sk);
++
++out_unlock:
++ bh_unlock_sock(meta_sk);
++ sock_put(sk);
++}
++
++/* Similar to tcp_retransmit_skb
++ *
++ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
++ * meta-level.
++ */
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *subsk;
++ unsigned int limit, mss_now;
++ int err = -1;
++
++ /* Do not send more than we queued. 1/4 is reserved for possible
++ * copying overhead: fragmentation, tunneling, mangling etc.
++ *
++ * This is a meta-retransmission thus we check on the meta-socket.
++ */
++ if (atomic_read(&meta_sk->sk_wmem_alloc) >
++ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
++ return -EAGAIN;
++ }
++
++ /* We need to make sure that the retransmitted segment can be sent on a
++ * subflow right now. If it is too big, it needs to be fragmented.
++ */
++ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
++ if (!subsk) {
++ /* We want to increase icsk_retransmits, thus return 0, so that
++ * mptcp_retransmit_timer enters the desired branch.
++ */
++ err = 0;
++ goto failed;
++ }
++ mss_now = tcp_current_mss(subsk);
++
++ /* If the segment was cloned (e.g. a meta retransmission), the header
++ * must be expanded/copied so that there is no corruption of TSO
++ * information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC)) {
++ err = -ENOMEM;
++ goto failed;
++ }
++
++ /* Must have been set by mptcp_write_xmit before */
++ BUG_ON(!tcp_skb_pcount(skb));
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ TCP_NAGLE_OFF);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit,
++ GFP_ATOMIC, 0)))
++ goto failed;
++
++ if (!mptcp_skb_entail(subsk, skb, -1))
++ goto failed;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ /* Update global TCP statistics. */
++ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
++
++ /* Diff to tcp_retransmit_skb */
++
++ /* Save stamp of the first retransmit. */
++ if (!meta_tp->retrans_stamp)
++ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
++
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++
++ return 0;
++
++failed:
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
++ return err;
++}
++
++/* Similar to tcp_retransmit_timer
++ *
++ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
++ * and that we don't have an srtt estimation at the meta-level.
++ */
++void mptcp_retransmit_timer(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ int err;
++
++ /* In fallback, retransmission is handled at the subflow-level */
++ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping)
++ return;
++
++ WARN_ON(tcp_write_queue_empty(meta_sk));
++
++ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
++ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
++ /* Receiver dastardly shrinks window. Our retransmits
++ * become zero probes, but we should not timeout this
++ * connection. If the socket is an orphan, time it out,
++ * we cannot allow such beasts to hang infinitely.
++ */
++ struct inet_sock *meta_inet = inet_sk(meta_sk);
++ if (meta_sk->sk_family == AF_INET) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_inet->inet_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (meta_sk->sk_family == AF_INET6) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_sk->sk_v6_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#endif
++ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
++ tcp_write_err(meta_sk);
++ return;
++ }
++
++ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ goto out_reset_timer;
++ }
++
++ if (tcp_write_timeout(meta_sk))
++ return;
++
++ if (meta_icsk->icsk_retransmits == 0)
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
++
++ meta_icsk->icsk_ca_state = TCP_CA_Loss;
++
++ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ if (err > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!meta_icsk->icsk_retransmits)
++ meta_icsk->icsk_retransmits = 1;
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
++ TCP_RTO_MAX);
++ return;
++ }
++
++ /* Increase the timeout each time we retransmit. Note that
++ * we do not increase the rtt estimate. rto is initialized
++ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
++ * that doubling rto each time is the least we can get away with.
++ * In KA9Q, Karn uses this for the first few times, and then
++ * goes to quadratic. netBSD doubles, but only goes up to *64,
++ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
++ * defined in the protocol as the maximum possible RTT. I guess
++ * we'll have to use something other than TCP to talk to the
++ * University of Mars.
++ *
++ * PAWS allows us longer timeouts and large windows, so once
++ * implemented ftp to mars will work nicely. We will have to fix
++ * the 120 second clamps though!
++ */
++ meta_icsk->icsk_backoff++;
++ meta_icsk->icsk_retransmits++;
++
++out_reset_timer:
++ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
++ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
++ * might be increased if the stream oscillates between thin and thick,
++ * thus the old value might already be too high compared to the value
++ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
++ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
++ * exponential backoff behaviour, to avoid continuing to hammer
++ * linear-timeout retransmissions into a black hole.
++ */
++ if (meta_sk->sk_state == TCP_ESTABLISHED &&
++ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
++ tcp_stream_is_thin(meta_tp) &&
++ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
++ meta_icsk->icsk_backoff = 0;
++ /* We cannot do the same as in tcp_write_timer because the
++ * srtt is not set here.
++ */
++ mptcp_set_rto(meta_sk);
++ } else {
++ /* Use normal (exponential) backoff */
++ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
++ }
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
++
++ return;
++}
++
++/* Modify values to an mptcp-level for the initial window of new subflows */
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ *window_clamp = mpcb->orig_window_clamp;
++ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
++
++ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
++ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
++}
++
++static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ struct sock *sk;
++ u64 rate = 0;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ /* Do not consider subflows without a RTT estimation yet
++ * otherwise this_rate >>> rate.
++ */
++ if (unlikely(!tp->srtt_us))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* If this_mss is smaller than mss, it means that a segment will
++ * be split in two (or more) when pushed on this subflow. If
++ * you consider that mss = 1428 and this_mss = 1420, then two
++ * segments will be generated: a 1420-byte and an 8-byte segment.
++ * The latter introduces a large overhead, as a single data
++ * segment then occupies 2 slots in the congestion window,
++ * thereby reducing the potential throughput of this subflow
++ * by ~2. Indeed, 1428 bytes will be sent while 2840 could have
++ * been sent if mss == 1420, reducing the throughput by 2840 / 1428.
++ *
++ * The following algorithm takes this overhead into account
++ * when computing the potential throughput that MPTCP can
++ * achieve when generating mss-byte segments.
++ *
++ * The formula is the following:
++ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
++ * Where ratio is computed as follows:
++ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
++ *
++ * ratio gives the reduction factor of the theoretical
++ * throughput a subflow can achieve if MPTCP uses a specific
++ * MSS value.
++ */
++ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
++ max(tp->snd_cwnd, tp->packets_out),
++ (u64)tp->srtt_us *
++ DIV_ROUND_UP(mss, this_mss) * this_mss);
++ rate += this_rate;
++ }
++
++ return rate;
++}
++
++static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ unsigned int mss = 0;
++ u64 rate = 0;
++ struct sock *sk;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* Same mss values will produce the same throughput. */
++ if (this_mss == mss)
++ continue;
++
++ /* See whether using this mss value can theoretically improve
++ * the performance.
++ */
++ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
++ if (this_rate >= rate) {
++ mss = this_mss;
++ rate = this_rate;
++ }
++ }
++
++ return mss;
++}
++
++unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
++
++ /* If no subflow is available, we take a default-mss from the
++ * meta-socket.
++ */
++ return !mss ? tcp_current_mss(meta_sk) : mss;
++}
++
++static unsigned int mptcp_select_size_mss(struct sock *sk)
++{
++ return tcp_sk(sk)->mss_cache;
++}
++
++int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
++
++ if (sg) {
++ if (mptcp_sk_can_gso(meta_sk)) {
++ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
++ } else {
++ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
++
++ if (mss >= pgbreak &&
++ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
++ mss = pgbreak;
++ }
++ }
++
++ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
++}
++
++int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ const struct sock *sk;
++ u32 rtt_max = tp->srtt_us;
++ u64 bw_est;
++
++ if (!tp->srtt_us)
++ return tp->reordering + 1;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->srtt_us)
++ rtt_max = tcp_sk(sk)->srtt_us;
++ }
++
++ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
++ (u64)tp->srtt_us);
++
++ return max_t(unsigned int, (u32)(bw_est >> 16),
++ tp->reordering + 1);
++}
++
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed)
++{
++ struct sock *sk;
++ u32 xmit_size_goal = 0;
++
++ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_size_goal;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
++ if (this_size_goal > xmit_size_goal)
++ xmit_size_goal = this_size_goal;
++ }
++ }
++
++ return max(xmit_size_goal, mss_now);
++}
++
++/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ if (skb_cloned(skb)) {
++ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
++ return -ENOMEM;
++ }
++
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++
++ skb->truesize -= len;
++ sk->sk_wmem_queued -= len;
++ sk_mem_uncharge(sk, len);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
++
++ return 0;
++}
+diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
+new file mode 100644
+index 000000000000..9542f950729f
+--- /dev/null
++++ b/net/mptcp/mptcp_pm.c
+@@ -0,0 +1,169 @@
++/*
++ * MPTCP implementation - MPTCP-subflow-management
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_pm_list_lock);
++static LIST_HEAD(mptcp_pm_list);
++
++static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++struct mptcp_pm_ops mptcp_pm_default = {
++ .get_local_id = mptcp_default_id, /* We do not care */
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
++{
++ struct mptcp_pm_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
++{
++ int ret = 0;
++
++ if (!pm->get_local_id)
++ return -EINVAL;
++
++ spin_lock(&mptcp_pm_list_lock);
++ if (mptcp_pm_find(pm->name)) {
++ pr_notice("%s already registered\n", pm->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
++ pr_info("%s registered\n", pm->name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
++
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
++{
++ spin_lock(&mptcp_pm_list_lock);
++ list_del_rcu(&pm->list);
++ spin_unlock(&mptcp_pm_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
++
++void mptcp_get_default_path_manager(char *name)
++{
++ struct mptcp_pm_ops *pm;
++
++ BUG_ON(list_empty(&mptcp_pm_list));
++
++ rcu_read_lock();
++ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
++ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_path_manager(const char *name)
++{
++ struct mptcp_pm_ops *pm;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++#ifdef CONFIG_MODULES
++ if (!pm && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_pm_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++ }
++#endif
++
++ if (pm) {
++ list_move(&pm->list, &mptcp_pm_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_path_manager(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
++ if (try_module_get(pm->owner)) {
++ mpcb->pm_ops = pm;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->pm_ops->owner);
++}
++
++/* Fallback to the default path-manager. */
++void mptcp_fallback_default(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ mptcp_cleanup_path_manager(mpcb);
++ pm = mptcp_pm_find("default");
++
++ /* Cannot fail - it's the default module */
++ try_module_get(pm->owner);
++ mpcb->pm_ops = pm;
++}
++EXPORT_SYMBOL_GPL(mptcp_fallback_default);
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_path_manager_default(void)
++{
++ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
++}
++late_initcall(mptcp_path_manager_default);
+diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
+new file mode 100644
+index 000000000000..93278f684069
+--- /dev/null
++++ b/net/mptcp/mptcp_rr.c
+@@ -0,0 +1,301 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static unsigned char num_segments __read_mostly = 1;
++module_param(num_segments, byte, 0644);
++MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
++
++static bool cwnd_limited __read_mostly = 1;
++module_param(cwnd_limited, bool, 0644);
++MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
++
++struct rrsched_priv {
++ unsigned char quota;
++};
++
++static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test, bool cwnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ if (!cwnd_test)
++ goto zero_wnd_test;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++zero_wnd_test:
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* We just look for any subflow that is available */
++static struct sock *rr_get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ continue;
++
++ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ bestsk = sk;
++ }
++
++ if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb)
++ *reinject = 1;
++ else
++ skb = tcp_send_head(meta_sk);
++ return skb;
++}
++
++static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk_it, *choose_sk = NULL;
++ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
++ unsigned char split = num_segments;
++ unsigned char iter = 0, full_subs = 0;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ if (*reinject) {
++ *subsk = rr_get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ return skb;
++ }
++
++retry:
++
++ /* First, we look for a subflow that is currently being used */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ iter++;
++
++ /* Is this subflow currently being used? */
++ if (rsp->quota > 0 && rsp->quota < num_segments) {
++ split = num_segments - rsp->quota;
++ choose_sk = sk_it;
++ goto found;
++ }
++
++ /* Or, it's totally unused */
++ if (!rsp->quota) {
++ split = num_segments;
++ choose_sk = sk_it;
++ }
++
++ /* Or, it must then be fully used */
++ if (rsp->quota == num_segments)
++ full_subs++;
++ }
++
++ /* All considered subflows have a full quota, and we considered at
++ * least one.
++ */
++ if (iter && iter == full_subs) {
++ /* So, we restart this round by setting quota to 0 and retry
++ * to find a subflow.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ rsp->quota = 0;
++ }
++
++ goto retry;
++ }
++
++found:
++ if (choose_sk) {
++ unsigned int mss_now;
++ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
++ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
++
++ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
++ return NULL;
++
++ *subsk = choose_sk;
++ mss_now = tcp_current_mss(*subsk);
++ *limit = split * mss_now;
++
++ if (skb->len > mss_now)
++ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
++ else
++ rsp->quota++;
++
++ return skb;
++ }
++
++ return NULL;
++}
++
++static struct mptcp_sched_ops mptcp_sched_rr = {
++ .get_subflow = rr_get_available_subflow,
++ .next_segment = mptcp_rr_next_segment,
++ .name = "roundrobin",
++ .owner = THIS_MODULE,
++};
++
++static int __init rr_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
++
++ if (mptcp_register_scheduler(&mptcp_sched_rr))
++ return -1;
++
++ return 0;
++}
++
++static void rr_unregister(void)
++{
++ mptcp_unregister_scheduler(&mptcp_sched_rr);
++}
++
++module_init(rr_register);
++module_exit(rr_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
++MODULE_VERSION("0.89");
+diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
+new file mode 100644
+index 000000000000..6c7ff4eceac1
+--- /dev/null
++++ b/net/mptcp/mptcp_sched.c
+@@ -0,0 +1,493 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_sched_list_lock);
++static LIST_HEAD(mptcp_sched_list);
++
++struct defsched_priv {
++ u32 last_rbuf_opti;
++};
++
++static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int mss_now, space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ /* If TSQ is already throttling us, do not send on this subflow. When
++ * TSQ gets cleared the subflow becomes eligible again.
++ */
++ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
++ return false;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ mss_now = tcp_current_mss(sk);
++
++ /* Don't send on this subflow if we bypass the allowed send-window at
++ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
++ * calculated end_seq (because here at this point end_seq is still at
++ * the meta-level).
++ */
++ if (skb && !zero_wnd_test &&
++ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* This is the scheduler. This function decides on which flow to send
++ * a given MSS. If all subflows are found to be busy, NULL is returned
++ * The flow is selected based on the shortest RTT.
++ * If all paths have full cong windows, we simply return NULL.
++ *
++ * Additionally, this function is aware of the backup-subflows.
++ */
++static struct sock *get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
++ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
++ int cnt_backups = 0;
++
++ /* if there is only one subflow, bypass the scheduling function */
++ if (mpcb->cnt_subflows == 1) {
++ bestsk = (struct sock *)mpcb->connection_list;
++ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
++ bestsk = NULL;
++ return bestsk;
++ }
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_is_available(sk, skb, zero_wnd_test))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
++ cnt_backups++;
++
++ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < lowprio_min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ lowprio_min_time_to_peer = tp->srtt_us;
++ lowpriosk = sk;
++ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ min_time_to_peer = tp->srtt_us;
++ bestsk = sk;
++ }
++ }
++
++ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
++ sk = lowpriosk;
++ } else if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
++{
++ struct sock *meta_sk;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp_it;
++ struct sk_buff *skb_head;
++ struct defsched_priv *dsp = defsched_get_priv(tp);
++
++ if (tp->mpcb->cnt_subflows == 1)
++ return NULL;
++
++ meta_sk = mptcp_meta_sk(sk);
++ skb_head = tcp_write_queue_head(meta_sk);
++
++ if (!skb_head || skb_head == tcp_send_head(meta_sk))
++ return NULL;
++
++ /* If penalization is optional (coming from mptcp_next_segment()) and
++ * we are not send-buffer-limited, we do not penalize. The retransmission
++ * is just an optimization to fix the idle-time due to the delay before
++ * we wake up the application.
++ */
++ if (!penal && sk_stream_memory_free(meta_sk))
++ goto retrans;
++
++ /* Only penalize again after an RTT has elapsed */
++ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
++ goto retrans;
++
++ /* Half the cwnd of the slow flow */
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
++ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++ }
++ break;
++ }
++ }
++
++retrans:
++
++ /* Segment not yet injected into this path? Take it!!! */
++ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
++ bool do_retrans = false;
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp_it->snd_cwnd <= 4) {
++ do_retrans = true;
++ break;
++ }
++
++ if (4 * tp->srtt_us >= tp_it->srtt_us) {
++ do_retrans = false;
++ break;
++ } else {
++ do_retrans = true;
++ }
++ }
++ }
++
++ if (do_retrans && mptcp_is_available(sk, skb_head, false))
++ return skb_head;
++ }
++ return NULL;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb) {
++ *reinject = 1;
++ } else {
++ skb = tcp_send_head(meta_sk);
++
++ if (!skb && meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
++ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
++ struct sock *subsk = get_available_subflow(meta_sk, NULL,
++ false);
++ if (!subsk)
++ return NULL;
++
++ skb = mptcp_rcv_buf_optimization(subsk, 0);
++ if (skb)
++ *reinject = -1;
++ }
++ }
++ return skb;
++}
++
++static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
++ unsigned int mss_now;
++ struct tcp_sock *subtp;
++ u16 gso_max_segs;
++ u32 max_len, max_segs, window, needed;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ *subsk = get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ subtp = tcp_sk(*subsk);
++ mss_now = tcp_current_mss(*subsk);
++
++ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
++ skb = mptcp_rcv_buf_optimization(*subsk, 1);
++ if (skb)
++ *reinject = -1;
++ else
++ return NULL;
++ }
++
++ /* No splitting required, as we will only send a single segment */
++ if (skb->len <= mss_now)
++ return skb;
++
++ /* The following is similar to tcp_mss_split_point, but
++ * we do not care about nagle, because we will anyway
++ * use TCP_NAGLE_PUSH, which overrides this.
++ *
++ * So, we first limit according to the cwnd/gso-size and then according
++ * to the subflow's window.
++ */
++
++ gso_max_segs = (*subsk)->sk_gso_max_segs;
++ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
++ gso_max_segs = 1;
++ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
++ if (!max_segs)
++ return NULL;
++
++ max_len = mss_now * max_segs;
++ window = tcp_wnd_end(subtp) - subtp->write_seq;
++
++ needed = min(skb->len, window);
++ if (max_len <= skb->len)
++ /* Take max_win, which is actually the cwnd/gso-size */
++ *limit = max_len;
++ else
++ /* Or, take the window */
++ *limit = needed;
++
++ return skb;
++}
++
++static void defsched_init(struct sock *sk)
++{
++ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++}
++
++struct mptcp_sched_ops mptcp_sched_default = {
++ .get_subflow = get_available_subflow,
++ .next_segment = mptcp_next_segment,
++ .init = defsched_init,
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
++{
++ struct mptcp_sched_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
++{
++ int ret = 0;
++
++ if (!sched->get_subflow || !sched->next_segment)
++ return -EINVAL;
++
++ spin_lock(&mptcp_sched_list_lock);
++ if (mptcp_sched_find(sched->name)) {
++ pr_notice("%s already registered\n", sched->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
++ pr_info("%s registered\n", sched->name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
++
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
++{
++ spin_lock(&mptcp_sched_list_lock);
++ list_del_rcu(&sched->list);
++ spin_unlock(&mptcp_sched_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
++
++void mptcp_get_default_scheduler(char *name)
++{
++ struct mptcp_sched_ops *sched;
++
++ BUG_ON(list_empty(&mptcp_sched_list));
++
++ rcu_read_lock();
++ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
++ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_scheduler(const char *name)
++{
++ struct mptcp_sched_ops *sched;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++#ifdef CONFIG_MODULES
++ if (!sched && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_sched_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++ }
++#endif
++
++ if (sched) {
++ list_move(&sched->list, &mptcp_sched_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_scheduler(struct mptcp_cb *mpcb)
++{
++ struct mptcp_sched_ops *sched;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
++ if (try_module_get(sched->owner)) {
++ mpcb->sched_ops = sched;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->sched_ops->owner);
++}
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_scheduler_default(void)
++{
++ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
++
++ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
++}
++late_initcall(mptcp_scheduler_default);
+diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
+new file mode 100644
+index 000000000000..29ca1d868d17
+--- /dev/null
++++ b/net/mptcp/mptcp_wvegas.c
+@@ -0,0 +1,268 @@
++/*
++ * MPTCP implementation - WEIGHTED VEGAS
++ *
++ * Algorithm design:
++ * Yu Cao <cyAnalyst@126.com>
++ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
++ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
++ *
++ * Implementation:
++ * Yu Cao <cyAnalyst@126.com>
++ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++#include <linux/module.h>
++#include <linux/tcp.h>
++
++static int initial_alpha = 2;
++static int total_alpha = 10;
++static int gamma = 1;
++
++module_param(initial_alpha, int, 0644);
++MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
++module_param(total_alpha, int, 0644);
++MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
++module_param(gamma, int, 0644);
++MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
++
++#define MPTCP_WVEGAS_SCALE 16
++
++/* wVegas variables */
++struct wvegas {
++ u32 beg_snd_nxt; /* right edge during last RTT */
++ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
++
++ u16 cnt_rtt; /* # of RTTs measured within last RTT */
++ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
++ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
++
++ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
++ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
++ int alpha; /* alpha for each subflows */
++
++ u32 queue_delay; /* queue delay */
++};
++
++
++static inline u64 mptcp_wvegas_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static void wvegas_enable(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 1;
++
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++
++ wvegas->instant_rate = 0;
++ wvegas->alpha = initial_alpha;
++ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
++
++ wvegas->queue_delay = 0;
++}
++
++static inline void wvegas_disable(const struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 0;
++}
++
++static void mptcp_wvegas_init(struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->base_rtt = 0x7fffffff;
++ wvegas_enable(sk);
++}
++
++static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
++{
++ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
++}
++
++static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ u32 vrtt;
++
++ if (rtt_us < 0)
++ return;
++
++ vrtt = rtt_us + 1;
++
++ if (vrtt < wvegas->base_rtt)
++ wvegas->base_rtt = vrtt;
++
++ wvegas->sampled_rtt += vrtt;
++ wvegas->cnt_rtt++;
++}
++
++static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
++{
++ if (ca_state == TCP_CA_Open)
++ wvegas_enable(sk);
++ else
++ wvegas_disable(sk);
++}
++
++static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_CWND_RESTART) {
++ mptcp_wvegas_init(sk);
++ } else if (event == CA_EVENT_LOSS) {
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ wvegas->instant_rate = 0;
++ }
++}
++
++static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
++{
++ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
++}
++
++static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
++{
++ u64 total_rate = 0;
++ struct sock *sub_sk;
++ const struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!mpcb)
++ return wvegas->weight;
++
++
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
++
++ /* sampled_rtt is initialized to 0 */
++ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
++ total_rate += sub_wvegas->instant_rate;
++ }
++
++ if (total_rate && wvegas->instant_rate)
++ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
++ else
++ return wvegas->weight;
++}
++
++static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!wvegas->doing_wvegas_now) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (after(ack, wvegas->beg_snd_nxt)) {
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ if (wvegas->cnt_rtt <= 2) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ } else {
++ u32 rtt, diff, q_delay;
++ u64 target_cwnd;
++
++ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
++ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
++
++ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
++
++ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
++ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++
++ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ } else {
++ if (diff >= wvegas->alpha) {
++ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
++ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
++ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
++ }
++ if (diff > wvegas->alpha) {
++ tp->snd_cwnd--;
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++ } else if (diff < wvegas->alpha) {
++ tp->snd_cwnd++;
++ }
++
++ /* Try to drain the link queue if needed */
++ q_delay = rtt - wvegas->base_rtt;
++ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
++ wvegas->queue_delay = q_delay;
++
++ if (q_delay >= 2 * wvegas->queue_delay) {
++ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
++ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
++ wvegas->queue_delay = 0;
++ }
++ }
++
++ if (tp->snd_cwnd < 2)
++ tp->snd_cwnd = 2;
++ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
++ tp->snd_cwnd = tp->snd_cwnd_clamp;
++
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ }
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++ }
++ /* Use normal slow start */
++ else if (tp->snd_cwnd <= tp->snd_ssthresh)
++ tcp_slow_start(tp, acked);
++}
++
++
++static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
++ .init = mptcp_wvegas_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_wvegas_cong_avoid,
++ .pkts_acked = mptcp_wvegas_pkts_acked,
++ .set_state = mptcp_wvegas_state,
++ .cwnd_event = mptcp_wvegas_cwnd_event,
++
++ .owner = THIS_MODULE,
++ .name = "wvegas",
++};
++
++static int __init mptcp_wvegas_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
++ tcp_register_congestion_control(&mptcp_wvegas);
++ return 0;
++}
++
++static void __exit mptcp_wvegas_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_wvegas);
++}
++
++module_init(mptcp_wvegas_register);
++module_exit(mptcp_wvegas_unregister);
++
++MODULE_AUTHOR("Yu Cao, Enhuan Dong");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP wVegas");
++MODULE_VERSION("0.1");
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:16 Anthony G. Basile
From: Anthony G. Basile @ 2014-10-06 11:16 UTC
To: gentoo-commits
commit: 767ed99241e0cc05f2ef12e42c95efcc2898492d
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:16:37 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:16:37 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=767ed992
Linux patch 3.16.4
---
1003_linux-3.16.4.patch | 14205 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 14205 insertions(+)
diff --git a/1003_linux-3.16.4.patch b/1003_linux-3.16.4.patch
new file mode 100644
index 0000000..c50eb2d
--- /dev/null
+++ b/1003_linux-3.16.4.patch
@@ -0,0 +1,14205 @@
+diff --git a/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt b/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
+index 1486497a24c1..ce6a1a072028 100644
+--- a/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
++++ b/Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
+@@ -4,11 +4,13 @@ Specifying interrupt information for devices
+ 1) Interrupt client nodes
+ -------------------------
+
+-Nodes that describe devices which generate interrupts must contain an either an
+-"interrupts" property or an "interrupts-extended" property. These properties
+-contain a list of interrupt specifiers, one per output interrupt. The format of
+-the interrupt specifier is determined by the interrupt controller to which the
+-interrupts are routed; see section 2 below for details.
++Nodes that describe devices which generate interrupts must contain an
++"interrupts" property, an "interrupts-extended" property, or both. If both are
++present, the latter should take precedence; the former may be provided simply
++for compatibility with software that does not recognize the latter. These
++properties contain a list of interrupt specifiers, one per output interrupt. The
++format of the interrupt specifier is determined by the interrupt controller to
++which the interrupts are routed; see section 2 below for details.
+
+ Example:
+ interrupt-parent = <&intc1>;
+diff --git a/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt b/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt
+index 578a1fca366e..443bcb6134d5 100644
+--- a/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt
++++ b/Documentation/devicetree/bindings/staging/imx-drm/ldb.txt
+@@ -56,6 +56,9 @@ Required properties:
+ - fsl,data-width : should be <18> or <24>
+ - port: A port node with endpoint definitions as defined in
+ Documentation/devicetree/bindings/media/video-interfaces.txt.
++ On i.MX5, the internal two-input-multiplexer is used.
++ Due to hardware limitations, only one port (port@[0,1])
++ can be used for each channel (lvds-channel@[0,1], respectively)
+ On i.MX6, there should be four ports (port@[0-3]) that correspond
+ to the four LVDS multiplexer inputs.
+
+@@ -78,6 +81,8 @@ ldb: ldb@53fa8008 {
+ "di0", "di1";
+
+ lvds-channel@0 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <0>;
+ fsl,data-mapping = "spwg";
+ fsl,data-width = <24>;
+@@ -86,7 +91,9 @@ ldb: ldb@53fa8008 {
+ /* ... */
+ };
+
+- port {
++ port@0 {
++ reg = <0>;
++
+ lvds0_in: endpoint {
+ remote-endpoint = <&ipu_di0_lvds0>;
+ };
+@@ -94,6 +101,8 @@ ldb: ldb@53fa8008 {
+ };
+
+ lvds-channel@1 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <1>;
+ fsl,data-mapping = "spwg";
+ fsl,data-width = <24>;
+@@ -102,7 +111,9 @@ ldb: ldb@53fa8008 {
+ /* ... */
+ };
+
+- port {
++ port@1 {
++ reg = <1>;
++
+ lvds1_in: endpoint {
+ remote-endpoint = <&ipu_di1_lvds1>;
+ };
+diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
+index b7fa2f599459..f896f68a3ba3 100644
+--- a/Documentation/kernel-parameters.txt
++++ b/Documentation/kernel-parameters.txt
+@@ -3478,6 +3478,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
+ bogus residue values);
+ s = SINGLE_LUN (the device has only one
+ Logical Unit);
++ u = IGNORE_UAS (don't bind to the uas driver);
+ w = NO_WP_DETECT (don't test whether the
+ medium is write-protected).
+ Example: quirks=0419:aaf5:rl,0421:0433:rc
+diff --git a/Makefile b/Makefile
+index 9b25a830a9d7..e75c75f0ec35 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 3
++SUBLEVEL = 4
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/dra7-evm.dts b/arch/arm/boot/dts/dra7-evm.dts
+index 83089540e324..780d66119f3c 100644
+--- a/arch/arm/boot/dts/dra7-evm.dts
++++ b/arch/arm/boot/dts/dra7-evm.dts
+@@ -50,13 +50,13 @@
+
+ mcspi1_pins: pinmux_mcspi1_pins {
+ pinctrl-single,pins = <
+- 0x3a4 (PIN_INPUT | MUX_MODE0) /* spi2_clk */
+- 0x3a8 (PIN_INPUT | MUX_MODE0) /* spi2_d1 */
+- 0x3ac (PIN_INPUT | MUX_MODE0) /* spi2_d0 */
+- 0x3b0 (PIN_INPUT_SLEW | MUX_MODE0) /* spi2_cs0 */
+- 0x3b4 (PIN_INPUT_SLEW | MUX_MODE0) /* spi2_cs1 */
+- 0x3b8 (PIN_INPUT_SLEW | MUX_MODE6) /* spi2_cs2 */
+- 0x3bc (PIN_INPUT_SLEW | MUX_MODE6) /* spi2_cs3 */
++ 0x3a4 (PIN_INPUT | MUX_MODE0) /* spi1_sclk */
++ 0x3a8 (PIN_INPUT | MUX_MODE0) /* spi1_d1 */
++ 0x3ac (PIN_INPUT | MUX_MODE0) /* spi1_d0 */
++ 0x3b0 (PIN_INPUT_SLEW | MUX_MODE0) /* spi1_cs0 */
++ 0x3b4 (PIN_INPUT_SLEW | MUX_MODE0) /* spi1_cs1 */
++ 0x3b8 (PIN_INPUT_SLEW | MUX_MODE6) /* spi1_cs2.hdmi1_hpd */
++ 0x3bc (PIN_INPUT_SLEW | MUX_MODE6) /* spi1_cs3.hdmi1_cec */
+ >;
+ };
+
+@@ -427,22 +427,19 @@
+ gpmc,device-width = <2>;
+ gpmc,sync-clk-ps = <0>;
+ gpmc,cs-on-ns = <0>;
+- gpmc,cs-rd-off-ns = <40>;
+- gpmc,cs-wr-off-ns = <40>;
++ gpmc,cs-rd-off-ns = <80>;
++ gpmc,cs-wr-off-ns = <80>;
+ gpmc,adv-on-ns = <0>;
+- gpmc,adv-rd-off-ns = <30>;
+- gpmc,adv-wr-off-ns = <30>;
+- gpmc,we-on-ns = <5>;
+- gpmc,we-off-ns = <25>;
+- gpmc,oe-on-ns = <2>;
+- gpmc,oe-off-ns = <20>;
+- gpmc,access-ns = <20>;
+- gpmc,wr-access-ns = <40>;
+- gpmc,rd-cycle-ns = <40>;
+- gpmc,wr-cycle-ns = <40>;
+- gpmc,wait-pin = <0>;
+- gpmc,wait-on-read;
+- gpmc,wait-on-write;
++ gpmc,adv-rd-off-ns = <60>;
++ gpmc,adv-wr-off-ns = <60>;
++ gpmc,we-on-ns = <10>;
++ gpmc,we-off-ns = <50>;
++ gpmc,oe-on-ns = <4>;
++ gpmc,oe-off-ns = <40>;
++ gpmc,access-ns = <40>;
++ gpmc,wr-access-ns = <80>;
++ gpmc,rd-cycle-ns = <80>;
++ gpmc,wr-cycle-ns = <80>;
+ gpmc,bus-turnaround-ns = <0>;
+ gpmc,cycle2cycle-delay-ns = <0>;
+ gpmc,clk-activation-ns = <0>;
+diff --git a/arch/arm/boot/dts/dra7.dtsi b/arch/arm/boot/dts/dra7.dtsi
+index 80127638b379..f21ef396902f 100644
+--- a/arch/arm/boot/dts/dra7.dtsi
++++ b/arch/arm/boot/dts/dra7.dtsi
+@@ -172,7 +172,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio2: gpio@48055000 {
+@@ -183,7 +183,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio3: gpio@48057000 {
+@@ -194,7 +194,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio4: gpio@48059000 {
+@@ -205,7 +205,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio5: gpio@4805b000 {
+@@ -216,7 +216,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio6: gpio@4805d000 {
+@@ -227,7 +227,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio7: gpio@48051000 {
+@@ -238,7 +238,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ gpio8: gpio@48053000 {
+@@ -249,7 +249,7 @@
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+- #interrupt-cells = <1>;
++ #interrupt-cells = <2>;
+ };
+
+ uart1: serial@4806a000 {
+diff --git a/arch/arm/boot/dts/imx53-qsrb.dts b/arch/arm/boot/dts/imx53-qsrb.dts
+index f1bbf9a32991..82d623d05915 100644
+--- a/arch/arm/boot/dts/imx53-qsrb.dts
++++ b/arch/arm/boot/dts/imx53-qsrb.dts
+@@ -28,6 +28,12 @@
+ MX53_PAD_CSI0_DAT9__I2C1_SCL 0x400001ec
+ >;
+ };
++
++ pinctrl_pmic: pmicgrp {
++ fsl,pins = <
++ MX53_PAD_CSI0_DAT5__GPIO5_23 0x1e4 /* IRQ */
++ >;
++ };
+ };
+ };
+
+@@ -38,6 +44,8 @@
+
+ pmic: mc34708@8 {
+ compatible = "fsl,mc34708";
++ pinctrl-names = "default";
++ pinctrl-0 = <&pinctrl_pmic>;
+ reg = <0x08>;
+ interrupt-parent = <&gpio5>;
+ interrupts = <23 0x8>;
+diff --git a/arch/arm/boot/dts/imx53.dtsi b/arch/arm/boot/dts/imx53.dtsi
+index 6456a0084388..7d42db36d6bb 100644
+--- a/arch/arm/boot/dts/imx53.dtsi
++++ b/arch/arm/boot/dts/imx53.dtsi
+@@ -419,10 +419,14 @@
+ status = "disabled";
+
+ lvds-channel@0 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <0>;
+ status = "disabled";
+
+- port {
++ port@0 {
++ reg = <0>;
++
+ lvds0_in: endpoint {
+ remote-endpoint = <&ipu_di0_lvds0>;
+ };
+@@ -430,10 +434,14 @@
+ };
+
+ lvds-channel@1 {
++ #address-cells = <1>;
++ #size-cells = <0>;
+ reg = <1>;
+ status = "disabled";
+
+- port {
++ port@1 {
++ reg = <1>;
++
+ lvds1_in: endpoint {
+ remote-endpoint = <&ipu_di1_lvds1>;
+ };
+@@ -724,7 +732,7 @@
+ compatible = "fsl,imx53-vpu";
+ reg = <0x63ff4000 0x1000>;
+ interrupts = <9>;
+- clocks = <&clks IMX5_CLK_VPU_GATE>,
++ clocks = <&clks IMX5_CLK_VPU_REFERENCE_GATE>,
+ <&clks IMX5_CLK_VPU_GATE>;
+ clock-names = "per", "ahb";
+ resets = <&src 1>;
+diff --git a/arch/arm/boot/dts/vf610-twr.dts b/arch/arm/boot/dts/vf610-twr.dts
+index 11d733406c7e..b8a5e8c68f06 100644
+--- a/arch/arm/boot/dts/vf610-twr.dts
++++ b/arch/arm/boot/dts/vf610-twr.dts
+@@ -168,7 +168,7 @@
+ };
+
+ pinctrl_esdhc1: esdhc1grp {
+- fsl,fsl,pins = <
++ fsl,pins = <
+ VF610_PAD_PTA24__ESDHC1_CLK 0x31ef
+ VF610_PAD_PTA25__ESDHC1_CMD 0x31ef
+ VF610_PAD_PTA26__ESDHC1_DAT0 0x31ef
+diff --git a/arch/arm/common/edma.c b/arch/arm/common/edma.c
+index 485be42519b9..ea97e14e1f0b 100644
+--- a/arch/arm/common/edma.c
++++ b/arch/arm/common/edma.c
+@@ -1415,14 +1415,14 @@ void edma_clear_event(unsigned channel)
+ EXPORT_SYMBOL(edma_clear_event);
+
+ static int edma_setup_from_hw(struct device *dev, struct edma_soc_info *pdata,
+- struct edma *edma_cc)
++ struct edma *edma_cc, int cc_id)
+ {
+ int i;
+ u32 value, cccfg;
+ s8 (*queue_priority_map)[2];
+
+ /* Decode the eDMA3 configuration from CCCFG register */
+- cccfg = edma_read(0, EDMA_CCCFG);
++ cccfg = edma_read(cc_id, EDMA_CCCFG);
+
+ value = GET_NUM_REGN(cccfg);
+ edma_cc->num_region = BIT(value);
+@@ -1436,7 +1436,8 @@ static int edma_setup_from_hw(struct device *dev, struct edma_soc_info *pdata,
+ value = GET_NUM_EVQUE(cccfg);
+ edma_cc->num_tc = value + 1;
+
+- dev_dbg(dev, "eDMA3 HW configuration (cccfg: 0x%08x):\n", cccfg);
++ dev_dbg(dev, "eDMA3 CC%d HW configuration (cccfg: 0x%08x):\n", cc_id,
++ cccfg);
+ dev_dbg(dev, "num_region: %u\n", edma_cc->num_region);
+ dev_dbg(dev, "num_channel: %u\n", edma_cc->num_channels);
+ dev_dbg(dev, "num_slot: %u\n", edma_cc->num_slots);
+@@ -1655,7 +1656,7 @@ static int edma_probe(struct platform_device *pdev)
+ return -ENOMEM;
+
+ /* Get eDMA3 configuration from IP */
+- ret = edma_setup_from_hw(dev, info[j], edma_cc[j]);
++ ret = edma_setup_from_hw(dev, info[j], edma_cc[j], j);
+ if (ret)
+ return ret;
+
+diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
+index fd43f7f55b70..79ecb4f34ffb 100644
+--- a/arch/arm/include/asm/cacheflush.h
++++ b/arch/arm/include/asm/cacheflush.h
+@@ -472,7 +472,6 @@ static inline void __sync_cache_range_r(volatile void *p, size_t size)
+ "mcr p15, 0, r0, c1, c0, 0 @ set SCTLR \n\t" \
+ "isb \n\t" \
+ "bl v7_flush_dcache_"__stringify(level)" \n\t" \
+- "clrex \n\t" \
+ "mrc p15, 0, r0, c1, c0, 1 @ get ACTLR \n\t" \
+ "bic r0, r0, #(1 << 6) @ disable local coherency \n\t" \
+ "mcr p15, 0, r0, c1, c0, 1 @ set ACTLR \n\t" \
+diff --git a/arch/arm/include/asm/tls.h b/arch/arm/include/asm/tls.h
+index 83259b873333..5f833f7adba1 100644
+--- a/arch/arm/include/asm/tls.h
++++ b/arch/arm/include/asm/tls.h
+@@ -1,6 +1,9 @@
+ #ifndef __ASMARM_TLS_H
+ #define __ASMARM_TLS_H
+
++#include <linux/compiler.h>
++#include <asm/thread_info.h>
++
+ #ifdef __ASSEMBLY__
+ #include <asm/asm-offsets.h>
+ .macro switch_tls_none, base, tp, tpuser, tmp1, tmp2
+@@ -50,6 +53,49 @@
+ #endif
+
+ #ifndef __ASSEMBLY__
++
++static inline void set_tls(unsigned long val)
++{
++ struct thread_info *thread;
++
++ thread = current_thread_info();
++
++ thread->tp_value[0] = val;
++
++ /*
++ * This code runs with preemption enabled and therefore must
++ * be reentrant with respect to switch_tls.
++ *
++ * We need to ensure ordering between the shadow state and the
++ * hardware state, so that we don't corrupt the hardware state
++ * with a stale shadow state during context switch.
++ *
++ * If we're preempted here, switch_tls will load TPIDRURO from
++ * thread_info upon resuming execution and the following mcr
++ * is merely redundant.
++ */
++ barrier();
++
++ if (!tls_emu) {
++ if (has_tls_reg) {
++ asm("mcr p15, 0, %0, c13, c0, 3"
++ : : "r" (val));
++ } else {
++#ifdef CONFIG_KUSER_HELPERS
++ /*
++ * User space must never try to access this
++ * directly. Expect your app to break
++ * eventually if you do so. The user helper
++ * at 0xffff0fe0 must be used instead. (see
++ * entry-armv.S for details)
++ */
++ *((unsigned int *)0xffff0ff0) = val;
++#endif
++ }
++
++ }
++}
++
+ static inline unsigned long get_tpuser(void)
+ {
+ unsigned long reg = 0;
+@@ -59,5 +105,23 @@ static inline unsigned long get_tpuser(void)
+
+ return reg;
+ }
++
++static inline void set_tpuser(unsigned long val)
++{
++ /* Since TPIDRURW is fully context-switched (unlike TPIDRURO),
++ * we need not update thread_info.
++ */
++ if (has_tls_reg && !tls_emu) {
++ asm("mcr p15, 0, %0, c13, c0, 2"
++ : : "r" (val));
++ }
++}
++
++static inline void flush_tls(void)
++{
++ set_tls(0);
++ set_tpuser(0);
++}
++
+ #endif
+ #endif /* __ASMARM_TLS_H */
+diff --git a/arch/arm/kernel/entry-header.S b/arch/arm/kernel/entry-header.S
+index 5d702f8900b1..0325dbf6e762 100644
+--- a/arch/arm/kernel/entry-header.S
++++ b/arch/arm/kernel/entry-header.S
+@@ -208,26 +208,21 @@
+ #endif
+ .endif
+ msr spsr_cxsf, \rpsr
+-#if defined(CONFIG_CPU_V6)
+- ldr r0, [sp]
+- strex r1, r2, [sp] @ clear the exclusive monitor
+- ldmib sp, {r1 - pc}^ @ load r1 - pc, cpsr
+-#elif defined(CONFIG_CPU_32v6K)
+- clrex @ clear the exclusive monitor
+- ldmia sp, {r0 - pc}^ @ load r0 - pc, cpsr
+-#else
+- ldmia sp, {r0 - pc}^ @ load r0 - pc, cpsr
++#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
++ sub r0, sp, #4 @ uninhabited address
++ strex r1, r2, [r0] @ clear the exclusive monitor
+ #endif
++ ldmia sp, {r0 - pc}^ @ load r0 - pc, cpsr
+ .endm
+
+ .macro restore_user_regs, fast = 0, offset = 0
+ ldr r1, [sp, #\offset + S_PSR] @ get calling cpsr
+ ldr lr, [sp, #\offset + S_PC]! @ get pc
+ msr spsr_cxsf, r1 @ save in spsr_svc
+-#if defined(CONFIG_CPU_V6)
++#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
+ strex r1, r2, [sp] @ clear the exclusive monitor
+-#elif defined(CONFIG_CPU_32v6K)
+- clrex @ clear the exclusive monitor
+ #endif
+ .if \fast
+ ldmdb sp, {r1 - lr}^ @ get calling r1 - lr
+@@ -267,7 +262,10 @@
+ .endif
+ ldr lr, [sp, #S_SP] @ top of the stack
+ ldrd r0, r1, [sp, #S_LR] @ calling lr and pc
+- clrex @ clear the exclusive monitor
++
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
++ strex r2, r1, [sp, #S_LR] @ clear the exclusive monitor
++
+ stmdb lr!, {r0, r1, \rpsr} @ calling lr and rfe context
+ ldmia sp, {r0 - r12}
+ mov sp, lr
+@@ -288,13 +286,16 @@
+ .endm
+ #else /* ifdef CONFIG_CPU_V7M */
+ .macro restore_user_regs, fast = 0, offset = 0
+- clrex @ clear the exclusive monitor
+ mov r2, sp
+ load_user_sp_lr r2, r3, \offset + S_SP @ calling sp, lr
+ ldr r1, [sp, #\offset + S_PSR] @ get calling cpsr
+ ldr lr, [sp, #\offset + S_PC] @ get pc
+ add sp, sp, #\offset + S_SP
+ msr spsr_cxsf, r1 @ save in spsr_svc
++
++ @ We must avoid clrex due to Cortex-A15 erratum #830321
++ strex r1, r2, [sp] @ clear the exclusive monitor
++
+ .if \fast
+ ldmdb sp, {r1 - r12} @ get calling r1 - r12
+ .else
+diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
+index 2c4257604513..5c4d38e32a51 100644
+--- a/arch/arm/kernel/irq.c
++++ b/arch/arm/kernel/irq.c
+@@ -175,7 +175,7 @@ static bool migrate_one_irq(struct irq_desc *desc)
+ c = irq_data_get_irq_chip(d);
+ if (!c->irq_set_affinity)
+ pr_debug("IRQ%u: unable to set affinity\n", d->irq);
+- else if (c->irq_set_affinity(d, affinity, true) == IRQ_SET_MASK_OK && ret)
++ else if (c->irq_set_affinity(d, affinity, false) == IRQ_SET_MASK_OK && ret)
+ cpumask_copy(d->affinity, affinity);
+
+ return ret;
+diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
+index af9e35e8836f..290ad8170d7a 100644
+--- a/arch/arm/kernel/perf_event_cpu.c
++++ b/arch/arm/kernel/perf_event_cpu.c
+@@ -76,21 +76,15 @@ static struct pmu_hw_events *cpu_pmu_get_cpu_events(void)
+
+ static void cpu_pmu_enable_percpu_irq(void *data)
+ {
+- struct arm_pmu *cpu_pmu = data;
+- struct platform_device *pmu_device = cpu_pmu->plat_device;
+- int irq = platform_get_irq(pmu_device, 0);
++ int irq = *(int *)data;
+
+ enable_percpu_irq(irq, IRQ_TYPE_NONE);
+- cpumask_set_cpu(smp_processor_id(), &cpu_pmu->active_irqs);
+ }
+
+ static void cpu_pmu_disable_percpu_irq(void *data)
+ {
+- struct arm_pmu *cpu_pmu = data;
+- struct platform_device *pmu_device = cpu_pmu->plat_device;
+- int irq = platform_get_irq(pmu_device, 0);
++ int irq = *(int *)data;
+
+- cpumask_clear_cpu(smp_processor_id(), &cpu_pmu->active_irqs);
+ disable_percpu_irq(irq);
+ }
+
+@@ -103,7 +97,7 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu)
+
+ irq = platform_get_irq(pmu_device, 0);
+ if (irq >= 0 && irq_is_percpu(irq)) {
+- on_each_cpu(cpu_pmu_disable_percpu_irq, cpu_pmu, 1);
++ on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
+ free_percpu_irq(irq, &percpu_pmu);
+ } else {
+ for (i = 0; i < irqs; ++i) {
+@@ -138,7 +132,7 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, irq_handler_t handler)
+ irq);
+ return err;
+ }
+- on_each_cpu(cpu_pmu_enable_percpu_irq, cpu_pmu, 1);
++ on_each_cpu(cpu_pmu_enable_percpu_irq, &irq, 1);
+ } else {
+ for (i = 0; i < irqs; ++i) {
+ err = 0;
+diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
+index 1d37568c547a..ac8dc747264c 100644
+--- a/arch/arm/kernel/perf_event_v7.c
++++ b/arch/arm/kernel/perf_event_v7.c
+@@ -157,6 +157,7 @@ static const unsigned armv7_a8_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = ARMV7_A8_PERFCTR_STALL_ISIDE,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a8_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -281,6 +282,7 @@ static const unsigned armv7_a9_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = ARMV7_A9_PERFCTR_STALL_ICACHE,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = ARMV7_A9_PERFCTR_STALL_DISPATCH,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a9_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -405,6 +407,7 @@ static const unsigned armv7_a5_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a5_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -527,6 +530,7 @@ static const unsigned armv7_a15_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = ARMV7_PERFCTR_BUS_CYCLES,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a15_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+@@ -651,6 +655,7 @@ static const unsigned armv7_a7_perf_map[PERF_COUNT_HW_MAX] = {
+ [PERF_COUNT_HW_BUS_CYCLES] = ARMV7_PERFCTR_BUS_CYCLES,
+ [PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = HW_OP_UNSUPPORTED,
+ [PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = HW_OP_UNSUPPORTED,
++ [PERF_COUNT_HW_REF_CPU_CYCLES] = HW_OP_UNSUPPORTED,
+ };
+
+ static const unsigned armv7_a7_perf_cache_map[PERF_COUNT_HW_CACHE_MAX]
+diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
+index 81ef686a91ca..a35f6ebbd2c2 100644
+--- a/arch/arm/kernel/process.c
++++ b/arch/arm/kernel/process.c
+@@ -334,6 +334,8 @@ void flush_thread(void)
+ memset(&tsk->thread.debug, 0, sizeof(struct debug_info));
+ memset(&thread->fpstate, 0, sizeof(union fp_state));
+
++ flush_tls();
++
+ thread_notify(THREAD_NOTIFY_FLUSH, thread);
+ }
+
+diff --git a/arch/arm/kernel/thumbee.c b/arch/arm/kernel/thumbee.c
+index 7b8403b76666..80f0d69205e7 100644
+--- a/arch/arm/kernel/thumbee.c
++++ b/arch/arm/kernel/thumbee.c
+@@ -45,7 +45,7 @@ static int thumbee_notifier(struct notifier_block *self, unsigned long cmd, void
+
+ switch (cmd) {
+ case THREAD_NOTIFY_FLUSH:
+- thread->thumbee_state = 0;
++ teehbr_write(0);
+ break;
+ case THREAD_NOTIFY_SWITCH:
+ current_thread_info()->thumbee_state = teehbr_read();
+diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
+index abd2fc067736..da11b28a72da 100644
+--- a/arch/arm/kernel/traps.c
++++ b/arch/arm/kernel/traps.c
+@@ -579,7 +579,6 @@ do_cache_op(unsigned long start, unsigned long end, int flags)
+ #define NR(x) ((__ARM_NR_##x) - __ARM_NR_BASE)
+ asmlinkage int arm_syscall(int no, struct pt_regs *regs)
+ {
+- struct thread_info *thread = current_thread_info();
+ siginfo_t info;
+
+ if ((no >> 16) != (__ARM_NR_BASE>> 16))
+@@ -630,21 +629,7 @@ asmlinkage int arm_syscall(int no, struct pt_regs *regs)
+ return regs->ARM_r0;
+
+ case NR(set_tls):
+- thread->tp_value[0] = regs->ARM_r0;
+- if (tls_emu)
+- return 0;
+- if (has_tls_reg) {
+- asm ("mcr p15, 0, %0, c13, c0, 3"
+- : : "r" (regs->ARM_r0));
+- } else {
+- /*
+- * User space must never try to access this directly.
+- * Expect your app to break eventually if you do so.
+- * The user helper at 0xffff0fe0 must be used instead.
+- * (see entry-armv.S for details)
+- */
+- *((unsigned int *)0xffff0ff0) = regs->ARM_r0;
+- }
++ set_tls(regs->ARM_r0);
+ return 0;
+
+ #ifdef CONFIG_NEEDS_SYSCALL_FOR_CMPXCHG
+diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
+index 4c979d466cc1..a96a8043277c 100644
+--- a/arch/arm/kvm/handle_exit.c
++++ b/arch/arm/kvm/handle_exit.c
+@@ -93,6 +93,8 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
+ else
+ kvm_vcpu_block(vcpu);
+
++ kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
++
+ return 1;
+ }
+
+diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
+index 1b9844d369cc..ee4f7447a1d3 100644
+--- a/arch/arm/kvm/init.S
++++ b/arch/arm/kvm/init.S
+@@ -98,6 +98,10 @@ __do_hyp_init:
+ mrc p15, 0, r0, c10, c2, 1
+ mcr p15, 4, r0, c10, c2, 1
+
++ @ Invalidate the stale TLBs from Bootloader
++ mcr p15, 4, r0, c8, c7, 0 @ TLBIALLH
++ dsb ish
++
+ @ Set the HSCTLR to:
+ @ - ARM/THUMB exceptions: Kernel config (Thumb-2 kernel)
+ @ - Endianness: Kernel config
+diff --git a/arch/arm/mach-exynos/mcpm-exynos.c b/arch/arm/mach-exynos/mcpm-exynos.c
+index ace0ed617476..25ef73278a26 100644
+--- a/arch/arm/mach-exynos/mcpm-exynos.c
++++ b/arch/arm/mach-exynos/mcpm-exynos.c
+@@ -39,7 +39,6 @@
+ "mcr p15, 0, r0, c1, c0, 0 @ set SCTLR\n\t" \
+ "isb\n\t"\
+ "bl v7_flush_dcache_"__stringify(level)"\n\t" \
+- "clrex\n\t"\
+ "mrc p15, 0, r0, c1, c0, 1 @ get ACTLR\n\t" \
+ "bic r0, r0, #(1 << 6) @ disable local coherency\n\t" \
+ /* Dummy Load of a device register to avoid Erratum 799270 */ \
+diff --git a/arch/arm/mach-imx/clk-gate2.c b/arch/arm/mach-imx/clk-gate2.c
+index 84acdfd1d715..5a75cdc81891 100644
+--- a/arch/arm/mach-imx/clk-gate2.c
++++ b/arch/arm/mach-imx/clk-gate2.c
+@@ -97,7 +97,7 @@ static int clk_gate2_is_enabled(struct clk_hw *hw)
+ struct clk_gate2 *gate = to_clk_gate2(hw);
+
+ if (gate->share_count)
+- return !!(*gate->share_count);
++ return !!__clk_get_enable_count(hw->clk);
+ else
+ return clk_gate2_reg_is_enabled(gate->reg, gate->bit_idx);
+ }
+@@ -127,10 +127,6 @@ struct clk *clk_register_gate2(struct device *dev, const char *name,
+ gate->bit_idx = bit_idx;
+ gate->flags = clk_gate2_flags;
+ gate->lock = lock;
+-
+- /* Initialize share_count per hardware state */
+- if (share_count)
+- *share_count = clk_gate2_reg_is_enabled(reg, bit_idx) ? 1 : 0;
+ gate->share_count = share_count;
+
+ init.name = name;
+diff --git a/arch/arm/mach-imx/suspend-imx6.S b/arch/arm/mach-imx/suspend-imx6.S
+index fe123b079c05..87bdf7a629a5 100644
+--- a/arch/arm/mach-imx/suspend-imx6.S
++++ b/arch/arm/mach-imx/suspend-imx6.S
+@@ -172,6 +172,8 @@ ENTRY(imx6_suspend)
+ ldr r6, [r11, #0x0]
+ ldr r11, [r0, #PM_INFO_MX6Q_GPC_V_OFFSET]
+ ldr r6, [r11, #0x0]
++ ldr r11, [r0, #PM_INFO_MX6Q_IOMUXC_V_OFFSET]
++ ldr r6, [r11, #0x0]
+
+ /* use r11 to store the IO address */
+ ldr r11, [r0, #PM_INFO_MX6Q_SRC_V_OFFSET]
+diff --git a/arch/arm/mach-omap2/omap_hwmod.c b/arch/arm/mach-omap2/omap_hwmod.c
+index da1b256caccc..8fd87a3055bf 100644
+--- a/arch/arm/mach-omap2/omap_hwmod.c
++++ b/arch/arm/mach-omap2/omap_hwmod.c
+@@ -3349,6 +3349,9 @@ int __init omap_hwmod_register_links(struct omap_hwmod_ocp_if **ois)
+ if (!ois)
+ return 0;
+
++ if (ois[0] == NULL) /* Empty list */
++ return 0;
++
+ if (!linkspace) {
+ if (_alloc_linkspace(ois)) {
+ pr_err("omap_hwmod: could not allocate link space\n");
+diff --git a/arch/arm/mach-omap2/omap_hwmod_7xx_data.c b/arch/arm/mach-omap2/omap_hwmod_7xx_data.c
+index 284324f2b98a..c95033c1029b 100644
+--- a/arch/arm/mach-omap2/omap_hwmod_7xx_data.c
++++ b/arch/arm/mach-omap2/omap_hwmod_7xx_data.c
+@@ -35,6 +35,7 @@
+ #include "i2c.h"
+ #include "mmc.h"
+ #include "wd_timer.h"
++#include "soc.h"
+
+ /* Base offset for all DRA7XX interrupts external to MPUSS */
+ #define DRA7XX_IRQ_GIC_START 32
+@@ -2705,7 +2706,6 @@ static struct omap_hwmod_ocp_if *dra7xx_hwmod_ocp_ifs[] __initdata = {
+ &dra7xx_l4_per3__usb_otg_ss1,
+ &dra7xx_l4_per3__usb_otg_ss2,
+ &dra7xx_l4_per3__usb_otg_ss3,
+- &dra7xx_l4_per3__usb_otg_ss4,
+ &dra7xx_l3_main_1__vcp1,
+ &dra7xx_l4_per2__vcp1,
+ &dra7xx_l3_main_1__vcp2,
+@@ -2714,8 +2714,26 @@ static struct omap_hwmod_ocp_if *dra7xx_hwmod_ocp_ifs[] __initdata = {
+ NULL,
+ };
+
++static struct omap_hwmod_ocp_if *dra74x_hwmod_ocp_ifs[] __initdata = {
++ &dra7xx_l4_per3__usb_otg_ss4,
++ NULL,
++};
++
++static struct omap_hwmod_ocp_if *dra72x_hwmod_ocp_ifs[] __initdata = {
++ NULL,
++};
++
+ int __init dra7xx_hwmod_init(void)
+ {
++ int ret;
++
+ omap_hwmod_init();
+- return omap_hwmod_register_links(dra7xx_hwmod_ocp_ifs);
++ ret = omap_hwmod_register_links(dra7xx_hwmod_ocp_ifs);
++
++ if (!ret && soc_is_dra74x())
++ return omap_hwmod_register_links(dra74x_hwmod_ocp_ifs);
++ else if (!ret && soc_is_dra72x())
++ return omap_hwmod_register_links(dra72x_hwmod_ocp_ifs);
++
++ return ret;
+ }
+diff --git a/arch/arm/mach-omap2/soc.h b/arch/arm/mach-omap2/soc.h
+index 01ca8086fb6c..4376f59626d1 100644
+--- a/arch/arm/mach-omap2/soc.h
++++ b/arch/arm/mach-omap2/soc.h
+@@ -245,6 +245,8 @@ IS_AM_SUBCLASS(437x, 0x437)
+ #define soc_is_omap54xx() 0
+ #define soc_is_omap543x() 0
+ #define soc_is_dra7xx() 0
++#define soc_is_dra74x() 0
++#define soc_is_dra72x() 0
+
+ #if defined(MULTI_OMAP2)
+ # if defined(CONFIG_ARCH_OMAP2)
+@@ -393,7 +395,11 @@ IS_OMAP_TYPE(3430, 0x3430)
+
+ #if defined(CONFIG_SOC_DRA7XX)
+ #undef soc_is_dra7xx
++#undef soc_is_dra74x
++#undef soc_is_dra72x
+ #define soc_is_dra7xx() (of_machine_is_compatible("ti,dra7"))
++#define soc_is_dra74x() (of_machine_is_compatible("ti,dra74"))
++#define soc_is_dra72x() (of_machine_is_compatible("ti,dra72"))
+ #endif
+
+ /* Various silicon revisions for omap2 */
+diff --git a/arch/arm/mm/abort-ev6.S b/arch/arm/mm/abort-ev6.S
+index 3815a8262af0..8c48c5c22a33 100644
+--- a/arch/arm/mm/abort-ev6.S
++++ b/arch/arm/mm/abort-ev6.S
+@@ -17,12 +17,6 @@
+ */
+ .align 5
+ ENTRY(v6_early_abort)
+-#ifdef CONFIG_CPU_V6
+- sub r1, sp, #4 @ Get unused stack location
+- strex r0, r1, [r1] @ Clear the exclusive monitor
+-#elif defined(CONFIG_CPU_32v6K)
+- clrex
+-#endif
+ mrc p15, 0, r1, c5, c0, 0 @ get FSR
+ mrc p15, 0, r0, c6, c0, 0 @ get FAR
+ /*
+diff --git a/arch/arm/mm/abort-ev7.S b/arch/arm/mm/abort-ev7.S
+index 703375277ba6..4812ad054214 100644
+--- a/arch/arm/mm/abort-ev7.S
++++ b/arch/arm/mm/abort-ev7.S
+@@ -13,12 +13,6 @@
+ */
+ .align 5
+ ENTRY(v7_early_abort)
+- /*
+- * The effect of data aborts on on the exclusive access monitor are
+- * UNPREDICTABLE. Do a CLREX to clear the state
+- */
+- clrex
+-
+ mrc p15, 0, r1, c5, c0, 0 @ get FSR
+ mrc p15, 0, r0, c6, c0, 0 @ get FAR
+
+diff --git a/arch/arm/mm/alignment.c b/arch/arm/mm/alignment.c
+index b8cb1a2688a0..33ca98085cd5 100644
+--- a/arch/arm/mm/alignment.c
++++ b/arch/arm/mm/alignment.c
+@@ -41,6 +41,7 @@
+ * This code is not portable to processors with late data abort handling.
+ */
+ #define CODING_BITS(i) (i & 0x0e000000)
++#define COND_BITS(i) (i & 0xf0000000)
+
+ #define LDST_I_BIT(i) (i & (1 << 26)) /* Immediate constant */
+ #define LDST_P_BIT(i) (i & (1 << 24)) /* Preindex */
+@@ -819,6 +820,8 @@ do_alignment(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
+ break;
+
+ case 0x04000000: /* ldr or str immediate */
++ if (COND_BITS(instr) == 0xf0000000) /* NEON VLDn, VSTn */
++ goto bad;
+ offset.un = OFFSET_BITS(instr);
+ handler = do_alignment_ldrstr;
+ break;
+diff --git a/arch/arm64/include/asm/hw_breakpoint.h b/arch/arm64/include/asm/hw_breakpoint.h
+index d064047612b1..52b484b6aa1a 100644
+--- a/arch/arm64/include/asm/hw_breakpoint.h
++++ b/arch/arm64/include/asm/hw_breakpoint.h
+@@ -79,7 +79,6 @@ static inline void decode_ctrl_reg(u32 reg,
+ */
+ #define ARM_MAX_BRP 16
+ #define ARM_MAX_WRP 16
+-#define ARM_MAX_HBP_SLOTS (ARM_MAX_BRP + ARM_MAX_WRP)
+
+ /* Virtual debug register bases. */
+ #define AARCH64_DBG_REG_BVR 0
+diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
+index 501000fadb6f..41ed9e13795e 100644
+--- a/arch/arm64/include/asm/ptrace.h
++++ b/arch/arm64/include/asm/ptrace.h
+@@ -137,7 +137,7 @@ struct pt_regs {
+ (!((regs)->pstate & PSR_F_BIT))
+
+ #define user_stack_pointer(regs) \
+- (!compat_user_mode(regs)) ? ((regs)->sp) : ((regs)->compat_sp)
++ (!compat_user_mode(regs) ? (regs)->sp : (regs)->compat_sp)
+
+ static inline unsigned long regs_return_value(struct pt_regs *regs)
+ {
+diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c
+index 0f08dfd69ebc..dfa6e3e74fdd 100644
+--- a/arch/arm64/kernel/irq.c
++++ b/arch/arm64/kernel/irq.c
+@@ -97,19 +97,15 @@ static bool migrate_one_irq(struct irq_desc *desc)
+ if (irqd_is_per_cpu(d) || !cpumask_test_cpu(smp_processor_id(), affinity))
+ return false;
+
+- if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids)
++ if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
++ affinity = cpu_online_mask;
+ ret = true;
++ }
+
+- /*
+- * when using forced irq_set_affinity we must ensure that the cpu
+- * being offlined is not present in the affinity mask, it may be
+- * selected as the target CPU otherwise
+- */
+- affinity = cpu_online_mask;
+ c = irq_data_get_irq_chip(d);
+ if (!c->irq_set_affinity)
+ pr_debug("IRQ%u: unable to set affinity\n", d->irq);
+- else if (c->irq_set_affinity(d, affinity, true) == IRQ_SET_MASK_OK && ret)
++ else if (c->irq_set_affinity(d, affinity, false) == IRQ_SET_MASK_OK && ret)
+ cpumask_copy(d->affinity, affinity);
+
+ return ret;
+diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
+index 43b7c34f92cb..7b0827ae402d 100644
+--- a/arch/arm64/kernel/process.c
++++ b/arch/arm64/kernel/process.c
+@@ -224,9 +224,27 @@ void exit_thread(void)
+ {
+ }
+
++static void tls_thread_flush(void)
++{
++ asm ("msr tpidr_el0, xzr");
++
++ if (is_compat_task()) {
++ current->thread.tp_value = 0;
++
++ /*
++ * We need to ensure ordering between the shadow state and the
++ * hardware state, so that we don't corrupt the hardware state
++ * with a stale shadow state during context switch.
++ */
++ barrier();
++ asm ("msr tpidrro_el0, xzr");
++ }
++}
++
+ void flush_thread(void)
+ {
+ fpsimd_flush_thread();
++ tls_thread_flush();
+ flush_ptrace_hw_breakpoint(current);
+ }
+
+diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
+index 9fde010c945f..167c5edecad4 100644
+--- a/arch/arm64/kernel/ptrace.c
++++ b/arch/arm64/kernel/ptrace.c
+@@ -85,7 +85,8 @@ static void ptrace_hbptriggered(struct perf_event *bp,
+ break;
+ }
+ }
+- for (i = ARM_MAX_BRP; i < ARM_MAX_HBP_SLOTS && !bp; ++i) {
++
++ for (i = 0; i < ARM_MAX_WRP; ++i) {
+ if (current->thread.debug.hbp_watch[i] == bp) {
+ info.si_errno = -((i << 1) + 1);
+ break;
+diff --git a/arch/arm64/kernel/sys_compat.c b/arch/arm64/kernel/sys_compat.c
+index 26e9c4eeaba8..78039927c807 100644
+--- a/arch/arm64/kernel/sys_compat.c
++++ b/arch/arm64/kernel/sys_compat.c
+@@ -79,6 +79,12 @@ long compat_arm_syscall(struct pt_regs *regs)
+
+ case __ARM_NR_compat_set_tls:
+ current->thread.tp_value = regs->regs[0];
++
++ /*
++ * Protect against register corruption from context switch.
++ * See comment in tls_thread_flush.
++ */
++ barrier();
+ asm ("msr tpidrro_el0, %0" : : "r" (regs->regs[0]));
+ return 0;
+
+diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
+index 182415e1a952..2ca885c3eb0f 100644
+--- a/arch/arm64/kvm/handle_exit.c
++++ b/arch/arm64/kvm/handle_exit.c
+@@ -66,6 +66,8 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
+ else
+ kvm_vcpu_block(vcpu);
+
++ kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));
++
+ return 1;
+ }
+
+diff --git a/arch/arm64/kvm/hyp-init.S b/arch/arm64/kvm/hyp-init.S
+index d968796f4b2d..c3191168a994 100644
+--- a/arch/arm64/kvm/hyp-init.S
++++ b/arch/arm64/kvm/hyp-init.S
+@@ -80,6 +80,10 @@ __do_hyp_init:
+ msr mair_el2, x4
+ isb
+
++ /* Invalidate the stale TLBs from Bootloader */
++ tlbi alle2
++ dsb sy
++
+ mrs x4, sctlr_el2
+ and x4, x4, #SCTLR_EL2_EE // preserve endianness of EL2
+ ldr x5, =SCTLR_EL2_FLAGS
+diff --git a/arch/mips/boot/compressed/decompress.c b/arch/mips/boot/compressed/decompress.c
+index c00c4ddf4514..5244cecf1e45 100644
+--- a/arch/mips/boot/compressed/decompress.c
++++ b/arch/mips/boot/compressed/decompress.c
+@@ -13,6 +13,7 @@
+
+ #include <linux/types.h>
+ #include <linux/kernel.h>
++#include <linux/string.h>
+
+ #include <asm/addrspace.h>
+
+diff --git a/arch/mips/kernel/mcount.S b/arch/mips/kernel/mcount.S
+index 539b6294b613..8f89ff4ed524 100644
+--- a/arch/mips/kernel/mcount.S
++++ b/arch/mips/kernel/mcount.S
+@@ -123,7 +123,11 @@ NESTED(_mcount, PT_SIZE, ra)
+ nop
+ #endif
+ b ftrace_stub
++#ifdef CONFIG_32BIT
++ addiu sp, sp, 8
++#else
+ nop
++#endif
+
+ static_trace:
+ MCOUNT_SAVE_REGS
+@@ -133,6 +137,9 @@ static_trace:
+ move a1, AT /* arg2: parent's return address */
+
+ MCOUNT_RESTORE_REGS
++#ifdef CONFIG_32BIT
++ addiu sp, sp, 8
++#endif
+ .globl ftrace_stub
+ ftrace_stub:
+ RETURN_BACK
+@@ -177,6 +184,11 @@ NESTED(ftrace_graph_caller, PT_SIZE, ra)
+ jal prepare_ftrace_return
+ nop
+ MCOUNT_RESTORE_REGS
++#ifndef CONFIG_DYNAMIC_FTRACE
++#ifdef CONFIG_32BIT
++ addiu sp, sp, 8
++#endif
++#endif
+ RETURN_BACK
+ END(ftrace_graph_caller)
+
+diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
+index bf0fc6b16ad9..7a4727795a70 100644
+--- a/arch/mips/math-emu/cp1emu.c
++++ b/arch/mips/math-emu/cp1emu.c
+@@ -650,9 +650,9 @@ static inline int cop1_64bit(struct pt_regs *xcp)
+ #define SIFROMREG(si, x) \
+ do { \
+ if (cop1_64bit(xcp)) \
+- (si) = get_fpr32(&ctx->fpr[x], 0); \
++ (si) = (int)get_fpr32(&ctx->fpr[x], 0); \
+ else \
+- (si) = get_fpr32(&ctx->fpr[(x) & ~1], (x) & 1); \
++ (si) = (int)get_fpr32(&ctx->fpr[(x) & ~1], (x) & 1); \
+ } while (0)
+
+ #define SITOREG(si, x) \
+@@ -667,7 +667,7 @@ do { \
+ } \
+ } while (0)
+
+-#define SIFROMHREG(si, x) ((si) = get_fpr32(&ctx->fpr[x], 1))
++#define SIFROMHREG(si, x) ((si) = (int)get_fpr32(&ctx->fpr[x], 1))
+
+ #define SITOHREG(si, x) \
+ do { \
+diff --git a/arch/parisc/Makefile b/arch/parisc/Makefile
+index 7187664034c3..5db8882f732c 100644
+--- a/arch/parisc/Makefile
++++ b/arch/parisc/Makefile
+@@ -48,7 +48,12 @@ cflags-y := -pipe
+
+ # These flags should be implied by an hppa-linux configuration, but they
+ # are not in gcc 3.2.
+-cflags-y += -mno-space-regs -mfast-indirect-calls
++cflags-y += -mno-space-regs
++
++# -mfast-indirect-calls is only relevant for 32-bit kernels.
++ifndef CONFIG_64BIT
++cflags-y += -mfast-indirect-calls
++endif
+
+ # Currently we save and restore fpregs on all kernel entry/interruption paths.
+ # If that gets optimized, we might need to disable the use of fpregs in the
+diff --git a/arch/parisc/kernel/syscall.S b/arch/parisc/kernel/syscall.S
+index 838786011037..7ef22e3387e0 100644
+--- a/arch/parisc/kernel/syscall.S
++++ b/arch/parisc/kernel/syscall.S
+@@ -74,7 +74,7 @@ ENTRY(linux_gateway_page)
+ /* ADDRESS 0xb0 to 0xb8, lws uses two insns for entry */
+ /* Light-weight-syscall entry must always be located at 0xb0 */
+ /* WARNING: Keep this number updated with table size changes */
+-#define __NR_lws_entries (2)
++#define __NR_lws_entries (3)
+
+ lws_entry:
+ gate lws_start, %r0 /* increase privilege */
+@@ -502,7 +502,7 @@ lws_exit:
+
+
+ /***************************************************
+- Implementing CAS as an atomic operation:
++ Implementing 32bit CAS as an atomic operation:
+
+ %r26 - Address to examine
+ %r25 - Old value to check (old)
+@@ -659,6 +659,230 @@ cas_action:
+ ASM_EXCEPTIONTABLE_ENTRY(2b-linux_gateway_page, 3b-linux_gateway_page)
+
+
++ /***************************************************
++ New CAS implementation which uses pointers and variable size
++ information. The values pointed to by old and new MUST NOT change
++ while performing CAS. The lock only protects the value at %r26.
++
++ %r26 - Address to examine
++ %r25 - Pointer to the value to check (old)
++ %r24 - Pointer to the value to set (new)
++ %r23 - Size of the variable (0/1/2/3 for 8/16/32/64 bit)
++ %r28 - Return non-zero on failure
++ %r21 - Kernel error code
++
++ %r21 has the following meanings:
++
++ EAGAIN - CAS is busy, ldcw failed, try again.
++ EFAULT - Read or write failed.
++
++ Scratch: r20, r22, r28, r29, r1, fr4 (32bit for 64bit CAS only)
++
++ ****************************************************/
++
++ /* ELF32 Process entry path */
++lws_compare_and_swap_2:
++#ifdef CONFIG_64BIT
++ /* Clip the input registers */
++ depdi 0, 31, 32, %r26
++ depdi 0, 31, 32, %r25
++ depdi 0, 31, 32, %r24
++ depdi 0, 31, 32, %r23
++#endif
++
++ /* Check the validity of the size pointer */
++ subi,>>= 4, %r23, %r0
++ b,n lws_exit_nosys
++
++ /* Jump to the functions which will load the old and new values into
++ registers depending on their size */
++ shlw %r23, 2, %r29
++ blr %r29, %r0
++ nop
++
++ /* 8bit load */
++4: ldb 0(%sr3,%r25), %r25
++ b cas2_lock_start
++5: ldb 0(%sr3,%r24), %r24
++ nop
++ nop
++ nop
++ nop
++ nop
++
++ /* 16bit load */
++6: ldh 0(%sr3,%r25), %r25
++ b cas2_lock_start
++7: ldh 0(%sr3,%r24), %r24
++ nop
++ nop
++ nop
++ nop
++ nop
++
++ /* 32bit load */
++8: ldw 0(%sr3,%r25), %r25
++ b cas2_lock_start
++9: ldw 0(%sr3,%r24), %r24
++ nop
++ nop
++ nop
++ nop
++ nop
++
++ /* 64bit load */
++#ifdef CONFIG_64BIT
++10: ldd 0(%sr3,%r25), %r25
++11: ldd 0(%sr3,%r24), %r24
++#else
++ /* Load old value into r22/r23 - high/low */
++10: ldw 0(%sr3,%r25), %r22
++11: ldw 4(%sr3,%r25), %r23
++ /* Load new value into fr4 for atomic store later */
++12: flddx 0(%sr3,%r24), %fr4
++#endif
++
++cas2_lock_start:
++ /* Load start of lock table */
++ ldil L%lws_lock_start, %r20
++ ldo R%lws_lock_start(%r20), %r28
++
++ /* Extract four bits from r26 and hash lock (Bits 4-7) */
++ extru %r26, 27, 4, %r20
++
++ /* Find lock to use, the hash is either one of 0 to
++ 15, multiplied by 16 (keep it 16-byte aligned)
++ and add to the lock table offset. */
++ shlw %r20, 4, %r20
++ add %r20, %r28, %r20
++
++ rsm PSW_SM_I, %r0 /* Disable interrupts */
++ /* COW breaks can cause contention on UP systems */
++ LDCW 0(%sr2,%r20), %r28 /* Try to acquire the lock */
++ cmpb,<>,n %r0, %r28, cas2_action /* Did we get it? */
++cas2_wouldblock:
++ ldo 2(%r0), %r28 /* 2nd case */
++ ssm PSW_SM_I, %r0
++ b lws_exit /* Contended... */
++ ldo -EAGAIN(%r0), %r21 /* Spin in userspace */
++
++ /*
++ prev = *addr;
++ if ( prev == old )
++ *addr = new;
++ return prev;
++ */
++
++ /* NOTES:
++ This all works because intr_do_signal
++ and schedule both check the return iasq
++ and see that we are on the kernel page
++ so this process is never scheduled off
++ nor is it ever sent any signal of any sort,
++ thus it is wholly atomic from userspace's
++ perspective
++ */
++cas2_action:
++ /* Jump to the correct function */
++ blr %r29, %r0
++ /* Set %r28 as non-zero for now */
++ ldo 1(%r0),%r28
++
++ /* 8bit CAS */
++13: ldb,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++14: stb,ma %r24, 0(%sr3,%r26)
++ b cas2_end
++ copy %r0, %r28
++ nop
++ nop
++
++ /* 16bit CAS */
++15: ldh,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++16: sth,ma %r24, 0(%sr3,%r26)
++ b cas2_end
++ copy %r0, %r28
++ nop
++ nop
++
++ /* 32bit CAS */
++17: ldw,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++18: stw,ma %r24, 0(%sr3,%r26)
++ b cas2_end
++ copy %r0, %r28
++ nop
++ nop
++
++ /* 64bit CAS */
++#ifdef CONFIG_64BIT
++19: ldd,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r25, %r0
++ b,n cas2_end
++20: std,ma %r24, 0(%sr3,%r26)
++ copy %r0, %r28
++#else
++ /* Compare first word */
++19: ldw,ma 0(%sr3,%r26), %r29
++ sub,= %r29, %r22, %r0
++ b,n cas2_end
++ /* Compare second word */
++20: ldw,ma 4(%sr3,%r26), %r29
++ sub,= %r29, %r23, %r0
++ b,n cas2_end
++ /* Perform the store */
++21: fstdx %fr4, 0(%sr3,%r26)
++ copy %r0, %r28
++#endif
++
++cas2_end:
++ /* Free lock */
++ stw,ma %r20, 0(%sr2,%r20)
++ /* Enable interrupts */
++ ssm PSW_SM_I, %r0
++ /* Return to userspace, set no error */
++ b lws_exit
++ copy %r0, %r21
++
++22:
++ /* Error occurred on load or store */
++ /* Free lock */
++ stw %r20, 0(%sr2,%r20)
++ ssm PSW_SM_I, %r0
++ ldo 1(%r0),%r28
++ b lws_exit
++ ldo -EFAULT(%r0),%r21 /* set errno */
++ nop
++ nop
++ nop
++
++ /* Exception table entries, for the load and store, return EFAULT.
++ Each of the entries must be relocated. */
++ ASM_EXCEPTIONTABLE_ENTRY(4b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(5b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(6b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(7b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(8b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(9b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(10b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(11b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(13b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(14b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(15b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(16b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(17b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(18b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(19b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(20b-linux_gateway_page, 22b-linux_gateway_page)
++#ifndef CONFIG_64BIT
++ ASM_EXCEPTIONTABLE_ENTRY(12b-linux_gateway_page, 22b-linux_gateway_page)
++ ASM_EXCEPTIONTABLE_ENTRY(21b-linux_gateway_page, 22b-linux_gateway_page)
++#endif
++
+ /* Make sure nothing else is placed on this page */
+ .align PAGE_SIZE
+ END(linux_gateway_page)
+@@ -675,8 +899,9 @@ ENTRY(end_linux_gateway_page)
+ /* Light-weight-syscall table */
+ /* Start of lws table. */
+ ENTRY(lws_table)
+- LWS_ENTRY(compare_and_swap32) /* 0 - ELF32 Atomic compare and swap */
+- LWS_ENTRY(compare_and_swap64) /* 1 - ELF64 Atomic compare and swap */
++ LWS_ENTRY(compare_and_swap32) /* 0 - ELF32 Atomic 32bit CAS */
++ LWS_ENTRY(compare_and_swap64) /* 1 - ELF64 Atomic 32bit CAS */
++ LWS_ENTRY(compare_and_swap_2) /* 2 - ELF32 Atomic 64bit CAS */
+ END(lws_table)
+ /* End of lws table */
+
+diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
+index 279b80f3bb29..c0c61fa9cd9e 100644
+--- a/arch/powerpc/include/asm/ptrace.h
++++ b/arch/powerpc/include/asm/ptrace.h
+@@ -47,6 +47,12 @@
+ STACK_FRAME_OVERHEAD + KERNEL_REDZONE_SIZE)
+ #define STACK_FRAME_MARKER 12
+
++#if defined(_CALL_ELF) && _CALL_ELF == 2
++#define STACK_FRAME_MIN_SIZE 32
++#else
++#define STACK_FRAME_MIN_SIZE STACK_FRAME_OVERHEAD
++#endif
++
+ /* Size of dummy stack frame allocated when calling signal handler. */
+ #define __SIGNAL_FRAMESIZE 128
+ #define __SIGNAL_FRAMESIZE32 64
+@@ -60,6 +66,7 @@
+ #define STACK_FRAME_REGS_MARKER ASM_CONST(0x72656773)
+ #define STACK_INT_FRAME_SIZE (sizeof(struct pt_regs) + STACK_FRAME_OVERHEAD)
+ #define STACK_FRAME_MARKER 2
++#define STACK_FRAME_MIN_SIZE STACK_FRAME_OVERHEAD
+
+ /* Size of stack frame allocated when calling signal handler. */
+ #define __SIGNAL_FRAMESIZE 64
+diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
+index 35aa339410bd..4dbe072eecbe 100644
+--- a/arch/powerpc/include/asm/spinlock.h
++++ b/arch/powerpc/include/asm/spinlock.h
+@@ -61,6 +61,7 @@ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
+
+ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+ {
++ smp_mb();
+ return !arch_spin_value_unlocked(*lock);
+ }
+
+diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
+index 0c9c8d7d0734..170a0346f756 100644
+--- a/arch/powerpc/lib/locks.c
++++ b/arch/powerpc/lib/locks.c
+@@ -70,12 +70,16 @@ void __rw_yield(arch_rwlock_t *rw)
+
+ void arch_spin_unlock_wait(arch_spinlock_t *lock)
+ {
++ smp_mb();
++
+ while (lock->slock) {
+ HMT_low();
+ if (SHARED_PROCESSOR)
+ __spin_yield(lock);
+ }
+ HMT_medium();
++
++ smp_mb();
+ }
+
+ EXPORT_SYMBOL(arch_spin_unlock_wait);
+diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
+index 74d1e780748b..2396dda282cd 100644
+--- a/arch/powerpc/perf/callchain.c
++++ b/arch/powerpc/perf/callchain.c
+@@ -35,7 +35,7 @@ static int valid_next_sp(unsigned long sp, unsigned long prev_sp)
+ return 0; /* must be 16-byte aligned */
+ if (!validate_sp(sp, current, STACK_FRAME_OVERHEAD))
+ return 0;
+- if (sp >= prev_sp + STACK_FRAME_OVERHEAD)
++ if (sp >= prev_sp + STACK_FRAME_MIN_SIZE)
+ return 1;
+ /*
+ * sp could decrease when we jump off an interrupt stack
+diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
+index fcba5e03839f..8904e1282562 100644
+--- a/arch/s390/include/asm/pgtable.h
++++ b/arch/s390/include/asm/pgtable.h
+@@ -1115,7 +1115,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+ {
+ pgste_t pgste;
+- pte_t pte;
++ pte_t pte, oldpte;
+ int young;
+
+ if (mm_has_pgste(vma->vm_mm)) {
+@@ -1123,12 +1123,13 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+ pgste = pgste_ipte_notify(vma->vm_mm, ptep, pgste);
+ }
+
+- pte = *ptep;
++ oldpte = pte = *ptep;
+ ptep_flush_direct(vma->vm_mm, addr, ptep);
+ young = pte_young(pte);
+ pte = pte_mkold(pte);
+
+ if (mm_has_pgste(vma->vm_mm)) {
++ pgste = pgste_update_all(&oldpte, pgste, vma->vm_mm);
+ pgste = pgste_set_pte(ptep, pgste, pte);
+ pgste_set_unlock(ptep, pgste);
+ } else
+@@ -1318,6 +1319,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
+ ptep_flush_direct(vma->vm_mm, address, ptep);
+
+ if (mm_has_pgste(vma->vm_mm)) {
++ pgste_set_key(ptep, pgste, entry, vma->vm_mm);
+ pgste = pgste_set_pte(ptep, pgste, entry);
+ pgste_set_unlock(ptep, pgste);
+ } else
+diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
+index 2f3e14fe91a4..0eaf87281f45 100644
+--- a/arch/s390/kvm/kvm-s390.c
++++ b/arch/s390/kvm/kvm-s390.c
+@@ -1286,19 +1286,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+
+ kvm_s390_vcpu_start(vcpu);
+
+- switch (kvm_run->exit_reason) {
+- case KVM_EXIT_S390_SIEIC:
+- case KVM_EXIT_UNKNOWN:
+- case KVM_EXIT_INTR:
+- case KVM_EXIT_S390_RESET:
+- case KVM_EXIT_S390_UCONTROL:
+- case KVM_EXIT_S390_TSCH:
+- case KVM_EXIT_DEBUG:
+- break;
+- default:
+- BUG();
+- }
+-
+ vcpu->arch.sie_block->gpsw.mask = kvm_run->psw_mask;
+ vcpu->arch.sie_block->gpsw.addr = kvm_run->psw_addr;
+ if (kvm_run->kvm_dirty_regs & KVM_SYNC_PREFIX) {
+diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
+index f90ad8592b36..98eeb823342c 100644
+--- a/arch/s390/mm/pgtable.c
++++ b/arch/s390/mm/pgtable.c
+@@ -986,11 +986,21 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep;
+
+ down_read(&mm->mmap_sem);
++retry:
+ ptep = get_locked_pte(current->mm, addr, &ptl);
+ if (unlikely(!ptep)) {
+ up_read(&mm->mmap_sem);
+ return -EFAULT;
+ }
++ if (!(pte_val(*ptep) & _PAGE_INVALID) &&
++ (pte_val(*ptep) & _PAGE_PROTECT)) {
++ pte_unmap_unlock(*ptep, ptl);
++ if (fixup_user_fault(current, mm, addr, FAULT_FLAG_WRITE)) {
++ up_read(&mm->mmap_sem);
++ return -EFAULT;
++ }
++ goto retry;
++ }
+
+ new = old = pgste_get_lock(ptep);
+ pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
+diff --git a/arch/x86/boot/compressed/aslr.c b/arch/x86/boot/compressed/aslr.c
+index fc6091abedb7..d39189ba7f8e 100644
+--- a/arch/x86/boot/compressed/aslr.c
++++ b/arch/x86/boot/compressed/aslr.c
+@@ -183,12 +183,27 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
+ static bool mem_avoid_overlap(struct mem_vector *img)
+ {
+ int i;
++ struct setup_data *ptr;
+
+ for (i = 0; i < MEM_AVOID_MAX; i++) {
+ if (mem_overlaps(img, &mem_avoid[i]))
+ return true;
+ }
+
++ /* Avoid all entries in the setup_data linked list. */
++ ptr = (struct setup_data *)(unsigned long)real_mode->hdr.setup_data;
++ while (ptr) {
++ struct mem_vector avoid;
++
++ avoid.start = (u64)ptr;
++ avoid.size = sizeof(*ptr) + ptr->len;
++
++ if (mem_overlaps(img, &avoid))
++ return true;
++
++ ptr = (struct setup_data *)(unsigned long)ptr->next;
++ }
++
+ return false;
+ }
+
+diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
+index b0910f97a3ea..ffb1733ac91f 100644
+--- a/arch/x86/include/asm/fixmap.h
++++ b/arch/x86/include/asm/fixmap.h
+@@ -106,14 +106,14 @@ enum fixed_addresses {
+ __end_of_permanent_fixed_addresses,
+
+ /*
+- * 256 temporary boot-time mappings, used by early_ioremap(),
++ * 512 temporary boot-time mappings, used by early_ioremap(),
+ * before ioremap() is functional.
+ *
+- * If necessary we round it up to the next 256 pages boundary so
++ * If necessary we round it up to the next 512 pages boundary so
+ * that we can have a single pgd entry and a single pte table:
+ */
+ #define NR_FIX_BTMAPS 64
+-#define FIX_BTMAPS_SLOTS 4
++#define FIX_BTMAPS_SLOTS 8
+ #define TOTAL_FIX_BTMAPS (NR_FIX_BTMAPS * FIX_BTMAPS_SLOTS)
+ FIX_BTMAP_END =
+ (__end_of_permanent_fixed_addresses ^
+diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
+index 5be9063545d2..3874693c0e53 100644
+--- a/arch/x86/include/asm/pgtable_64.h
++++ b/arch/x86/include/asm/pgtable_64.h
+@@ -19,6 +19,7 @@ extern pud_t level3_ident_pgt[512];
+ extern pmd_t level2_kernel_pgt[512];
+ extern pmd_t level2_fixmap_pgt[512];
+ extern pmd_t level2_ident_pgt[512];
++extern pte_t level1_fixmap_pgt[512];
+ extern pgd_t init_level4_pgt[];
+
+ #define swapper_pg_dir init_level4_pgt
+diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
+index 5492798930ef..215815b6407c 100644
+--- a/arch/x86/kernel/smpboot.c
++++ b/arch/x86/kernel/smpboot.c
+@@ -1292,6 +1292,9 @@ static void remove_siblinginfo(int cpu)
+
+ for_each_cpu(sibling, cpu_sibling_mask(cpu))
+ cpumask_clear_cpu(cpu, cpu_sibling_mask(sibling));
++ for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
++ cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
++ cpumask_clear(cpu_llc_shared_mask(cpu));
+ cpumask_clear(cpu_sibling_mask(cpu));
+ cpumask_clear(cpu_core_mask(cpu));
+ c->phys_proc_id = 0;
+diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
+index e8a1201c3293..16fb0099b7f2 100644
+--- a/arch/x86/xen/mmu.c
++++ b/arch/x86/xen/mmu.c
+@@ -1866,12 +1866,11 @@ static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
+ *
+ * We can construct this by grafting the Xen provided pagetable into
+ * head_64.S's preconstructed pagetables. We copy the Xen L2's into
+- * level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt. This
+- * means that only the kernel has a physical mapping to start with -
+- * but that's enough to get __va working. We need to fill in the rest
+- * of the physical mapping once some sort of allocator has been set
+- * up.
+- * NOTE: for PVH, the page tables are native.
++ * level2_ident_pgt, and level2_kernel_pgt. This means that only the
++ * kernel has a physical mapping to start with - but that's enough to
++ * get __va working. We need to fill in the rest of the physical
++ * mapping once some sort of allocator has been set up. NOTE: for
++ * PVH, the page tables are native.
+ */
+ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ {
+@@ -1902,8 +1901,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ /* L3_i[0] -> level2_ident_pgt */
+ convert_pfn_mfn(level3_ident_pgt);
+ /* L3_k[510] -> level2_kernel_pgt
+- * L3_i[511] -> level2_fixmap_pgt */
++ * L3_k[511] -> level2_fixmap_pgt */
+ convert_pfn_mfn(level3_kernel_pgt);
++
++ /* L3_k[511][506] -> level1_fixmap_pgt */
++ convert_pfn_mfn(level2_fixmap_pgt);
+ }
+ /* We get [511][511] and have Xen's version of level2_kernel_pgt */
+ l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
+@@ -1913,21 +1915,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ addr[1] = (unsigned long)l3;
+ addr[2] = (unsigned long)l2;
+ /* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
+- * Both L4[272][0] and L4[511][511] have entries that point to the same
++ * Both L4[272][0] and L4[511][510] have entries that point to the same
+ * L2 (PMD) tables. Meaning that if you modify it in __va space
+ * it will be also modified in the __ka space! (But if you just
+ * modify the PMD table to point to other PTE's or none, then you
+ * are OK - which is what cleanup_highmap does) */
+ copy_page(level2_ident_pgt, l2);
+- /* Graft it onto L4[511][511] */
++ /* Graft it onto L4[511][510] */
+ copy_page(level2_kernel_pgt, l2);
+
+- /* Get [511][510] and graft that in level2_fixmap_pgt */
+- l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
+- l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
+- copy_page(level2_fixmap_pgt, l2);
+- /* Note that we don't do anything with level1_fixmap_pgt which
+- * we don't need. */
+ if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+ /* Make pagetable pieces RO */
+ set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
+@@ -1937,6 +1933,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+ set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
+ set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
+ set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
++ set_page_prot(level1_fixmap_pgt, PAGE_KERNEL_RO);
+
+ /* Pin down new L4 */
+ pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
+diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
+index 4b0ca35a93b1..b2173e5da601 100644
+--- a/arch/xtensa/include/asm/pgtable.h
++++ b/arch/xtensa/include/asm/pgtable.h
+@@ -67,7 +67,12 @@
+ #define VMALLOC_START 0xC0000000
+ #define VMALLOC_END 0xC7FEFFFF
+ #define TLBTEMP_BASE_1 0xC7FF0000
+-#define TLBTEMP_BASE_2 0xC7FF8000
++#define TLBTEMP_BASE_2 (TLBTEMP_BASE_1 + DCACHE_WAY_SIZE)
++#if 2 * DCACHE_WAY_SIZE > ICACHE_WAY_SIZE
++#define TLBTEMP_SIZE (2 * DCACHE_WAY_SIZE)
++#else
++#define TLBTEMP_SIZE ICACHE_WAY_SIZE
++#endif
+
+ /*
+ * For the Xtensa architecture, the PTE layout is as follows:
+diff --git a/arch/xtensa/include/asm/uaccess.h b/arch/xtensa/include/asm/uaccess.h
+index fd686dc45d1a..c7211e7e182d 100644
+--- a/arch/xtensa/include/asm/uaccess.h
++++ b/arch/xtensa/include/asm/uaccess.h
+@@ -52,7 +52,12 @@
+ */
+ .macro get_fs ad, sp
+ GET_CURRENT(\ad,\sp)
++#if THREAD_CURRENT_DS > 1020
++ addi \ad, \ad, TASK_THREAD
++ l32i \ad, \ad, THREAD_CURRENT_DS - TASK_THREAD
++#else
+ l32i \ad, \ad, THREAD_CURRENT_DS
++#endif
+ .endm
+
+ /*
+diff --git a/arch/xtensa/include/uapi/asm/ioctls.h b/arch/xtensa/include/uapi/asm/ioctls.h
+index b4cb1100c0fb..a47909f0c34b 100644
+--- a/arch/xtensa/include/uapi/asm/ioctls.h
++++ b/arch/xtensa/include/uapi/asm/ioctls.h
+@@ -28,17 +28,17 @@
+ #define TCSETSW 0x5403
+ #define TCSETSF 0x5404
+
+-#define TCGETA _IOR('t', 23, struct termio)
+-#define TCSETA _IOW('t', 24, struct termio)
+-#define TCSETAW _IOW('t', 25, struct termio)
+-#define TCSETAF _IOW('t', 28, struct termio)
++#define TCGETA 0x80127417 /* _IOR('t', 23, struct termio) */
++#define TCSETA 0x40127418 /* _IOW('t', 24, struct termio) */
++#define TCSETAW 0x40127419 /* _IOW('t', 25, struct termio) */
++#define TCSETAF 0x4012741C /* _IOW('t', 28, struct termio) */
+
+ #define TCSBRK _IO('t', 29)
+ #define TCXONC _IO('t', 30)
+ #define TCFLSH _IO('t', 31)
+
+-#define TIOCSWINSZ _IOW('t', 103, struct winsize)
+-#define TIOCGWINSZ _IOR('t', 104, struct winsize)
++#define TIOCSWINSZ 0x40087467 /* _IOW('t', 103, struct winsize) */
++#define TIOCGWINSZ 0x80087468 /* _IOR('t', 104, struct winsize) */
+ #define TIOCSTART _IO('t', 110) /* start output, like ^Q */
+ #define TIOCSTOP _IO('t', 111) /* stop output, like ^S */
+ #define TIOCOUTQ _IOR('t', 115, int) /* output queue size */
+@@ -88,7 +88,6 @@
+ #define TIOCSETD _IOW('T', 35, int)
+ #define TIOCGETD _IOR('T', 36, int)
+ #define TCSBRKP _IOW('T', 37, int) /* Needed for POSIX tcsendbreak()*/
+-#define TIOCTTYGSTRUCT _IOR('T', 38, struct tty_struct) /* For debugging only*/
+ #define TIOCSBRK _IO('T', 39) /* BSD compatibility */
+ #define TIOCCBRK _IO('T', 40) /* BSD compatibility */
+ #define TIOCGSID _IOR('T', 41, pid_t) /* Return the session ID of FD*/
+@@ -114,8 +113,10 @@
+ #define TIOCSERGETLSR _IOR('T', 89, unsigned int) /* Get line status reg. */
+ /* ioctl (fd, TIOCSERGETLSR, &result) where result may be as below */
+ # define TIOCSER_TEMT 0x01 /* Transmitter physically empty */
+-#define TIOCSERGETMULTI _IOR('T', 90, struct serial_multiport_struct) /* Get multiport config */
+-#define TIOCSERSETMULTI _IOW('T', 91, struct serial_multiport_struct) /* Set multiport config */
++#define TIOCSERGETMULTI 0x80a8545a /* Get multiport config */
++ /* _IOR('T', 90, struct serial_multiport_struct) */
++#define TIOCSERSETMULTI 0x40a8545b /* Set multiport config */
++ /* _IOW('T', 91, struct serial_multiport_struct) */
+
+ #define TIOCMIWAIT _IO('T', 92) /* wait for a change on serial input line(s) */
+ #define TIOCGICOUNT 0x545D /* read serial port inline interrupt counts */
+diff --git a/arch/xtensa/kernel/entry.S b/arch/xtensa/kernel/entry.S
+index ef7f4990722b..a06b7efaae82 100644
+--- a/arch/xtensa/kernel/entry.S
++++ b/arch/xtensa/kernel/entry.S
+@@ -1001,9 +1001,8 @@ ENTRY(fast_syscall_xtensa)
+ movi a7, 4 # sizeof(unsigned int)
+ access_ok a3, a7, a0, a2, .Leac # a0: scratch reg, a2: sp
+
+- addi a6, a6, -1 # assuming SYS_XTENSA_ATOMIC_SET = 1
+- _bgeui a6, SYS_XTENSA_COUNT - 1, .Lill
+- _bnei a6, SYS_XTENSA_ATOMIC_CMP_SWP - 1, .Lnswp
++ _bgeui a6, SYS_XTENSA_COUNT, .Lill
++ _bnei a6, SYS_XTENSA_ATOMIC_CMP_SWP, .Lnswp
+
+ /* Fall through for ATOMIC_CMP_SWP. */
+
+@@ -1015,27 +1014,26 @@ TRY s32i a5, a3, 0 # different, modify value
+ l32i a7, a2, PT_AREG7 # restore a7
+ l32i a0, a2, PT_AREG0 # restore a0
+ movi a2, 1 # and return 1
+- addi a6, a6, 1 # restore a6 (really necessary?)
+ rfe
+
+ 1: l32i a7, a2, PT_AREG7 # restore a7
+ l32i a0, a2, PT_AREG0 # restore a0
+ movi a2, 0 # return 0 (note that we cannot set
+- addi a6, a6, 1 # restore a6 (really necessary?)
+ rfe
+
+ .Lnswp: /* Atomic set, add, and exg_add. */
+
+ TRY l32i a7, a3, 0 # orig
++ addi a6, a6, -SYS_XTENSA_ATOMIC_SET
+ add a0, a4, a7 # + arg
+ moveqz a0, a4, a6 # set
++ addi a6, a6, SYS_XTENSA_ATOMIC_SET
+ TRY s32i a0, a3, 0 # write new value
+
+ mov a0, a2
+ mov a2, a7
+ l32i a7, a0, PT_AREG7 # restore a7
+ l32i a0, a0, PT_AREG0 # restore a0
+- addi a6, a6, 1 # restore a6 (really necessary?)
+ rfe
+
+ CATCH
+@@ -1044,7 +1042,7 @@ CATCH
+ movi a2, -EFAULT
+ rfe
+
+-.Lill: l32i a7, a2, PT_AREG0 # restore a7
++.Lill: l32i a7, a2, PT_AREG7 # restore a7
+ l32i a0, a2, PT_AREG0 # restore a0
+ movi a2, -EINVAL
+ rfe
+@@ -1565,7 +1563,7 @@ ENTRY(fast_second_level_miss)
+ rsr a0, excvaddr
+ bltu a0, a3, 2f
+
+- addi a1, a0, -(2 << (DCACHE_ALIAS_ORDER + PAGE_SHIFT))
++ addi a1, a0, -TLBTEMP_SIZE
+ bgeu a1, a3, 2f
+
+ /* Check if we have to restore an ITLB mapping. */
+@@ -1820,7 +1818,6 @@ ENTRY(_switch_to)
+
+ entry a1, 16
+
+- mov a10, a2 # preserve 'prev' (a2)
+ mov a11, a3 # and 'next' (a3)
+
+ l32i a4, a2, TASK_THREAD_INFO
+@@ -1828,8 +1825,14 @@ ENTRY(_switch_to)
+
+ save_xtregs_user a4 a6 a8 a9 a12 a13 THREAD_XTREGS_USER
+
+- s32i a0, a10, THREAD_RA # save return address
+- s32i a1, a10, THREAD_SP # save stack pointer
++#if THREAD_RA > 1020 || THREAD_SP > 1020
++ addi a10, a2, TASK_THREAD
++ s32i a0, a10, THREAD_RA - TASK_THREAD # save return address
++ s32i a1, a10, THREAD_SP - TASK_THREAD # save stack pointer
++#else
++ s32i a0, a2, THREAD_RA # save return address
++ s32i a1, a2, THREAD_SP # save stack pointer
++#endif
+
+ /* Disable ints while we manipulate the stack pointer. */
+
+@@ -1870,7 +1873,6 @@ ENTRY(_switch_to)
+ load_xtregs_user a5 a6 a8 a9 a12 a13 THREAD_XTREGS_USER
+
+ wsr a14, ps
+- mov a2, a10 # return 'prev'
+ rsync
+
+ retw
+diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
+index 2d9cc6dbfd78..e8b76b8e4b29 100644
+--- a/arch/xtensa/kernel/pci-dma.c
++++ b/arch/xtensa/kernel/pci-dma.c
+@@ -49,9 +49,8 @@ dma_alloc_coherent(struct device *dev,size_t size,dma_addr_t *handle,gfp_t flag)
+
+ /* We currently don't support coherent memory outside KSEG */
+
+- if (ret < XCHAL_KSEG_CACHED_VADDR
+- || ret >= XCHAL_KSEG_CACHED_VADDR + XCHAL_KSEG_SIZE)
+- BUG();
++ BUG_ON(ret < XCHAL_KSEG_CACHED_VADDR ||
++ ret > XCHAL_KSEG_CACHED_VADDR + XCHAL_KSEG_SIZE - 1);
+
+
+ if (ret != 0) {
+@@ -68,10 +67,11 @@ EXPORT_SYMBOL(dma_alloc_coherent);
+ void dma_free_coherent(struct device *hwdev, size_t size,
+ void *vaddr, dma_addr_t dma_handle)
+ {
+- long addr=(long)vaddr+XCHAL_KSEG_CACHED_VADDR-XCHAL_KSEG_BYPASS_VADDR;
++ unsigned long addr = (unsigned long)vaddr +
++ XCHAL_KSEG_CACHED_VADDR - XCHAL_KSEG_BYPASS_VADDR;
+
+- if (addr < 0 || addr >= XCHAL_KSEG_SIZE)
+- BUG();
++ BUG_ON(addr < XCHAL_KSEG_CACHED_VADDR ||
++ addr > XCHAL_KSEG_CACHED_VADDR + XCHAL_KSEG_SIZE - 1);
+
+ free_pages(addr, get_order(size));
+ }
+diff --git a/block/blk-mq.c b/block/blk-mq.c
+index ad69ef657e85..06ac59f5bb5a 100644
+--- a/block/blk-mq.c
++++ b/block/blk-mq.c
+@@ -219,7 +219,6 @@ __blk_mq_alloc_request(struct blk_mq_alloc_data *data, int rw)
+ if (tag != BLK_MQ_TAG_FAIL) {
+ rq = data->hctx->tags->rqs[tag];
+
+- rq->cmd_flags = 0;
+ if (blk_mq_tag_busy(data->hctx)) {
+ rq->cmd_flags = REQ_MQ_INFLIGHT;
+ atomic_inc(&data->hctx->nr_active);
+@@ -274,6 +273,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
+
+ if (rq->cmd_flags & REQ_MQ_INFLIGHT)
+ atomic_dec(&hctx->nr_active);
++ rq->cmd_flags = 0;
+
+ clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+ blk_mq_put_tag(hctx, tag, &ctx->last_tag);
+@@ -1411,6 +1411,8 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
+ left -= to_do * rq_size;
+ for (j = 0; j < to_do; j++) {
+ tags->rqs[i] = p;
++ tags->rqs[i]->atomic_flags = 0;
++ tags->rqs[i]->cmd_flags = 0;
+ if (set->ops->init_request) {
+ if (set->ops->init_request(set->driver_data,
+ tags->rqs[i], hctx_idx, i,
+diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
+index cadc37841744..d7494637c5db 100644
+--- a/block/cfq-iosched.c
++++ b/block/cfq-iosched.c
+@@ -1275,12 +1275,16 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+ static void
+ cfq_update_group_weight(struct cfq_group *cfqg)
+ {
+- BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+-
+ if (cfqg->new_weight) {
+ cfqg->weight = cfqg->new_weight;
+ cfqg->new_weight = 0;
+ }
++}
++
++static void
++cfq_update_group_leaf_weight(struct cfq_group *cfqg)
++{
++ BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+
+ if (cfqg->new_leaf_weight) {
+ cfqg->leaf_weight = cfqg->new_leaf_weight;
+@@ -1299,7 +1303,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+ /* add to the service tree */
+ BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+
+- cfq_update_group_weight(cfqg);
++ cfq_update_group_leaf_weight(cfqg);
+ __cfq_group_service_tree_add(st, cfqg);
+
+ /*
+@@ -1323,6 +1327,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+ */
+ while ((parent = cfqg_parent(pos))) {
+ if (propagate) {
++ cfq_update_group_weight(pos);
+ propagate = !parent->nr_active++;
+ parent->children_weight += pos->weight;
+ }
+diff --git a/block/genhd.c b/block/genhd.c
+index 791f41943132..e6723bd4d7a1 100644
+--- a/block/genhd.c
++++ b/block/genhd.c
+@@ -28,10 +28,10 @@ struct kobject *block_depr;
+ /* for extended dynamic devt allocation, currently only one major is used */
+ #define NR_EXT_DEVT (1 << MINORBITS)
+
+-/* For extended devt allocation. ext_devt_mutex prevents look up
++/* For extended devt allocation. ext_devt_lock prevents look up
+ * results from going away underneath its user.
+ */
+-static DEFINE_MUTEX(ext_devt_mutex);
++static DEFINE_SPINLOCK(ext_devt_lock);
+ static DEFINE_IDR(ext_devt_idr);
+
+ static struct device_type disk_type;
+@@ -420,9 +420,13 @@ int blk_alloc_devt(struct hd_struct *part, dev_t *devt)
+ }
+
+ /* allocate ext devt */
+- mutex_lock(&ext_devt_mutex);
+- idx = idr_alloc(&ext_devt_idr, part, 0, NR_EXT_DEVT, GFP_KERNEL);
+- mutex_unlock(&ext_devt_mutex);
++ idr_preload(GFP_KERNEL);
++
++ spin_lock(&ext_devt_lock);
++ idx = idr_alloc(&ext_devt_idr, part, 0, NR_EXT_DEVT, GFP_NOWAIT);
++ spin_unlock(&ext_devt_lock);
++
++ idr_preload_end();
+ if (idx < 0)
+ return idx == -ENOSPC ? -EBUSY : idx;
+
+@@ -441,15 +445,13 @@ int blk_alloc_devt(struct hd_struct *part, dev_t *devt)
+ */
+ void blk_free_devt(dev_t devt)
+ {
+- might_sleep();
+-
+ if (devt == MKDEV(0, 0))
+ return;
+
+ if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
+- mutex_lock(&ext_devt_mutex);
++ spin_lock(&ext_devt_lock);
+ idr_remove(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
+- mutex_unlock(&ext_devt_mutex);
++ spin_unlock(&ext_devt_lock);
+ }
+ }
+
+@@ -665,7 +667,6 @@ void del_gendisk(struct gendisk *disk)
+ sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
+ pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
+ device_del(disk_to_dev(disk));
+- blk_free_devt(disk_to_dev(disk)->devt);
+ }
+ EXPORT_SYMBOL(del_gendisk);
+
+@@ -690,13 +691,13 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
+ } else {
+ struct hd_struct *part;
+
+- mutex_lock(&ext_devt_mutex);
++ spin_lock(&ext_devt_lock);
+ part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
+ if (part && get_disk(part_to_disk(part))) {
+ *partno = part->partno;
+ disk = part_to_disk(part);
+ }
+- mutex_unlock(&ext_devt_mutex);
++ spin_unlock(&ext_devt_lock);
+ }
+
+ return disk;
+@@ -1098,6 +1099,7 @@ static void disk_release(struct device *dev)
+ {
+ struct gendisk *disk = dev_to_disk(dev);
+
++ blk_free_devt(dev->devt);
+ disk_release_events(disk);
+ kfree(disk->random);
+ disk_replace_part_tbl(disk, NULL);
+diff --git a/block/partition-generic.c b/block/partition-generic.c
+index 789cdea05893..0d9e5f97f0a8 100644
+--- a/block/partition-generic.c
++++ b/block/partition-generic.c
+@@ -211,6 +211,7 @@ static const struct attribute_group *part_attr_groups[] = {
+ static void part_release(struct device *dev)
+ {
+ struct hd_struct *p = dev_to_part(dev);
++ blk_free_devt(dev->devt);
+ free_part_stats(p);
+ free_part_info(p);
+ kfree(p);
+@@ -253,7 +254,6 @@ void delete_partition(struct gendisk *disk, int partno)
+ rcu_assign_pointer(ptbl->last_lookup, NULL);
+ kobject_put(part->holder_dir);
+ device_del(part_to_dev(part));
+- blk_free_devt(part_devt(part));
+
+ hd_struct_put(part);
+ }
+diff --git a/block/partitions/aix.c b/block/partitions/aix.c
+index 43be471d9b1d..0931f5136ab2 100644
+--- a/block/partitions/aix.c
++++ b/block/partitions/aix.c
+@@ -253,7 +253,7 @@ int aix_partition(struct parsed_partitions *state)
+ continue;
+ }
+ lv_ix = be16_to_cpu(p->lv_ix) - 1;
+- if (lv_ix > state->limit) {
++ if (lv_ix >= state->limit) {
+ cur_lv_ix = -1;
+ continue;
+ }
+diff --git a/drivers/acpi/acpi_cmos_rtc.c b/drivers/acpi/acpi_cmos_rtc.c
+index 2da8660262e5..81dc75033f15 100644
+--- a/drivers/acpi/acpi_cmos_rtc.c
++++ b/drivers/acpi/acpi_cmos_rtc.c
+@@ -33,7 +33,7 @@ acpi_cmos_rtc_space_handler(u32 function, acpi_physical_address address,
+ void *handler_context, void *region_context)
+ {
+ int i;
+- u8 *value = (u8 *)&value64;
++ u8 *value = (u8 *)value64;
+
+ if (address > 0xff || !value64)
+ return AE_BAD_PARAMETER;
+diff --git a/drivers/acpi/acpi_lpss.c b/drivers/acpi/acpi_lpss.c
+index 9cb65b0e7597..2f65b0969edb 100644
+--- a/drivers/acpi/acpi_lpss.c
++++ b/drivers/acpi/acpi_lpss.c
+@@ -392,7 +392,6 @@ static int acpi_lpss_create_device(struct acpi_device *adev,
+ adev->driver_data = pdata;
+ pdev = acpi_create_platform_device(adev);
+ if (!IS_ERR_OR_NULL(pdev)) {
+- device_enable_async_suspend(&pdev->dev);
+ return 1;
+ }
+
+@@ -583,7 +582,7 @@ static int acpi_lpss_suspend_late(struct device *dev)
+ return acpi_dev_suspend_late(dev);
+ }
+
+-static int acpi_lpss_restore_early(struct device *dev)
++static int acpi_lpss_resume_early(struct device *dev)
+ {
+ int ret = acpi_dev_resume_early(dev);
+
+@@ -623,15 +622,15 @@ static int acpi_lpss_runtime_resume(struct device *dev)
+ static struct dev_pm_domain acpi_lpss_pm_domain = {
+ .ops = {
+ #ifdef CONFIG_PM_SLEEP
+- .suspend_late = acpi_lpss_suspend_late,
+- .restore_early = acpi_lpss_restore_early,
+ .prepare = acpi_subsys_prepare,
+ .complete = acpi_subsys_complete,
+ .suspend = acpi_subsys_suspend,
+- .resume_early = acpi_subsys_resume_early,
++ .suspend_late = acpi_lpss_suspend_late,
++ .resume_early = acpi_lpss_resume_early,
+ .freeze = acpi_subsys_freeze,
+ .poweroff = acpi_subsys_suspend,
+- .poweroff_late = acpi_subsys_suspend_late,
++ .poweroff_late = acpi_lpss_suspend_late,
++ .restore_early = acpi_lpss_resume_early,
+ #endif
+ #ifdef CONFIG_PM_RUNTIME
+ .runtime_suspend = acpi_lpss_runtime_suspend,
+diff --git a/drivers/acpi/acpica/aclocal.h b/drivers/acpi/acpica/aclocal.h
+index 91f801a2e689..494775a67ffa 100644
+--- a/drivers/acpi/acpica/aclocal.h
++++ b/drivers/acpi/acpica/aclocal.h
+@@ -254,6 +254,7 @@ struct acpi_create_field_info {
+ u32 field_bit_position;
+ u32 field_bit_length;
+ u16 resource_length;
++ u16 pin_number_index;
+ u8 field_flags;
+ u8 attribute;
+ u8 field_type;
+diff --git a/drivers/acpi/acpica/acobject.h b/drivers/acpi/acpica/acobject.h
+index 22fb6449d3d6..8abb393dafab 100644
+--- a/drivers/acpi/acpica/acobject.h
++++ b/drivers/acpi/acpica/acobject.h
+@@ -264,6 +264,7 @@ struct acpi_object_region_field {
+ ACPI_OBJECT_COMMON_HEADER ACPI_COMMON_FIELD_INFO u16 resource_length;
+ union acpi_operand_object *region_obj; /* Containing op_region object */
+ u8 *resource_buffer; /* resource_template for serial regions/fields */
++ u16 pin_number_index; /* Index relative to previous Connection/Template */
+ };
+
+ struct acpi_object_bank_field {
+diff --git a/drivers/acpi/acpica/dsfield.c b/drivers/acpi/acpica/dsfield.c
+index 3661c8e90540..c57666196672 100644
+--- a/drivers/acpi/acpica/dsfield.c
++++ b/drivers/acpi/acpica/dsfield.c
+@@ -360,6 +360,7 @@ acpi_ds_get_field_names(struct acpi_create_field_info *info,
+ */
+ info->resource_buffer = NULL;
+ info->connection_node = NULL;
++ info->pin_number_index = 0;
+
+ /*
+ * A Connection() is either an actual resource descriptor (buffer)
+@@ -437,6 +438,7 @@ acpi_ds_get_field_names(struct acpi_create_field_info *info,
+ }
+
+ info->field_bit_position += info->field_bit_length;
++ info->pin_number_index++; /* Index relative to previous Connection() */
+ break;
+
+ default:
+diff --git a/drivers/acpi/acpica/evregion.c b/drivers/acpi/acpica/evregion.c
+index 9957297d1580..8eb8575e8c16 100644
+--- a/drivers/acpi/acpica/evregion.c
++++ b/drivers/acpi/acpica/evregion.c
+@@ -142,6 +142,7 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+ union acpi_operand_object *region_obj2;
+ void *region_context = NULL;
+ struct acpi_connection_info *context;
++ acpi_physical_address address;
+
+ ACPI_FUNCTION_TRACE(ev_address_space_dispatch);
+
+@@ -231,25 +232,23 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+ /* We have everything we need, we can invoke the address space handler */
+
+ handler = handler_desc->address_space.handler;
+-
+- ACPI_DEBUG_PRINT((ACPI_DB_OPREGION,
+- "Handler %p (@%p) Address %8.8X%8.8X [%s]\n",
+-			  &region_obj->region.handler->address_space, handler,
+- ACPI_FORMAT_NATIVE_UINT(region_obj->region.address +
+- region_offset),
+- acpi_ut_get_region_name(region_obj->region.
+- space_id)));
++ address = (region_obj->region.address + region_offset);
+
+ /*
+ * Special handling for generic_serial_bus and general_purpose_io:
+ * There are three extra parameters that must be passed to the
+ * handler via the context:
+- * 1) Connection buffer, a resource template from Connection() op.
+- * 2) Length of the above buffer.
+- * 3) Actual access length from the access_as() op.
++ * 1) Connection buffer, a resource template from Connection() op
++ * 2) Length of the above buffer
++ * 3) Actual access length from the access_as() op
++ *
++ * In addition, for general_purpose_io, the Address and bit_width fields
++ * are defined as follows:
++ * 1) Address is the pin number index of the field (bit offset from
++ * the previous Connection)
++ * 2) bit_width is the actual bit length of the field (number of pins)
+ */
+- if (((region_obj->region.space_id == ACPI_ADR_SPACE_GSBUS) ||
+- (region_obj->region.space_id == ACPI_ADR_SPACE_GPIO)) &&
++ if ((region_obj->region.space_id == ACPI_ADR_SPACE_GSBUS) &&
+ context && field_obj) {
+
+ /* Get the Connection (resource_template) buffer */
+@@ -258,6 +257,24 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+ context->length = field_obj->field.resource_length;
+ context->access_length = field_obj->field.access_length;
+ }
++ if ((region_obj->region.space_id == ACPI_ADR_SPACE_GPIO) &&
++ context && field_obj) {
++
++ /* Get the Connection (resource_template) buffer */
++
++ context->connection = field_obj->field.resource_buffer;
++ context->length = field_obj->field.resource_length;
++ context->access_length = field_obj->field.access_length;
++ address = field_obj->field.pin_number_index;
++ bit_width = field_obj->field.bit_length;
++ }
++
++ ACPI_DEBUG_PRINT((ACPI_DB_OPREGION,
++ "Handler %p (@%p) Address %8.8X%8.8X [%s]\n",
++			  &region_obj->region.handler->address_space, handler,
++ ACPI_FORMAT_NATIVE_UINT(address),
++ acpi_ut_get_region_name(region_obj->region.
++ space_id)));
+
+ if (!(handler_desc->address_space.handler_flags &
+ ACPI_ADDR_HANDLER_DEFAULT_INSTALLED)) {
+@@ -271,9 +288,7 @@ acpi_ev_address_space_dispatch(union acpi_operand_object *region_obj,
+
+ /* Call the handler */
+
+- status = handler(function,
+- (region_obj->region.address + region_offset),
+- bit_width, value, context,
++ status = handler(function, address, bit_width, value, context,
+ region_obj2->extra.region_context);
+
+ if (ACPI_FAILURE(status)) {
+diff --git a/drivers/acpi/acpica/exfield.c b/drivers/acpi/acpica/exfield.c
+index 12878e1982f7..9dabfd2acd4d 100644
+--- a/drivers/acpi/acpica/exfield.c
++++ b/drivers/acpi/acpica/exfield.c
+@@ -254,6 +254,37 @@ acpi_ex_read_data_from_field(struct acpi_walk_state * walk_state,
+ buffer = &buffer_desc->integer.value;
+ }
+
++ if ((obj_desc->common.type == ACPI_TYPE_LOCAL_REGION_FIELD) &&
++ (obj_desc->field.region_obj->region.space_id ==
++ ACPI_ADR_SPACE_GPIO)) {
++ /*
++ * For GPIO (general_purpose_io), the Address will be the bit offset
++ * from the previous Connection() operator, making it effectively a
++ * pin number index. The bit_length is the length of the field, which
++ * is thus the number of pins.
++ */
++ ACPI_DEBUG_PRINT((ACPI_DB_BFIELD,
++ "GPIO FieldRead [FROM]: Pin %u Bits %u\n",
++ obj_desc->field.pin_number_index,
++ obj_desc->field.bit_length));
++
++ /* Lock entire transaction if requested */
++
++ acpi_ex_acquire_global_lock(obj_desc->common_field.field_flags);
++
++ /* Perform the write */
++
++ status = acpi_ex_access_region(obj_desc, 0,
++ (u64 *)buffer, ACPI_READ);
++ acpi_ex_release_global_lock(obj_desc->common_field.field_flags);
++ if (ACPI_FAILURE(status)) {
++ acpi_ut_remove_reference(buffer_desc);
++ } else {
++ *ret_buffer_desc = buffer_desc;
++ }
++ return_ACPI_STATUS(status);
++ }
++
+ ACPI_DEBUG_PRINT((ACPI_DB_BFIELD,
+ "FieldRead [TO]: Obj %p, Type %X, Buf %p, ByteLen %X\n",
+ obj_desc, obj_desc->common.type, buffer,
+@@ -415,6 +446,42 @@ acpi_ex_write_data_to_field(union acpi_operand_object *source_desc,
+
+ *result_desc = buffer_desc;
+ return_ACPI_STATUS(status);
++ } else if ((obj_desc->common.type == ACPI_TYPE_LOCAL_REGION_FIELD) &&
++ (obj_desc->field.region_obj->region.space_id ==
++ ACPI_ADR_SPACE_GPIO)) {
++ /*
++ * For GPIO (general_purpose_io), we will bypass the entire field
++ * mechanism and handoff the bit address and bit width directly to
++ * the handler. The Address will be the bit offset
++ * from the previous Connection() operator, making it effectively a
++ * pin number index. The bit_length is the length of the field, which
++ * is thus the number of pins.
++ */
++ if (source_desc->common.type != ACPI_TYPE_INTEGER) {
++ return_ACPI_STATUS(AE_AML_OPERAND_TYPE);
++ }
++
++ ACPI_DEBUG_PRINT((ACPI_DB_BFIELD,
++ "GPIO FieldWrite [FROM]: (%s:%X), Val %.8X [TO]: Pin %u Bits %u\n",
++ acpi_ut_get_type_name(source_desc->common.
++ type),
++ source_desc->common.type,
++ (u32)source_desc->integer.value,
++ obj_desc->field.pin_number_index,
++ obj_desc->field.bit_length));
++
++ buffer = &source_desc->integer.value;
++
++ /* Lock entire transaction if requested */
++
++ acpi_ex_acquire_global_lock(obj_desc->common_field.field_flags);
++
++ /* Perform the write */
++
++ status = acpi_ex_access_region(obj_desc, 0,
++ (u64 *)buffer, ACPI_WRITE);
++ acpi_ex_release_global_lock(obj_desc->common_field.field_flags);
++ return_ACPI_STATUS(status);
+ }
+
+ /* Get a pointer to the data to be written */
+diff --git a/drivers/acpi/acpica/exprep.c b/drivers/acpi/acpica/exprep.c
+index ee3f872870bc..118e942005e5 100644
+--- a/drivers/acpi/acpica/exprep.c
++++ b/drivers/acpi/acpica/exprep.c
+@@ -484,6 +484,8 @@ acpi_status acpi_ex_prep_field_value(struct acpi_create_field_info *info)
+ obj_desc->field.resource_length = info->resource_length;
+ }
+
++ obj_desc->field.pin_number_index = info->pin_number_index;
++
+ /* Allow full data read from EC address space */
+
+ if ((obj_desc->field.region_obj->region.space_id ==
+diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
+index 130f513e08c9..bc0b286ff2ba 100644
+--- a/drivers/acpi/battery.c
++++ b/drivers/acpi/battery.c
+@@ -535,20 +535,6 @@ static int acpi_battery_get_state(struct acpi_battery *battery)
+ " invalid.\n");
+ }
+
+- /*
+- * When fully charged, some batteries wrongly report
+- * capacity_now = design_capacity instead of = full_charge_capacity
+- */
+- if (battery->capacity_now > battery->full_charge_capacity
+- && battery->full_charge_capacity != ACPI_BATTERY_VALUE_UNKNOWN) {
+- battery->capacity_now = battery->full_charge_capacity;
+- if (battery->capacity_now != battery->design_capacity)
+- printk_once(KERN_WARNING FW_BUG
+- "battery: reported current charge level (%d) "
+- "is higher than reported maximum charge level (%d).\n",
+- battery->capacity_now, battery->full_charge_capacity);
+- }
+-
+ if (test_bit(ACPI_BATTERY_QUIRK_PERCENTAGE_CAPACITY, &battery->flags)
+ && battery->capacity_now >= 0 && battery->capacity_now <= 100)
+ battery->capacity_now = (battery->capacity_now *
+diff --git a/drivers/acpi/container.c b/drivers/acpi/container.c
+index 76f7cff64594..c8ead9f97375 100644
+--- a/drivers/acpi/container.c
++++ b/drivers/acpi/container.c
+@@ -99,6 +99,13 @@ static void container_device_detach(struct acpi_device *adev)
+ device_unregister(dev);
+ }
+
++static void container_device_online(struct acpi_device *adev)
++{
++ struct device *dev = acpi_driver_data(adev);
++
++ kobject_uevent(&dev->kobj, KOBJ_ONLINE);
++}
++
+ static struct acpi_scan_handler container_handler = {
+ .ids = container_device_ids,
+ .attach = container_device_attach,
+@@ -106,6 +113,7 @@ static struct acpi_scan_handler container_handler = {
+ .hotplug = {
+ .enabled = true,
+ .demand_offline = true,
++ .notify_online = container_device_online,
+ },
+ };
+
+diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
+index 551f29127369..2e9ed9a4f13f 100644
+--- a/drivers/acpi/scan.c
++++ b/drivers/acpi/scan.c
+@@ -128,7 +128,7 @@ static int create_modalias(struct acpi_device *acpi_dev, char *modalias,
+ list_for_each_entry(id, &acpi_dev->pnp.ids, list) {
+ count = snprintf(&modalias[len], size, "%s:", id->id);
+ if (count < 0)
+- return EINVAL;
++ return -EINVAL;
+ if (count >= size)
+ return -ENOMEM;
+ len += count;
+@@ -2184,6 +2184,9 @@ static void acpi_bus_attach(struct acpi_device *device)
+ ok:
+ list_for_each_entry(child, &device->children, node)
+ acpi_bus_attach(child);
++
++ if (device->handler && device->handler->hotplug.notify_online)
++ device->handler->hotplug.notify_online(device);
+ }
+
+ /**
+diff --git a/drivers/acpi/video.c b/drivers/acpi/video.c
+index 4834b4cae540..f1e3496c00c7 100644
+--- a/drivers/acpi/video.c
++++ b/drivers/acpi/video.c
+@@ -675,6 +675,14 @@ static struct dmi_system_id video_dmi_table[] __initdata = {
+ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T520"),
+ },
+ },
++ {
++ .callback = video_disable_native_backlight,
++ .ident = "ThinkPad X201s",
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
++ DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad X201s"),
++ },
++ },
+
+ /* The native backlight controls do not work on some older machines */
+ {
+diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
+index 4cd52a4541a9..f0f8ae1197e2 100644
+--- a/drivers/ata/ahci.c
++++ b/drivers/ata/ahci.c
+@@ -305,6 +305,14 @@ static const struct pci_device_id ahci_pci_tbl[] = {
+ { PCI_VDEVICE(INTEL, 0x9c85), board_ahci }, /* Wildcat Point-LP RAID */
+ { PCI_VDEVICE(INTEL, 0x9c87), board_ahci }, /* Wildcat Point-LP RAID */
+ { PCI_VDEVICE(INTEL, 0x9c8f), board_ahci }, /* Wildcat Point-LP RAID */
++ { PCI_VDEVICE(INTEL, 0x8c82), board_ahci }, /* 9 Series AHCI */
++ { PCI_VDEVICE(INTEL, 0x8c83), board_ahci }, /* 9 Series AHCI */
++ { PCI_VDEVICE(INTEL, 0x8c84), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c85), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c86), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c87), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c8e), board_ahci }, /* 9 Series RAID */
++ { PCI_VDEVICE(INTEL, 0x8c8f), board_ahci }, /* 9 Series RAID */
+
+ /* JMicron 360/1/3/5/6, match class to avoid IDE function */
+ { PCI_VENDOR_ID_JMICRON, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
+@@ -442,6 +450,8 @@ static const struct pci_device_id ahci_pci_tbl[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x917a),
+ .driver_data = board_ahci_yes_fbs }, /* 88se9172 */
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9172),
++ .driver_data = board_ahci_yes_fbs }, /* 88se9182 */
++ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9182),
+ .driver_data = board_ahci_yes_fbs }, /* 88se9172 */
+ { PCI_DEVICE(PCI_VENDOR_ID_MARVELL_EXT, 0x9192),
+ .driver_data = board_ahci_yes_fbs }, /* 88se9172 on some Gigabyte */
+diff --git a/drivers/ata/ahci_xgene.c b/drivers/ata/ahci_xgene.c
+index ee3a3659bd9e..10d524699676 100644
+--- a/drivers/ata/ahci_xgene.c
++++ b/drivers/ata/ahci_xgene.c
+@@ -337,7 +337,7 @@ static struct ata_port_operations xgene_ahci_ops = {
+ };
+
+ static const struct ata_port_info xgene_ahci_port_info = {
+- .flags = AHCI_FLAG_COMMON | ATA_FLAG_NCQ,
++ .flags = AHCI_FLAG_COMMON,
+ .pio_mask = ATA_PIO4,
+ .udma_mask = ATA_UDMA6,
+ .port_ops = &xgene_ahci_ops,
+@@ -484,7 +484,7 @@ static int xgene_ahci_probe(struct platform_device *pdev)
+ goto disable_resources;
+ }
+
+- hflags = AHCI_HFLAG_NO_PMP | AHCI_HFLAG_YES_NCQ;
++ hflags = AHCI_HFLAG_NO_PMP | AHCI_HFLAG_NO_NCQ;
+
+ rc = ahci_platform_init_host(pdev, hpriv, &xgene_ahci_port_info,
+ hflags, 0, 0);
+diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
+index 893e30e9a9ef..ffbe625e6fd2 100644
+--- a/drivers/ata/ata_piix.c
++++ b/drivers/ata/ata_piix.c
+@@ -340,6 +340,14 @@ static const struct pci_device_id piix_pci_tbl[] = {
+ { 0x8086, 0x0F21, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata_byt },
+ /* SATA Controller IDE (Coleto Creek) */
+ { 0x8086, 0x23a6, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c88, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata_snb },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c89, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_2port_sata_snb },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c80, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_sata_snb },
++ /* SATA Controller IDE (9 Series) */
++ { 0x8086, 0x8c81, PCI_ANY_ID, PCI_ANY_ID, 0, 0, ich8_sata_snb },
+
+ { } /* terminate list */
+ };
+diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
+index 677c0c1b03bd..e7f30b59bc8b 100644
+--- a/drivers/ata/libata-core.c
++++ b/drivers/ata/libata-core.c
+@@ -4227,7 +4227,7 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
+ { "Micron_M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+ { "Crucial_CT???M500SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+ { "Micron_M550*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+- { "Crucial_CT???M550SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
++ { "Crucial_CT*M550SSD*", NULL, ATA_HORKAGE_NO_NCQ_TRIM, },
+
+ /*
+ * Some WD SATA-I drives spin up and down erratically when the link
+diff --git a/drivers/ata/pata_scc.c b/drivers/ata/pata_scc.c
+index 4e006d74bef8..7f4cb76ed9fa 100644
+--- a/drivers/ata/pata_scc.c
++++ b/drivers/ata/pata_scc.c
+@@ -585,7 +585,7 @@ static int scc_wait_after_reset(struct ata_link *link, unsigned int devmask,
+ * Note: Original code is ata_bus_softreset().
+ */
+
+-static unsigned int scc_bus_softreset(struct ata_port *ap, unsigned int devmask,
++static int scc_bus_softreset(struct ata_port *ap, unsigned int devmask,
+ unsigned long deadline)
+ {
+ struct ata_ioports *ioaddr = &ap->ioaddr;
+@@ -599,9 +599,7 @@ static unsigned int scc_bus_softreset(struct ata_port *ap, unsigned int devmask,
+ udelay(20);
+ out_be32(ioaddr->ctl_addr, ap->ctl);
+
+- scc_wait_after_reset(&ap->link, devmask, deadline);
+-
+- return 0;
++ return scc_wait_after_reset(&ap->link, devmask, deadline);
+ }
+
+ /**
+@@ -618,7 +616,8 @@ static int scc_softreset(struct ata_link *link, unsigned int *classes,
+ {
+ struct ata_port *ap = link->ap;
+ unsigned int slave_possible = ap->flags & ATA_FLAG_SLAVE_POSS;
+- unsigned int devmask = 0, err_mask;
++ unsigned int devmask = 0;
++ int rc;
+ u8 err;
+
+ DPRINTK("ENTER\n");
+@@ -634,9 +633,9 @@ static int scc_softreset(struct ata_link *link, unsigned int *classes,
+
+ /* issue bus reset */
+ DPRINTK("about to softreset, devmask=%x\n", devmask);
+- err_mask = scc_bus_softreset(ap, devmask, deadline);
+- if (err_mask) {
+- ata_port_err(ap, "SRST failed (err_mask=0x%x)\n", err_mask);
++ rc = scc_bus_softreset(ap, devmask, deadline);
++ if (rc) {
++ ata_port_err(ap, "SRST failed (err_mask=0x%x)\n", rc);
+ return -EIO;
+ }
+
+diff --git a/drivers/base/regmap/internal.h b/drivers/base/regmap/internal.h
+index 7d1326985bee..bfc90b8547f2 100644
+--- a/drivers/base/regmap/internal.h
++++ b/drivers/base/regmap/internal.h
+@@ -146,6 +146,9 @@ struct regcache_ops {
+ enum regcache_type type;
+ int (*init)(struct regmap *map);
+ int (*exit)(struct regmap *map);
++#ifdef CONFIG_DEBUG_FS
++ void (*debugfs_init)(struct regmap *map);
++#endif
+ int (*read)(struct regmap *map, unsigned int reg, unsigned int *value);
+ int (*write)(struct regmap *map, unsigned int reg, unsigned int value);
+ int (*sync)(struct regmap *map, unsigned int min, unsigned int max);
+diff --git a/drivers/base/regmap/regcache-rbtree.c b/drivers/base/regmap/regcache-rbtree.c
+index 6a7e4fa12854..f3e8fe0cc650 100644
+--- a/drivers/base/regmap/regcache-rbtree.c
++++ b/drivers/base/regmap/regcache-rbtree.c
+@@ -194,10 +194,6 @@ static void rbtree_debugfs_init(struct regmap *map)
+ {
+ debugfs_create_file("rbtree", 0400, map->debugfs, map, &rbtree_fops);
+ }
+-#else
+-static void rbtree_debugfs_init(struct regmap *map)
+-{
+-}
+ #endif
+
+ static int regcache_rbtree_init(struct regmap *map)
+@@ -222,8 +218,6 @@ static int regcache_rbtree_init(struct regmap *map)
+ goto err;
+ }
+
+- rbtree_debugfs_init(map);
+-
+ return 0;
+
+ err:
+@@ -532,6 +526,9 @@ struct regcache_ops regcache_rbtree_ops = {
+ .name = "rbtree",
+ .init = regcache_rbtree_init,
+ .exit = regcache_rbtree_exit,
++#ifdef CONFIG_DEBUG_FS
++ .debugfs_init = rbtree_debugfs_init,
++#endif
+ .read = regcache_rbtree_read,
+ .write = regcache_rbtree_write,
+ .sync = regcache_rbtree_sync,
+diff --git a/drivers/base/regmap/regcache.c b/drivers/base/regmap/regcache.c
+index 29b4128da0b0..5617da6dc898 100644
+--- a/drivers/base/regmap/regcache.c
++++ b/drivers/base/regmap/regcache.c
+@@ -698,7 +698,7 @@ int regcache_sync_block(struct regmap *map, void *block,
+ unsigned int block_base, unsigned int start,
+ unsigned int end)
+ {
+- if (regmap_can_raw_write(map))
++ if (regmap_can_raw_write(map) && !map->use_single_rw)
+ return regcache_sync_block_raw(map, block, cache_present,
+ block_base, start, end);
+ else
+diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
+index 45d812c0ea77..65ea7b256b3e 100644
+--- a/drivers/base/regmap/regmap-debugfs.c
++++ b/drivers/base/regmap/regmap-debugfs.c
+@@ -538,6 +538,9 @@ void regmap_debugfs_init(struct regmap *map, const char *name)
+
+ next = rb_next(&range_node->node);
+ }
++
++ if (map->cache_ops && map->cache_ops->debugfs_init)
++ map->cache_ops->debugfs_init(map);
+ }
+
+ void regmap_debugfs_exit(struct regmap *map)
+diff --git a/drivers/base/regmap/regmap.c b/drivers/base/regmap/regmap.c
+index 74d8c0672cf6..283644e5d31f 100644
+--- a/drivers/base/regmap/regmap.c
++++ b/drivers/base/regmap/regmap.c
+@@ -109,7 +109,7 @@ bool regmap_readable(struct regmap *map, unsigned int reg)
+
+ bool regmap_volatile(struct regmap *map, unsigned int reg)
+ {
+- if (!regmap_readable(map, reg))
++ if (!map->format.format_write && !regmap_readable(map, reg))
+ return false;
+
+ if (map->volatile_reg)
+diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
+index c4419ea1ab07..2a451b14b3cc 100644
+--- a/drivers/char/hw_random/core.c
++++ b/drivers/char/hw_random/core.c
+@@ -68,12 +68,6 @@ static void add_early_randomness(struct hwrng *rng)
+ unsigned char bytes[16];
+ int bytes_read;
+
+- /*
+- * Currently only virtio-rng cannot return data during device
+- * probe, and that's handled in virtio-rng.c itself. If there
+- * are more such devices, this call to rng_get_data can be
+- * made conditional here instead of doing it per-device.
+- */
+ bytes_read = rng_get_data(rng, bytes, sizeof(bytes), 1);
+ if (bytes_read > 0)
+ add_device_randomness(bytes, bytes_read);
+diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
+index e9b15bc18b4d..f1aa13b21f74 100644
+--- a/drivers/char/hw_random/virtio-rng.c
++++ b/drivers/char/hw_random/virtio-rng.c
+@@ -36,9 +36,9 @@ struct virtrng_info {
+ bool busy;
+ char name[25];
+ int index;
++ bool hwrng_register_done;
+ };
+
+-static bool probe_done;
+
+ static void random_recv_done(struct virtqueue *vq)
+ {
+@@ -69,13 +69,6 @@ static int virtio_read(struct hwrng *rng, void *buf, size_t size, bool wait)
+ int ret;
+ struct virtrng_info *vi = (struct virtrng_info *)rng->priv;
+
+- /*
+- * Don't ask host for data till we're setup. This call can
+- * happen during hwrng_register(), after commit d9e7972619.
+- */
+- if (unlikely(!probe_done))
+- return 0;
+-
+ if (!vi->busy) {
+ vi->busy = true;
+ init_completion(&vi->have_data);
+@@ -137,25 +130,17 @@ static int probe_common(struct virtio_device *vdev)
+ return err;
+ }
+
+- err = hwrng_register(&vi->hwrng);
+- if (err) {
+- vdev->config->del_vqs(vdev);
+- vi->vq = NULL;
+- kfree(vi);
+- ida_simple_remove(&rng_index_ida, index);
+- return err;
+- }
+-
+- probe_done = true;
+ return 0;
+ }
+
+ static void remove_common(struct virtio_device *vdev)
+ {
+ struct virtrng_info *vi = vdev->priv;
++
+ vdev->config->reset(vdev);
+ vi->busy = false;
+- hwrng_unregister(&vi->hwrng);
++ if (vi->hwrng_register_done)
++ hwrng_unregister(&vi->hwrng);
+ vdev->config->del_vqs(vdev);
+ ida_simple_remove(&rng_index_ida, vi->index);
+ kfree(vi);
+@@ -171,6 +156,16 @@ static void virtrng_remove(struct virtio_device *vdev)
+ remove_common(vdev);
+ }
+
++static void virtrng_scan(struct virtio_device *vdev)
++{
++ struct virtrng_info *vi = vdev->priv;
++ int err;
++
++ err = hwrng_register(&vi->hwrng);
++ if (!err)
++ vi->hwrng_register_done = true;
++}
++
+ #ifdef CONFIG_PM_SLEEP
+ static int virtrng_freeze(struct virtio_device *vdev)
+ {
+@@ -195,6 +190,7 @@ static struct virtio_driver virtio_rng_driver = {
+ .id_table = id_table,
+ .probe = virtrng_probe,
+ .remove = virtrng_remove,
++ .scan = virtrng_scan,
+ #ifdef CONFIG_PM_SLEEP
+ .freeze = virtrng_freeze,
+ .restore = virtrng_restore,
+diff --git a/drivers/clk/clk.c b/drivers/clk/clk.c
+index 8b73edef151d..4cc83ef7ef61 100644
+--- a/drivers/clk/clk.c
++++ b/drivers/clk/clk.c
+@@ -1495,6 +1495,7 @@ static struct clk *clk_propagate_rate_change(struct clk *clk, unsigned long even
+ static void clk_change_rate(struct clk *clk)
+ {
+ struct clk *child;
++ struct hlist_node *tmp;
+ unsigned long old_rate;
+ unsigned long best_parent_rate = 0;
+ bool skip_set_rate = false;
+@@ -1530,7 +1531,11 @@ static void clk_change_rate(struct clk *clk)
+ if (clk->notifier_count && old_rate != clk->rate)
+ __clk_notify(clk, POST_RATE_CHANGE, old_rate, clk->rate);
+
+- hlist_for_each_entry(child, &clk->children, child_node) {
++ /*
++ * Use safe iteration, as change_rate can actually swap parents
++ * for certain clock types.
++ */
++ hlist_for_each_entry_safe(child, tmp, &clk->children, child_node) {
+ /* Skip children who will be reparented to another clock */
+ if (child->new_parent && child->new_parent != clk)
+ continue;
+diff --git a/drivers/clk/qcom/common.c b/drivers/clk/qcom/common.c
+index 9b5a1cfc6b91..eeb3eea01f4c 100644
+--- a/drivers/clk/qcom/common.c
++++ b/drivers/clk/qcom/common.c
+@@ -27,30 +27,35 @@ struct qcom_cc {
+ struct clk *clks[];
+ };
+
+-int qcom_cc_probe(struct platform_device *pdev, const struct qcom_cc_desc *desc)
++struct regmap *
++qcom_cc_map(struct platform_device *pdev, const struct qcom_cc_desc *desc)
+ {
+ void __iomem *base;
+ struct resource *res;
++ struct device *dev = &pdev->dev;
++
++ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
++ base = devm_ioremap_resource(dev, res);
++ if (IS_ERR(base))
++ return ERR_CAST(base);
++
++ return devm_regmap_init_mmio(dev, base, desc->config);
++}
++EXPORT_SYMBOL_GPL(qcom_cc_map);
++
++int qcom_cc_really_probe(struct platform_device *pdev,
++ const struct qcom_cc_desc *desc, struct regmap *regmap)
++{
+ int i, ret;
+ struct device *dev = &pdev->dev;
+ struct clk *clk;
+ struct clk_onecell_data *data;
+ struct clk **clks;
+- struct regmap *regmap;
+ struct qcom_reset_controller *reset;
+ struct qcom_cc *cc;
+ size_t num_clks = desc->num_clks;
+ struct clk_regmap **rclks = desc->clks;
+
+- res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+- base = devm_ioremap_resource(dev, res);
+- if (IS_ERR(base))
+- return PTR_ERR(base);
+-
+- regmap = devm_regmap_init_mmio(dev, base, desc->config);
+- if (IS_ERR(regmap))
+- return PTR_ERR(regmap);
+-
+ cc = devm_kzalloc(dev, sizeof(*cc) + sizeof(*clks) * num_clks,
+ GFP_KERNEL);
+ if (!cc)
+@@ -91,6 +96,18 @@ int qcom_cc_probe(struct platform_device *pdev, const struct qcom_cc_desc *desc)
+
+ return ret;
+ }
++EXPORT_SYMBOL_GPL(qcom_cc_really_probe);
++
++int qcom_cc_probe(struct platform_device *pdev, const struct qcom_cc_desc *desc)
++{
++ struct regmap *regmap;
++
++ regmap = qcom_cc_map(pdev, desc);
++ if (IS_ERR(regmap))
++ return PTR_ERR(regmap);
++
++ return qcom_cc_really_probe(pdev, desc, regmap);
++}
+ EXPORT_SYMBOL_GPL(qcom_cc_probe);
+
+ void qcom_cc_remove(struct platform_device *pdev)
+diff --git a/drivers/clk/qcom/common.h b/drivers/clk/qcom/common.h
+index 2c3cfc860348..2765e9d3da97 100644
+--- a/drivers/clk/qcom/common.h
++++ b/drivers/clk/qcom/common.h
+@@ -17,6 +17,7 @@ struct platform_device;
+ struct regmap_config;
+ struct clk_regmap;
+ struct qcom_reset_map;
++struct regmap;
+
+ struct qcom_cc_desc {
+ const struct regmap_config *config;
+@@ -26,6 +27,11 @@ struct qcom_cc_desc {
+ size_t num_resets;
+ };
+
++extern struct regmap *qcom_cc_map(struct platform_device *pdev,
++ const struct qcom_cc_desc *desc);
++extern int qcom_cc_really_probe(struct platform_device *pdev,
++ const struct qcom_cc_desc *desc,
++ struct regmap *regmap);
+ extern int qcom_cc_probe(struct platform_device *pdev,
+ const struct qcom_cc_desc *desc);
+
+diff --git a/drivers/clk/qcom/mmcc-msm8960.c b/drivers/clk/qcom/mmcc-msm8960.c
+index 4c449b3170f6..9bf6d925dd1a 100644
+--- a/drivers/clk/qcom/mmcc-msm8960.c
++++ b/drivers/clk/qcom/mmcc-msm8960.c
+@@ -38,6 +38,8 @@
+ #define P_PLL2 2
+ #define P_PLL3 3
+
++#define F_MN(f, s, _m, _n) { .freq = f, .src = s, .m = _m, .n = _n }
++
+ static u8 mmcc_pxo_pll8_pll2_map[] = {
+ [P_PXO] = 0,
+ [P_PLL8] = 2,
+@@ -59,8 +61,8 @@ static u8 mmcc_pxo_pll8_pll2_pll3_map[] = {
+
+ static const char *mmcc_pxo_pll8_pll2_pll3[] = {
+ "pxo",
+- "pll2",
+ "pll8_vote",
++ "pll2",
+ "pll3",
+ };
+
+@@ -710,18 +712,18 @@ static struct clk_branch csiphy2_timer_clk = {
+ };
+
+ static struct freq_tbl clk_tbl_gfx2d[] = {
+- { 27000000, P_PXO, 1, 0 },
+- { 48000000, P_PLL8, 1, 8 },
+- { 54857000, P_PLL8, 1, 7 },
+- { 64000000, P_PLL8, 1, 6 },
+- { 76800000, P_PLL8, 1, 5 },
+- { 96000000, P_PLL8, 1, 4 },
+- { 128000000, P_PLL8, 1, 3 },
+- { 145455000, P_PLL2, 2, 11 },
+- { 160000000, P_PLL2, 1, 5 },
+- { 177778000, P_PLL2, 2, 9 },
+- { 200000000, P_PLL2, 1, 4 },
+- { 228571000, P_PLL2, 2, 7 },
++ F_MN( 27000000, P_PXO, 1, 0),
++ F_MN( 48000000, P_PLL8, 1, 8),
++ F_MN( 54857000, P_PLL8, 1, 7),
++ F_MN( 64000000, P_PLL8, 1, 6),
++ F_MN( 76800000, P_PLL8, 1, 5),
++ F_MN( 96000000, P_PLL8, 1, 4),
++ F_MN(128000000, P_PLL8, 1, 3),
++ F_MN(145455000, P_PLL2, 2, 11),
++ F_MN(160000000, P_PLL2, 1, 5),
++ F_MN(177778000, P_PLL2, 2, 9),
++ F_MN(200000000, P_PLL2, 1, 4),
++ F_MN(228571000, P_PLL2, 2, 7),
+ { }
+ };
+
+@@ -842,22 +844,22 @@ static struct clk_branch gfx2d1_clk = {
+ };
+
+ static struct freq_tbl clk_tbl_gfx3d[] = {
+- { 27000000, P_PXO, 1, 0 },
+- { 48000000, P_PLL8, 1, 8 },
+- { 54857000, P_PLL8, 1, 7 },
+- { 64000000, P_PLL8, 1, 6 },
+- { 76800000, P_PLL8, 1, 5 },
+- { 96000000, P_PLL8, 1, 4 },
+- { 128000000, P_PLL8, 1, 3 },
+- { 145455000, P_PLL2, 2, 11 },
+- { 160000000, P_PLL2, 1, 5 },
+- { 177778000, P_PLL2, 2, 9 },
+- { 200000000, P_PLL2, 1, 4 },
+- { 228571000, P_PLL2, 2, 7 },
+- { 266667000, P_PLL2, 1, 3 },
+- { 300000000, P_PLL3, 1, 4 },
+- { 320000000, P_PLL2, 2, 5 },
+- { 400000000, P_PLL2, 1, 2 },
++ F_MN( 27000000, P_PXO, 1, 0),
++ F_MN( 48000000, P_PLL8, 1, 8),
++ F_MN( 54857000, P_PLL8, 1, 7),
++ F_MN( 64000000, P_PLL8, 1, 6),
++ F_MN( 76800000, P_PLL8, 1, 5),
++ F_MN( 96000000, P_PLL8, 1, 4),
++ F_MN(128000000, P_PLL8, 1, 3),
++ F_MN(145455000, P_PLL2, 2, 11),
++ F_MN(160000000, P_PLL2, 1, 5),
++ F_MN(177778000, P_PLL2, 2, 9),
++ F_MN(200000000, P_PLL2, 1, 4),
++ F_MN(228571000, P_PLL2, 2, 7),
++ F_MN(266667000, P_PLL2, 1, 3),
++ F_MN(300000000, P_PLL3, 1, 4),
++ F_MN(320000000, P_PLL2, 2, 5),
++ F_MN(400000000, P_PLL2, 1, 2),
+ { }
+ };
+
+@@ -897,7 +899,7 @@ static struct clk_dyn_rcg gfx3d_src = {
+ .hw.init = &(struct clk_init_data){
+ .name = "gfx3d_src",
+ .parent_names = mmcc_pxo_pll8_pll2_pll3,
+- .num_parents = 3,
++ .num_parents = 4,
+ .ops = &clk_dyn_rcg_ops,
+ },
+ },
+@@ -995,7 +997,7 @@ static struct clk_rcg jpegd_src = {
+ .ns_reg = 0x00ac,
+ .p = {
+ .pre_div_shift = 12,
+- .pre_div_width = 2,
++ .pre_div_width = 4,
+ },
+ .s = {
+ .src_sel_shift = 0,
+@@ -1115,7 +1117,7 @@ static struct clk_branch mdp_lut_clk = {
+ .enable_reg = 0x016c,
+ .enable_mask = BIT(0),
+ .hw.init = &(struct clk_init_data){
+- .parent_names = (const char *[]){ "mdp_clk" },
++ .parent_names = (const char *[]){ "mdp_src" },
+ .num_parents = 1,
+ .name = "mdp_lut_clk",
+ .ops = &clk_branch_ops,
+@@ -1342,15 +1344,15 @@ static struct clk_branch hdmi_app_clk = {
+ };
+
+ static struct freq_tbl clk_tbl_vcodec[] = {
+- { 27000000, P_PXO, 1, 0 },
+- { 32000000, P_PLL8, 1, 12 },
+- { 48000000, P_PLL8, 1, 8 },
+- { 54860000, P_PLL8, 1, 7 },
+- { 96000000, P_PLL8, 1, 4 },
+- { 133330000, P_PLL2, 1, 6 },
+- { 200000000, P_PLL2, 1, 4 },
+- { 228570000, P_PLL2, 2, 7 },
+- { 266670000, P_PLL2, 1, 3 },
++ F_MN( 27000000, P_PXO, 1, 0),
++ F_MN( 32000000, P_PLL8, 1, 12),
++ F_MN( 48000000, P_PLL8, 1, 8),
++ F_MN( 54860000, P_PLL8, 1, 7),
++ F_MN( 96000000, P_PLL8, 1, 4),
++ F_MN(133330000, P_PLL2, 1, 6),
++ F_MN(200000000, P_PLL2, 1, 4),
++ F_MN(228570000, P_PLL2, 2, 7),
++ F_MN(266670000, P_PLL2, 1, 3),
+ { }
+ };
+
+diff --git a/drivers/clk/qcom/mmcc-msm8974.c b/drivers/clk/qcom/mmcc-msm8974.c
+index c65b90515872..bc8f519c47aa 100644
+--- a/drivers/clk/qcom/mmcc-msm8974.c
++++ b/drivers/clk/qcom/mmcc-msm8974.c
+@@ -2547,18 +2547,16 @@ MODULE_DEVICE_TABLE(of, mmcc_msm8974_match_table);
+
+ static int mmcc_msm8974_probe(struct platform_device *pdev)
+ {
+- int ret;
+ struct regmap *regmap;
+
+- ret = qcom_cc_probe(pdev, &mmcc_msm8974_desc);
+- if (ret)
+- return ret;
++ regmap = qcom_cc_map(pdev, &mmcc_msm8974_desc);
++ if (IS_ERR(regmap))
++ return PTR_ERR(regmap);
+
+- regmap = dev_get_regmap(&pdev->dev, NULL);
+ clk_pll_configure_sr_hpm_lp(&mmpll1, regmap, &mmpll1_config, true);
+ clk_pll_configure_sr_hpm_lp(&mmpll3, regmap, &mmpll3_config, false);
+
+- return 0;
++ return qcom_cc_really_probe(pdev, &mmcc_msm8974_desc, regmap);
+ }
+
+ static int mmcc_msm8974_remove(struct platform_device *pdev)
+diff --git a/drivers/clk/ti/clk-dra7-atl.c b/drivers/clk/ti/clk-dra7-atl.c
+index 4a65b410e4d5..af29359677da 100644
+--- a/drivers/clk/ti/clk-dra7-atl.c
++++ b/drivers/clk/ti/clk-dra7-atl.c
+@@ -139,9 +139,13 @@ static long atl_clk_round_rate(struct clk_hw *hw, unsigned long rate,
+ static int atl_clk_set_rate(struct clk_hw *hw, unsigned long rate,
+ unsigned long parent_rate)
+ {
+- struct dra7_atl_desc *cdesc = to_atl_desc(hw);
++ struct dra7_atl_desc *cdesc;
+ u32 divider;
+
++ if (!hw || !rate)
++ return -EINVAL;
++
++ cdesc = to_atl_desc(hw);
+ divider = ((parent_rate + rate / 2) / rate) - 1;
+ if (divider > DRA7_ATL_DIVIDER_MASK)
+ divider = DRA7_ATL_DIVIDER_MASK;
+diff --git a/drivers/clk/ti/divider.c b/drivers/clk/ti/divider.c
+index e6aa10db7bba..a837f703be65 100644
+--- a/drivers/clk/ti/divider.c
++++ b/drivers/clk/ti/divider.c
+@@ -211,11 +211,16 @@ static long ti_clk_divider_round_rate(struct clk_hw *hw, unsigned long rate,
+ static int ti_clk_divider_set_rate(struct clk_hw *hw, unsigned long rate,
+ unsigned long parent_rate)
+ {
+- struct clk_divider *divider = to_clk_divider(hw);
++ struct clk_divider *divider;
+ unsigned int div, value;
+ unsigned long flags = 0;
+ u32 val;
+
++ if (!hw || !rate)
++ return -EINVAL;
++
++ divider = to_clk_divider(hw);
++
+ div = DIV_ROUND_UP(parent_rate, rate);
+ value = _get_val(divider, div);
+
+diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
+index 6f024852c6fb..21ab8bcd4d20 100644
+--- a/drivers/cpufreq/cpufreq.c
++++ b/drivers/cpufreq/cpufreq.c
+@@ -1279,6 +1279,8 @@ err_get_freq:
+ per_cpu(cpufreq_cpu_data, j) = NULL;
+ write_unlock_irqrestore(&cpufreq_driver_lock, flags);
+
++ up_write(&policy->rwsem);
++
+ if (cpufreq_driver->exit)
+ cpufreq_driver->exit(policy);
+ err_set_policy_cpu:
+@@ -1665,7 +1667,7 @@ void cpufreq_suspend(void)
+ return;
+
+ if (!has_target())
+- return;
++ goto suspend;
+
+ pr_debug("%s: Suspending Governors\n", __func__);
+
+@@ -1679,6 +1681,7 @@ void cpufreq_suspend(void)
+ policy);
+ }
+
++suspend:
+ cpufreq_suspended = true;
+ }
+
+@@ -1695,13 +1698,13 @@ void cpufreq_resume(void)
+ if (!cpufreq_driver)
+ return;
+
++ cpufreq_suspended = false;
++
+ if (!has_target())
+ return;
+
+ pr_debug("%s: Resuming Governors\n", __func__);
+
+- cpufreq_suspended = false;
+-
+ list_for_each_entry(policy, &cpufreq_policy_list, policy_list) {
+ if (cpufreq_driver->resume && cpufreq_driver->resume(policy))
+ pr_err("%s: Failed to resume driver: %p\n", __func__,
+diff --git a/drivers/cpufreq/cpufreq_opp.c b/drivers/cpufreq/cpufreq_opp.c
+index c0c6f4a4eccf..f7a32d2326c6 100644
+--- a/drivers/cpufreq/cpufreq_opp.c
++++ b/drivers/cpufreq/cpufreq_opp.c
+@@ -60,7 +60,7 @@ int dev_pm_opp_init_cpufreq_table(struct device *dev,
+ goto out;
+ }
+
+- freq_table = kzalloc(sizeof(*freq_table) * (max_opps + 1), GFP_KERNEL);
++ freq_table = kcalloc(sizeof(*freq_table), (max_opps + 1), GFP_ATOMIC);
+ if (!freq_table) {
+ ret = -ENOMEM;
+ goto out;
+diff --git a/drivers/crypto/ccp/ccp-crypto-main.c b/drivers/crypto/ccp/ccp-crypto-main.c
+index 20dc848481e7..4d4e016d755b 100644
+--- a/drivers/crypto/ccp/ccp-crypto-main.c
++++ b/drivers/crypto/ccp/ccp-crypto-main.c
+@@ -367,6 +367,10 @@ static int ccp_crypto_init(void)
+ {
+ int ret;
+
++ ret = ccp_present();
++ if (ret)
++ return ret;
++
+ spin_lock_init(&req_queue_lock);
+ INIT_LIST_HEAD(&req_queue.cmds);
+ req_queue.backlog = &req_queue.cmds;
+diff --git a/drivers/crypto/ccp/ccp-dev.c b/drivers/crypto/ccp/ccp-dev.c
+index 2c7816149b01..c08151eb54c1 100644
+--- a/drivers/crypto/ccp/ccp-dev.c
++++ b/drivers/crypto/ccp/ccp-dev.c
+@@ -53,6 +53,20 @@ static inline void ccp_del_device(struct ccp_device *ccp)
+ }
+
+ /**
++ * ccp_present - check if a CCP device is present
++ *
++ * Returns zero if a CCP device is present, -ENODEV otherwise.
++ */
++int ccp_present(void)
++{
++ if (ccp_get_device())
++ return 0;
++
++ return -ENODEV;
++}
++EXPORT_SYMBOL_GPL(ccp_present);
++
++/**
+ * ccp_enqueue_cmd - queue an operation for processing by the CCP
+ *
+ * @cmd: ccp_cmd struct to be processed
+diff --git a/drivers/dma/TODO b/drivers/dma/TODO
+index 734ed0206cd5..b8045cd42ee1 100644
+--- a/drivers/dma/TODO
++++ b/drivers/dma/TODO
+@@ -7,7 +7,6 @@ TODO for slave dma
+ - imx-dma
+ - imx-sdma
+ - mxs-dma.c
+- - dw_dmac
+ - intel_mid_dma
+ 4. Check other subsystems for dma drivers and merge/move to dmaengine
+ 5. Remove dma_slave_config's dma direction.
+diff --git a/drivers/dma/dw/core.c b/drivers/dma/dw/core.c
+index a27ded53ab4f..525b4654bd90 100644
+--- a/drivers/dma/dw/core.c
++++ b/drivers/dma/dw/core.c
+@@ -279,6 +279,15 @@ static void dwc_dostart(struct dw_dma_chan *dwc, struct dw_desc *first)
+ channel_set_bit(dw, CH_EN, dwc->mask);
+ }
+
++static void dwc_dostart_first_queued(struct dw_dma_chan *dwc)
++{
++ if (list_empty(&dwc->queue))
++ return;
++
++ list_move(dwc->queue.next, &dwc->active_list);
++ dwc_dostart(dwc, dwc_first_active(dwc));
++}
++
+ /*----------------------------------------------------------------------*/
+
+ static void
+@@ -335,10 +344,7 @@ static void dwc_complete_all(struct dw_dma *dw, struct dw_dma_chan *dwc)
+ * the completed ones.
+ */
+ list_splice_init(&dwc->active_list, &list);
+- if (!list_empty(&dwc->queue)) {
+- list_move(dwc->queue.next, &dwc->active_list);
+- dwc_dostart(dwc, dwc_first_active(dwc));
+- }
++ dwc_dostart_first_queued(dwc);
+
+ spin_unlock_irqrestore(&dwc->lock, flags);
+
+@@ -467,10 +473,7 @@ static void dwc_scan_descriptors(struct dw_dma *dw, struct dw_dma_chan *dwc)
+ /* Try to continue after resetting the channel... */
+ dwc_chan_disable(dw, dwc);
+
+- if (!list_empty(&dwc->queue)) {
+- list_move(dwc->queue.next, &dwc->active_list);
+- dwc_dostart(dwc, dwc_first_active(dwc));
+- }
++ dwc_dostart_first_queued(dwc);
+ spin_unlock_irqrestore(&dwc->lock, flags);
+ }
+
+@@ -677,17 +680,9 @@ static dma_cookie_t dwc_tx_submit(struct dma_async_tx_descriptor *tx)
+ * possible, perhaps even appending to those already submitted
+ * for DMA. But this is hard to do in a race-free manner.
+ */
+- if (list_empty(&dwc->active_list)) {
+- dev_vdbg(chan2dev(tx->chan), "%s: started %u\n", __func__,
+- desc->txd.cookie);
+- list_add_tail(&desc->desc_node, &dwc->active_list);
+- dwc_dostart(dwc, dwc_first_active(dwc));
+- } else {
+- dev_vdbg(chan2dev(tx->chan), "%s: queued %u\n", __func__,
+- desc->txd.cookie);
+
+- list_add_tail(&desc->desc_node, &dwc->queue);
+- }
++ dev_vdbg(chan2dev(tx->chan), "%s: queued %u\n", __func__, desc->txd.cookie);
++ list_add_tail(&desc->desc_node, &dwc->queue);
+
+ spin_unlock_irqrestore(&dwc->lock, flags);
+
+@@ -1092,9 +1087,12 @@ dwc_tx_status(struct dma_chan *chan,
+ static void dwc_issue_pending(struct dma_chan *chan)
+ {
+ struct dw_dma_chan *dwc = to_dw_dma_chan(chan);
++ unsigned long flags;
+
+- if (!list_empty(&dwc->queue))
+- dwc_scan_descriptors(to_dw_dma(chan->device), dwc);
++ spin_lock_irqsave(&dwc->lock, flags);
++ if (list_empty(&dwc->active_list))
++ dwc_dostart_first_queued(dwc);
++ spin_unlock_irqrestore(&dwc->lock, flags);
+ }
+
+ static int dwc_alloc_chan_resources(struct dma_chan *chan)
+diff --git a/drivers/gpio/gpiolib-acpi.c b/drivers/gpio/gpiolib-acpi.c
+index 4a987917c186..86608585ec00 100644
+--- a/drivers/gpio/gpiolib-acpi.c
++++ b/drivers/gpio/gpiolib-acpi.c
+@@ -357,8 +357,10 @@ acpi_gpio_adr_space_handler(u32 function, acpi_physical_address address,
+ struct gpio_chip *chip = achip->chip;
+ struct acpi_resource_gpio *agpio;
+ struct acpi_resource *ares;
++ int pin_index = (int)address;
+ acpi_status status;
+ bool pull_up;
++ int length;
+ int i;
+
+ status = acpi_buffer_to_resource(achip->conn_info.connection,
+@@ -380,7 +382,8 @@ acpi_gpio_adr_space_handler(u32 function, acpi_physical_address address,
+ return AE_BAD_PARAMETER;
+ }
+
+- for (i = 0; i < agpio->pin_table_length; i++) {
++ length = min(agpio->pin_table_length, (u16)(pin_index + bits));
++ for (i = pin_index; i < length; ++i) {
+ unsigned pin = agpio->pin_table[i];
+ struct acpi_gpio_connection *conn;
+ struct gpio_desc *desc;
+diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
+index 2ebc9071e354..810c84fd00c4 100644
+--- a/drivers/gpio/gpiolib.c
++++ b/drivers/gpio/gpiolib.c
+@@ -1368,12 +1368,12 @@ void gpiochip_set_chained_irqchip(struct gpio_chip *gpiochip,
+ return;
+ }
+
+- irq_set_chained_handler(parent_irq, parent_handler);
+ /*
+ * The parent irqchip is already using the chip_data for this
+ * irqchip, so our callbacks simply use the handler_data.
+ */
+ irq_set_handler_data(parent_irq, gpiochip);
++ irq_set_chained_handler(parent_irq, parent_handler);
+ }
+ EXPORT_SYMBOL_GPL(gpiochip_set_chained_irqchip);
+
+diff --git a/drivers/gpu/drm/ast/ast_main.c b/drivers/gpu/drm/ast/ast_main.c
+index a2cc6be97983..b792194e0d9c 100644
+--- a/drivers/gpu/drm/ast/ast_main.c
++++ b/drivers/gpu/drm/ast/ast_main.c
+@@ -67,6 +67,7 @@ static int ast_detect_chip(struct drm_device *dev)
+ {
+ struct ast_private *ast = dev->dev_private;
+ uint32_t data, jreg;
++ ast_open_key(ast);
+
+ if (dev->pdev->device == PCI_CHIP_AST1180) {
+ ast->chip = AST1100;
+@@ -104,7 +105,7 @@ static int ast_detect_chip(struct drm_device *dev)
+ }
+ ast->vga2_clone = false;
+ } else {
+- ast->chip = 2000;
++ ast->chip = AST2000;
+ DRM_INFO("AST 2000 detected\n");
+ }
+ }
+diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
+index 9d7954366bd2..fa9764a2e080 100644
+--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
++++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
+@@ -706,11 +706,13 @@ int i915_cmd_parser_init_ring(struct intel_engine_cs *ring)
+ BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
+ BUG_ON(!validate_regs_sorted(ring));
+
+- ret = init_hash_table(ring, cmd_tables, cmd_table_count);
+- if (ret) {
+- DRM_ERROR("CMD: cmd_parser_init failed!\n");
+- fini_hash_table(ring);
+- return ret;
++ if (hash_empty(ring->cmd_hash)) {
++ ret = init_hash_table(ring, cmd_tables, cmd_table_count);
++ if (ret) {
++ DRM_ERROR("CMD: cmd_parser_init failed!\n");
++ fini_hash_table(ring);
++ return ret;
++ }
+ }
+
+ ring->needs_cmd_parser = true;
+diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
+index d893e4da5dce..ef3b4798da02 100644
+--- a/drivers/gpu/drm/i915/i915_gem.c
++++ b/drivers/gpu/drm/i915/i915_gem.c
+@@ -1576,10 +1576,13 @@ unlock:
+ out:
+ switch (ret) {
+ case -EIO:
+- /* If this -EIO is due to a gpu hang, give the reset code a
+- * chance to clean up the mess. Otherwise return the proper
+- * SIGBUS. */
+- if (i915_terminally_wedged(&dev_priv->gpu_error)) {
++ /*
++ * We eat errors when the gpu is terminally wedged to avoid
++ * userspace unduly crashing (gl has no provisions for mmaps to
++ * fail). But any other -EIO isn't ours (e.g. swap in failure)
++ * and so needs to be reported.
++ */
++ if (!i915_terminally_wedged(&dev_priv->gpu_error)) {
+ ret = VM_FAULT_SIGBUS;
+ break;
+ }
+diff --git a/drivers/gpu/drm/i915/intel_bios.c b/drivers/gpu/drm/i915/intel_bios.c
+index 827498e081df..2e0a2feb4cda 100644
+--- a/drivers/gpu/drm/i915/intel_bios.c
++++ b/drivers/gpu/drm/i915/intel_bios.c
+@@ -877,7 +877,7 @@ err:
+
+ /* error during parsing so set all pointers to null
+ * because of partial parsing */
+- memset(dev_priv->vbt.dsi.sequence, 0, MIPI_SEQ_MAX);
++ memset(dev_priv->vbt.dsi.sequence, 0, sizeof(dev_priv->vbt.dsi.sequence));
+ }
+
+ static void parse_ddi_port(struct drm_i915_private *dev_priv, enum port port,
+@@ -1122,7 +1122,7 @@ init_vbt_defaults(struct drm_i915_private *dev_priv)
+ }
+ }
+
+-static int __init intel_no_opregion_vbt_callback(const struct dmi_system_id *id)
++static int intel_no_opregion_vbt_callback(const struct dmi_system_id *id)
+ {
+ DRM_DEBUG_KMS("Falling back to manually reading VBT from "
+ "VBIOS ROM for %s\n",
+diff --git a/drivers/gpu/drm/i915/intel_crt.c b/drivers/gpu/drm/i915/intel_crt.c
+index 5a045d3bd77e..3e1edbfa8e07 100644
+--- a/drivers/gpu/drm/i915/intel_crt.c
++++ b/drivers/gpu/drm/i915/intel_crt.c
+@@ -673,16 +673,21 @@ intel_crt_detect(struct drm_connector *connector, bool force)
+ goto out;
+ }
+
++ drm_modeset_acquire_init(&ctx, 0);
++
+ /* for pre-945g platforms use load detect */
+ if (intel_get_load_detect_pipe(connector, NULL, &tmp, &ctx)) {
+ if (intel_crt_detect_ddc(connector))
+ status = connector_status_connected;
+ else
+ status = intel_crt_load_detect(crt);
+- intel_release_load_detect_pipe(connector, &tmp, &ctx);
++ intel_release_load_detect_pipe(connector, &tmp);
+ } else
+ status = connector_status_unknown;
+
++ drm_modeset_drop_locks(&ctx);
++ drm_modeset_acquire_fini(&ctx);
++
+ out:
+ intel_display_power_put(dev_priv, power_domain);
+ intel_runtime_pm_put(dev_priv);
+@@ -775,7 +780,7 @@ static const struct drm_encoder_funcs intel_crt_enc_funcs = {
+ .destroy = intel_encoder_destroy,
+ };
+
+-static int __init intel_no_crt_dmi_callback(const struct dmi_system_id *id)
++static int intel_no_crt_dmi_callback(const struct dmi_system_id *id)
+ {
+ DRM_INFO("Skipping CRT initialization for %s\n", id->ident);
+ return 1;
+diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
+index f0be855ddf45..ffaf8be939f1 100644
+--- a/drivers/gpu/drm/i915/intel_display.c
++++ b/drivers/gpu/drm/i915/intel_display.c
+@@ -2200,6 +2200,15 @@ intel_pin_and_fence_fb_obj(struct drm_device *dev,
+ if (need_vtd_wa(dev) && alignment < 256 * 1024)
+ alignment = 256 * 1024;
+
++ /*
++ * Global gtt pte registers are special registers which actually forward
++ * writes to a chunk of system memory. Which means that there is no risk
++ * that the register values disappear as soon as we call
++ * intel_runtime_pm_put(), so it is correct to wrap only the
++ * pin/unpin/fence and not more.
++ */
++ intel_runtime_pm_get(dev_priv);
++
+ dev_priv->mm.interruptible = false;
+ ret = i915_gem_object_pin_to_display_plane(obj, alignment, pipelined);
+ if (ret)
+@@ -2217,12 +2226,14 @@ intel_pin_and_fence_fb_obj(struct drm_device *dev,
+ i915_gem_object_pin_fence(obj);
+
+ dev_priv->mm.interruptible = true;
++ intel_runtime_pm_put(dev_priv);
+ return 0;
+
+ err_unpin:
+ i915_gem_object_unpin_from_display_plane(obj);
+ err_interruptible:
+ dev_priv->mm.interruptible = true;
++ intel_runtime_pm_put(dev_priv);
+ return ret;
+ }
+
+@@ -8087,6 +8098,15 @@ static int intel_crtc_cursor_set(struct drm_crtc *crtc,
+ goto fail_locked;
+ }
+
++ /*
++ * Global gtt pte registers are special registers which actually
++ * forward writes to a chunk of system memory. Which means that
++ * there is no risk that the register values disappear as soon
++ * as we call intel_runtime_pm_put(), so it is correct to wrap
++ * only the pin/unpin/fence and not more.
++ */
++ intel_runtime_pm_get(dev_priv);
++
+ /* Note that the w/a also requires 2 PTE of padding following
+ * the bo. We currently fill all unused PTE with the shadow
+ * page and so we should always have valid PTE following the
+@@ -8099,16 +8119,20 @@ static int intel_crtc_cursor_set(struct drm_crtc *crtc,
+ ret = i915_gem_object_pin_to_display_plane(obj, alignment, NULL);
+ if (ret) {
+ DRM_DEBUG_KMS("failed to move cursor bo into the GTT\n");
++ intel_runtime_pm_put(dev_priv);
+ goto fail_locked;
+ }
+
+ ret = i915_gem_object_put_fence(obj);
+ if (ret) {
+ DRM_DEBUG_KMS("failed to release fence for cursor");
++ intel_runtime_pm_put(dev_priv);
+ goto fail_unpin;
+ }
+
+ addr = i915_gem_obj_ggtt_offset(obj);
++
++ intel_runtime_pm_put(dev_priv);
+ } else {
+ int align = IS_I830(dev) ? 16 * 1024 : 256;
+ ret = i915_gem_object_attach_phys(obj, align);
+@@ -8319,8 +8343,6 @@ bool intel_get_load_detect_pipe(struct drm_connector *connector,
+ connector->base.id, connector->name,
+ encoder->base.id, encoder->name);
+
+- drm_modeset_acquire_init(ctx, 0);
+-
+ retry:
+ ret = drm_modeset_lock(&config->connection_mutex, ctx);
+ if (ret)
+@@ -8359,10 +8381,14 @@ retry:
+ i++;
+ if (!(encoder->possible_crtcs & (1 << i)))
+ continue;
+- if (!possible_crtc->enabled) {
+- crtc = possible_crtc;
+- break;
+- }
++ if (possible_crtc->enabled)
++ continue;
++ /* This can occur when applying the pipe A quirk on resume. */
++ if (to_intel_crtc(possible_crtc)->new_enabled)
++ continue;
++
++ crtc = possible_crtc;
++ break;
+ }
+
+ /*
+@@ -8431,15 +8457,11 @@ fail_unlock:
+ goto retry;
+ }
+
+- drm_modeset_drop_locks(ctx);
+- drm_modeset_acquire_fini(ctx);
+-
+ return false;
+ }
+
+ void intel_release_load_detect_pipe(struct drm_connector *connector,
+- struct intel_load_detect_pipe *old,
+- struct drm_modeset_acquire_ctx *ctx)
++ struct intel_load_detect_pipe *old)
+ {
+ struct intel_encoder *intel_encoder =
+ intel_attached_encoder(connector);
+@@ -8463,17 +8485,12 @@ void intel_release_load_detect_pipe(struct drm_connector *connector,
+ drm_framebuffer_unreference(old->release_fb);
+ }
+
+- goto unlock;
+ return;
+ }
+
+ /* Switch crtc and encoder back off if necessary */
+ if (old->dpms_mode != DRM_MODE_DPMS_ON)
+ connector->funcs->dpms(connector, old->dpms_mode);
+-
+-unlock:
+- drm_modeset_drop_locks(ctx);
+- drm_modeset_acquire_fini(ctx);
+ }
+
+ static int i9xx_pll_refclk(struct drm_device *dev,
+@@ -9294,6 +9311,8 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
+
+ if (IS_VALLEYVIEW(dev)) {
+ ring = &dev_priv->ring[BCS];
++ } else if (IS_IVYBRIDGE(dev)) {
++ ring = &dev_priv->ring[BCS];
+ } else if (INTEL_INFO(dev)->gen >= 7) {
+ ring = obj->ring;
+ if (ring == NULL || ring->id != RCS)
+@@ -11671,6 +11690,9 @@ static struct intel_quirk intel_quirks[] = {
+ /* Acer C720 and C720P Chromebooks (Celeron 2955U) have backlights */
+ { 0x0a06, 0x1025, 0x0a11, quirk_backlight_present },
+
++ /* Acer C720 Chromebook (Core i3 4005U) */
++ { 0x0a16, 0x1025, 0x0a11, quirk_backlight_present },
++
+ /* Toshiba CB35 Chromebook (Celeron 2955U) */
+ { 0x0a06, 0x1179, 0x0a88, quirk_backlight_present },
+
+@@ -11840,7 +11862,7 @@ static void intel_enable_pipe_a(struct drm_device *dev)
+ struct intel_connector *connector;
+ struct drm_connector *crt = NULL;
+ struct intel_load_detect_pipe load_detect_temp;
+- struct drm_modeset_acquire_ctx ctx;
++ struct drm_modeset_acquire_ctx *ctx = dev->mode_config.acquire_ctx;
+
+ /* We can't just switch on the pipe A, we need to set things up with a
+ * proper mode and output configuration. As a gross hack, enable pipe A
+@@ -11857,10 +11879,8 @@ static void intel_enable_pipe_a(struct drm_device *dev)
+ if (!crt)
+ return;
+
+- if (intel_get_load_detect_pipe(crt, NULL, &load_detect_temp, &ctx))
+- intel_release_load_detect_pipe(crt, &load_detect_temp, &ctx);
+-
+-
++ if (intel_get_load_detect_pipe(crt, NULL, &load_detect_temp, ctx))
++ intel_release_load_detect_pipe(crt, &load_detect_temp);
+ }
+
+ static bool
+diff --git a/drivers/gpu/drm/i915/intel_dp.c b/drivers/gpu/drm/i915/intel_dp.c
+index 8a1a4fbc06ac..fbffcbb9a0f8 100644
+--- a/drivers/gpu/drm/i915/intel_dp.c
++++ b/drivers/gpu/drm/i915/intel_dp.c
+@@ -3313,6 +3313,9 @@ intel_dp_check_link_status(struct intel_dp *intel_dp)
+ if (WARN_ON(!intel_encoder->base.crtc))
+ return;
+
++ if (!to_intel_crtc(intel_encoder->base.crtc)->active)
++ return;
++
+ /* Try to read receiver status if the link appears to be up */
+ if (!intel_dp_get_link_status(intel_dp, link_status)) {
+ return;
+diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h
+index f67340ed2c12..e0f88a0669c1 100644
+--- a/drivers/gpu/drm/i915/intel_drv.h
++++ b/drivers/gpu/drm/i915/intel_drv.h
+@@ -754,8 +754,7 @@ bool intel_get_load_detect_pipe(struct drm_connector *connector,
+ struct intel_load_detect_pipe *old,
+ struct drm_modeset_acquire_ctx *ctx);
+ void intel_release_load_detect_pipe(struct drm_connector *connector,
+- struct intel_load_detect_pipe *old,
+- struct drm_modeset_acquire_ctx *ctx);
++ struct intel_load_detect_pipe *old);
+ int intel_pin_and_fence_fb_obj(struct drm_device *dev,
+ struct drm_i915_gem_object *obj,
+ struct intel_engine_cs *pipelined);
+diff --git a/drivers/gpu/drm/i915/intel_hdmi.c b/drivers/gpu/drm/i915/intel_hdmi.c
+index eee2bbec2958..057366453d27 100644
+--- a/drivers/gpu/drm/i915/intel_hdmi.c
++++ b/drivers/gpu/drm/i915/intel_hdmi.c
+@@ -728,7 +728,7 @@ static void intel_hdmi_get_config(struct intel_encoder *encoder,
+ if (tmp & HDMI_MODE_SELECT_HDMI)
+ pipe_config->has_hdmi_sink = true;
+
+- if (tmp & HDMI_MODE_SELECT_HDMI)
++ if (tmp & SDVO_AUDIO_ENABLE)
+ pipe_config->has_audio = true;
+
+ pipe_config->adjusted_mode.flags |= flags;
+diff --git a/drivers/gpu/drm/i915/intel_lvds.c b/drivers/gpu/drm/i915/intel_lvds.c
+index 5e5a72fca5fb..0fb230949f81 100644
+--- a/drivers/gpu/drm/i915/intel_lvds.c
++++ b/drivers/gpu/drm/i915/intel_lvds.c
+@@ -531,7 +531,7 @@ static const struct drm_encoder_funcs intel_lvds_enc_funcs = {
+ .destroy = intel_encoder_destroy,
+ };
+
+-static int __init intel_no_lvds_dmi_callback(const struct dmi_system_id *id)
++static int intel_no_lvds_dmi_callback(const struct dmi_system_id *id)
+ {
+ DRM_INFO("Skipping LVDS initialization for %s\n", id->ident);
+ return 1;
+diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
+index 279488addf3f..7add7eead21d 100644
+--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
++++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
+@@ -517,6 +517,9 @@ static int init_ring_common(struct intel_engine_cs *ring)
+ else
+ ring_setup_phys_status_page(ring);
+
++ /* Enforce ordering by reading HEAD register back */
++ I915_READ_HEAD(ring);
++
+ /* Initialize the ring. This must happen _after_ we've cleared the ring
+ * registers with the above sequence (the readback of the HEAD registers
+ * also enforces ordering), otherwise the hw might lose the new ring
+diff --git a/drivers/gpu/drm/i915/intel_tv.c b/drivers/gpu/drm/i915/intel_tv.c
+index 67c6c9a2eb1c..5c6f7e2417e4 100644
+--- a/drivers/gpu/drm/i915/intel_tv.c
++++ b/drivers/gpu/drm/i915/intel_tv.c
+@@ -854,6 +854,10 @@ intel_enable_tv(struct intel_encoder *encoder)
+ struct drm_device *dev = encoder->base.dev;
+ struct drm_i915_private *dev_priv = dev->dev_private;
+
++ /* Prevents vblank waits from timing out in intel_tv_detect_type() */
++ intel_wait_for_vblank(encoder->base.dev,
++ to_intel_crtc(encoder->base.crtc)->pipe);
++
+ I915_WRITE(TV_CTL, I915_READ(TV_CTL) | TV_ENC_ENABLE);
+ }
+
+@@ -1311,6 +1315,7 @@ intel_tv_detect(struct drm_connector *connector, bool force)
+ {
+ struct drm_display_mode mode;
+ struct intel_tv *intel_tv = intel_attached_tv(connector);
++ enum drm_connector_status status;
+ int type;
+
+ DRM_DEBUG_KMS("[CONNECTOR:%d:%s] force=%d\n",
+@@ -1323,16 +1328,24 @@ intel_tv_detect(struct drm_connector *connector, bool force)
+ struct intel_load_detect_pipe tmp;
+ struct drm_modeset_acquire_ctx ctx;
+
++ drm_modeset_acquire_init(&ctx, 0);
++
+ if (intel_get_load_detect_pipe(connector, &mode, &tmp, &ctx)) {
+ type = intel_tv_detect_type(intel_tv, connector);
+- intel_release_load_detect_pipe(connector, &tmp, &ctx);
++ intel_release_load_detect_pipe(connector, &tmp);
++ status = type < 0 ?
++ connector_status_disconnected :
++ connector_status_connected;
+ } else
+- return connector_status_unknown;
++ status = connector_status_unknown;
++
++ drm_modeset_drop_locks(&ctx);
++ drm_modeset_acquire_fini(&ctx);
+ } else
+ return connector->status;
+
+- if (type < 0)
+- return connector_status_disconnected;
++ if (status != connector_status_connected)
++ return status;
+
+ intel_tv->type = type;
+ intel_tv_find_better_format(connector);
+diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
+index 5425ffe3931d..594c3f54102e 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
++++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
+@@ -596,6 +596,7 @@ int nouveau_pmops_suspend(struct device *dev)
+
+ pci_save_state(pdev);
+ pci_disable_device(pdev);
++ pci_ignore_hotplug(pdev);
+ pci_set_power_state(pdev, PCI_D3hot);
+ return 0;
+ }
+diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c
+index ab0228f640a5..7e185c122750 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
++++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
+@@ -76,6 +76,7 @@ static int
+ nouveau_vram_manager_new(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct nouveau_drm *drm = nouveau_bdev(man->bdev);
+@@ -162,6 +163,7 @@ static int
+ nouveau_gart_manager_new(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct nouveau_drm *drm = nouveau_bdev(bo->bdev);
+@@ -242,6 +244,7 @@ static int
+ nv04_gart_manager_new(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct nouveau_mem *node;
+diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c b/drivers/gpu/drm/nouveau/nouveau_vga.c
+index 4f4c3fec6916..c110b2cfc3eb 100644
+--- a/drivers/gpu/drm/nouveau/nouveau_vga.c
++++ b/drivers/gpu/drm/nouveau/nouveau_vga.c
+@@ -106,7 +106,16 @@ void
+ nouveau_vga_fini(struct nouveau_drm *drm)
+ {
+ struct drm_device *dev = drm->dev;
++ bool runtime = false;
++
++ if (nouveau_runtime_pm == 1)
++ runtime = true;
++ if ((nouveau_runtime_pm == -1) && (nouveau_is_optimus() || nouveau_is_v1_dsm()))
++ runtime = true;
++
+ vga_switcheroo_unregister_client(dev->pdev);
++ if (runtime && nouveau_is_v1_dsm() && !nouveau_is_optimus())
++ vga_switcheroo_fini_domain_pm_ops(drm->dev->dev);
+ vga_client_register(dev->pdev, NULL, NULL, NULL);
+ }
+
+diff --git a/drivers/gpu/drm/radeon/ci_dpm.c b/drivers/gpu/drm/radeon/ci_dpm.c
+index 584090ac3eb9..d416bb2ff48d 100644
+--- a/drivers/gpu/drm/radeon/ci_dpm.c
++++ b/drivers/gpu/drm/radeon/ci_dpm.c
+@@ -869,6 +869,9 @@ static int ci_set_thermal_temperature_range(struct radeon_device *rdev,
+ WREG32_SMC(CG_THERMAL_CTRL, tmp);
+ #endif
+
++ rdev->pm.dpm.thermal.min_temp = low_temp;
++ rdev->pm.dpm.thermal.max_temp = high_temp;
++
+ return 0;
+ }
+
+@@ -940,7 +943,18 @@ static void ci_get_leakage_voltages(struct radeon_device *rdev)
+ pi->vddc_leakage.count = 0;
+ pi->vddci_leakage.count = 0;
+
+- if (radeon_atom_get_leakage_id_from_vbios(rdev, &leakage_id) == 0) {
++ if (rdev->pm.dpm.platform_caps & ATOM_PP_PLATFORM_CAP_EVV) {
++ for (i = 0; i < CISLANDS_MAX_LEAKAGE_COUNT; i++) {
++ virtual_voltage_id = ATOM_VIRTUAL_VOLTAGE_ID0 + i;
++ if (radeon_atom_get_voltage_evv(rdev, virtual_voltage_id, &vddc) != 0)
++ continue;
++ if (vddc != 0 && vddc != virtual_voltage_id) {
++ pi->vddc_leakage.actual_voltage[pi->vddc_leakage.count] = vddc;
++ pi->vddc_leakage.leakage_id[pi->vddc_leakage.count] = virtual_voltage_id;
++ pi->vddc_leakage.count++;
++ }
++ }
++ } else if (radeon_atom_get_leakage_id_from_vbios(rdev, &leakage_id) == 0) {
+ for (i = 0; i < CISLANDS_MAX_LEAKAGE_COUNT; i++) {
+ virtual_voltage_id = ATOM_VIRTUAL_VOLTAGE_ID0 + i;
+ if (radeon_atom_get_leakage_vddc_based_on_leakage_params(rdev, &vddc, &vddci,
+diff --git a/drivers/gpu/drm/radeon/cik.c b/drivers/gpu/drm/radeon/cik.c
+index 65a8cca603a4..5ea01de617ab 100644
+--- a/drivers/gpu/drm/radeon/cik.c
++++ b/drivers/gpu/drm/radeon/cik.c
+@@ -3259,7 +3259,7 @@ static void cik_gpu_init(struct radeon_device *rdev)
+ u32 mc_shared_chmap, mc_arb_ramcfg;
+ u32 hdp_host_path_cntl;
+ u32 tmp;
+- int i, j, k;
++ int i, j;
+
+ switch (rdev->family) {
+ case CHIP_BONAIRE:
+@@ -3449,12 +3449,11 @@ static void cik_gpu_init(struct radeon_device *rdev)
+ rdev->config.cik.max_sh_per_se,
+ rdev->config.cik.max_backends_per_se);
+
++ rdev->config.cik.active_cus = 0;
+ for (i = 0; i < rdev->config.cik.max_shader_engines; i++) {
+ for (j = 0; j < rdev->config.cik.max_sh_per_se; j++) {
+- for (k = 0; k < rdev->config.cik.max_cu_per_sh; k++) {
+- rdev->config.cik.active_cus +=
+- hweight32(cik_get_cu_active_bitmap(rdev, i, j));
+- }
++ rdev->config.cik.active_cus +=
++ hweight32(cik_get_cu_active_bitmap(rdev, i, j));
+ }
+ }
+
+@@ -4490,7 +4489,7 @@ struct bonaire_mqd
+ */
+ static int cik_cp_compute_resume(struct radeon_device *rdev)
+ {
+- int r, i, idx;
++ int r, i, j, idx;
+ u32 tmp;
+ bool use_doorbell = true;
+ u64 hqd_gpu_addr;
+@@ -4609,7 +4608,7 @@ static int cik_cp_compute_resume(struct radeon_device *rdev)
+ mqd->queue_state.cp_hqd_pq_wptr= 0;
+ if (RREG32(CP_HQD_ACTIVE) & 1) {
+ WREG32(CP_HQD_DEQUEUE_REQUEST, 1);
+- for (i = 0; i < rdev->usec_timeout; i++) {
++ for (j = 0; j < rdev->usec_timeout; j++) {
+ if (!(RREG32(CP_HQD_ACTIVE) & 1))
+ break;
+ udelay(1);
+@@ -5643,12 +5642,13 @@ static void cik_vm_decode_fault(struct radeon_device *rdev,
+ void cik_vm_flush(struct radeon_device *rdev, int ridx, struct radeon_vm *vm)
+ {
+ struct radeon_ring *ring = &rdev->ring[ridx];
++ int usepfp = (ridx == RADEON_RING_TYPE_GFX_INDEX);
+
+ if (vm == NULL)
+ return;
+
+ radeon_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
+- radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(0) |
++ radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(usepfp) |
+ WRITE_DATA_DST_SEL(0)));
+ if (vm->id < 8) {
+ radeon_ring_write(ring,
+@@ -5698,7 +5698,7 @@ void cik_vm_flush(struct radeon_device *rdev, int ridx, struct radeon_vm *vm)
+ radeon_ring_write(ring, 1 << vm->id);
+
+ /* compute doesn't have PFP */
+- if (ridx == RADEON_RING_TYPE_GFX_INDEX) {
++ if (usepfp) {
+ /* sync PFP to ME, otherwise we might get invalid PFP reads */
+ radeon_ring_write(ring, PACKET3(PACKET3_PFP_SYNC_ME, 0));
+ radeon_ring_write(ring, 0x0);
+diff --git a/drivers/gpu/drm/radeon/cik_sdma.c b/drivers/gpu/drm/radeon/cik_sdma.c
+index 8e9d0f1d858e..72bff72c036d 100644
+--- a/drivers/gpu/drm/radeon/cik_sdma.c
++++ b/drivers/gpu/drm/radeon/cik_sdma.c
+@@ -459,13 +459,6 @@ int cik_sdma_resume(struct radeon_device *rdev)
+ {
+ int r;
+
+- /* Reset dma */
+- WREG32(SRBM_SOFT_RESET, SOFT_RESET_SDMA | SOFT_RESET_SDMA1);
+- RREG32(SRBM_SOFT_RESET);
+- udelay(50);
+- WREG32(SRBM_SOFT_RESET, 0);
+- RREG32(SRBM_SOFT_RESET);
+-
+ r = cik_sdma_load_microcode(rdev);
+ if (r)
+ return r;
+diff --git a/drivers/gpu/drm/radeon/kv_dpm.c b/drivers/gpu/drm/radeon/kv_dpm.c
+index 9ef8c38f2d66..f00e6a6c254a 100644
+--- a/drivers/gpu/drm/radeon/kv_dpm.c
++++ b/drivers/gpu/drm/radeon/kv_dpm.c
+@@ -33,6 +33,8 @@
+ #define KV_MINIMUM_ENGINE_CLOCK 800
+ #define SMC_RAM_END 0x40000
+
++static int kv_enable_nb_dpm(struct radeon_device *rdev,
++ bool enable);
+ static void kv_init_graphics_levels(struct radeon_device *rdev);
+ static int kv_calculate_ds_divider(struct radeon_device *rdev);
+ static int kv_calculate_nbps_level_settings(struct radeon_device *rdev);
+@@ -1295,6 +1297,9 @@ void kv_dpm_disable(struct radeon_device *rdev)
+ {
+ kv_smc_bapm_enable(rdev, false);
+
++ if (rdev->family == CHIP_MULLINS)
++ kv_enable_nb_dpm(rdev, false);
++
+ /* powerup blocks */
+ kv_dpm_powergate_acp(rdev, false);
+ kv_dpm_powergate_samu(rdev, false);
+@@ -1438,14 +1443,14 @@ static int kv_update_uvd_dpm(struct radeon_device *rdev, bool gate)
+ return kv_enable_uvd_dpm(rdev, !gate);
+ }
+
+-static u8 kv_get_vce_boot_level(struct radeon_device *rdev)
++static u8 kv_get_vce_boot_level(struct radeon_device *rdev, u32 evclk)
+ {
+ u8 i;
+ struct radeon_vce_clock_voltage_dependency_table *table =
+ &rdev->pm.dpm.dyn_state.vce_clock_voltage_dependency_table;
+
+ for (i = 0; i < table->count; i++) {
+- if (table->entries[i].evclk >= 0) /* XXX */
++ if (table->entries[i].evclk >= evclk)
+ break;
+ }
+
+@@ -1468,7 +1473,7 @@ static int kv_update_vce_dpm(struct radeon_device *rdev,
+ if (pi->caps_stable_p_state)
+ pi->vce_boot_level = table->count - 1;
+ else
+- pi->vce_boot_level = kv_get_vce_boot_level(rdev);
++ pi->vce_boot_level = kv_get_vce_boot_level(rdev, radeon_new_state->evclk);
+
+ ret = kv_copy_bytes_to_smc(rdev,
+ pi->dpm_table_start +
+@@ -1769,15 +1774,24 @@ static int kv_update_dfs_bypass_settings(struct radeon_device *rdev,
+ return ret;
+ }
+
+-static int kv_enable_nb_dpm(struct radeon_device *rdev)
++static int kv_enable_nb_dpm(struct radeon_device *rdev,
++ bool enable)
+ {
+ struct kv_power_info *pi = kv_get_pi(rdev);
+ int ret = 0;
+
+- if (pi->enable_nb_dpm && !pi->nb_dpm_enabled) {
+- ret = kv_notify_message_to_smu(rdev, PPSMC_MSG_NBDPM_Enable);
+- if (ret == 0)
+- pi->nb_dpm_enabled = true;
++ if (enable) {
++ if (pi->enable_nb_dpm && !pi->nb_dpm_enabled) {
++ ret = kv_notify_message_to_smu(rdev, PPSMC_MSG_NBDPM_Enable);
++ if (ret == 0)
++ pi->nb_dpm_enabled = true;
++ }
++ } else {
++ if (pi->enable_nb_dpm && pi->nb_dpm_enabled) {
++ ret = kv_notify_message_to_smu(rdev, PPSMC_MSG_NBDPM_Disable);
++ if (ret == 0)
++ pi->nb_dpm_enabled = false;
++ }
+ }
+
+ return ret;
+@@ -1864,7 +1878,7 @@ int kv_dpm_set_power_state(struct radeon_device *rdev)
+ }
+ kv_update_sclk_t(rdev);
+ if (rdev->family == CHIP_MULLINS)
+- kv_enable_nb_dpm(rdev);
++ kv_enable_nb_dpm(rdev, true);
+ }
+ } else {
+ if (pi->enable_dpm) {
+@@ -1889,7 +1903,7 @@ int kv_dpm_set_power_state(struct radeon_device *rdev)
+ }
+ kv_update_acp_boot_level(rdev);
+ kv_update_sclk_t(rdev);
+- kv_enable_nb_dpm(rdev);
++ kv_enable_nb_dpm(rdev, true);
+ }
+ }
+
+diff --git a/drivers/gpu/drm/radeon/ni_dma.c b/drivers/gpu/drm/radeon/ni_dma.c
+index 6378e0276691..6927db4d8db7 100644
+--- a/drivers/gpu/drm/radeon/ni_dma.c
++++ b/drivers/gpu/drm/radeon/ni_dma.c
+@@ -191,12 +191,6 @@ int cayman_dma_resume(struct radeon_device *rdev)
+ u32 reg_offset, wb_offset;
+ int i, r;
+
+- /* Reset dma */
+- WREG32(SRBM_SOFT_RESET, SOFT_RESET_DMA | SOFT_RESET_DMA1);
+- RREG32(SRBM_SOFT_RESET);
+- udelay(50);
+- WREG32(SRBM_SOFT_RESET, 0);
+-
+ for (i = 0; i < 2; i++) {
+ if (i == 0) {
+ ring = &rdev->ring[R600_RING_TYPE_DMA_INDEX];
+diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c
+index 3c69f58e46ef..44b046b4056f 100644
+--- a/drivers/gpu/drm/radeon/r600.c
++++ b/drivers/gpu/drm/radeon/r600.c
+@@ -1813,7 +1813,6 @@ static void r600_gpu_init(struct radeon_device *rdev)
+ {
+ u32 tiling_config;
+ u32 ramcfg;
+- u32 cc_rb_backend_disable;
+ u32 cc_gc_shader_pipe_config;
+ u32 tmp;
+ int i, j;
+@@ -1940,29 +1939,20 @@ static void r600_gpu_init(struct radeon_device *rdev)
+ }
+ tiling_config |= BANK_SWAPS(1);
+
+- cc_rb_backend_disable = RREG32(CC_RB_BACKEND_DISABLE) & 0x00ff0000;
+- tmp = R6XX_MAX_BACKENDS -
+- r600_count_pipe_bits((cc_rb_backend_disable >> 16) & R6XX_MAX_BACKENDS_MASK);
+- if (tmp < rdev->config.r600.max_backends) {
+- rdev->config.r600.max_backends = tmp;
+- }
+-
+ cc_gc_shader_pipe_config = RREG32(CC_GC_SHADER_PIPE_CONFIG) & 0x00ffff00;
+- tmp = R6XX_MAX_PIPES -
+- r600_count_pipe_bits((cc_gc_shader_pipe_config >> 8) & R6XX_MAX_PIPES_MASK);
+- if (tmp < rdev->config.r600.max_pipes) {
+- rdev->config.r600.max_pipes = tmp;
+- }
+- tmp = R6XX_MAX_SIMDS -
+- r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R6XX_MAX_SIMDS_MASK);
+- if (tmp < rdev->config.r600.max_simds) {
+- rdev->config.r600.max_simds = tmp;
+- }
+ tmp = rdev->config.r600.max_simds -
+ r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R6XX_MAX_SIMDS_MASK);
+ rdev->config.r600.active_simds = tmp;
+
+ disabled_rb_mask = (RREG32(CC_RB_BACKEND_DISABLE) >> 16) & R6XX_MAX_BACKENDS_MASK;
++ tmp = 0;
++ for (i = 0; i < rdev->config.r600.max_backends; i++)
++ tmp |= (1 << i);
++ /* if all the backends are disabled, fix it up here */
++ if ((disabled_rb_mask & tmp) == tmp) {
++ for (i = 0; i < rdev->config.r600.max_backends; i++)
++ disabled_rb_mask &= ~(1 << i);
++ }
+ tmp = (tiling_config & PIPE_TILING__MASK) >> PIPE_TILING__SHIFT;
+ tmp = r6xx_remap_render_backend(rdev, tmp, rdev->config.r600.max_backends,
+ R6XX_MAX_BACKENDS, disabled_rb_mask);
+diff --git a/drivers/gpu/drm/radeon/r600_dma.c b/drivers/gpu/drm/radeon/r600_dma.c
+index 4969cef44a19..b766e052d91f 100644
+--- a/drivers/gpu/drm/radeon/r600_dma.c
++++ b/drivers/gpu/drm/radeon/r600_dma.c
+@@ -124,15 +124,6 @@ int r600_dma_resume(struct radeon_device *rdev)
+ u32 rb_bufsz;
+ int r;
+
+- /* Reset dma */
+- if (rdev->family >= CHIP_RV770)
+- WREG32(SRBM_SOFT_RESET, RV770_SOFT_RESET_DMA);
+- else
+- WREG32(SRBM_SOFT_RESET, SOFT_RESET_DMA);
+- RREG32(SRBM_SOFT_RESET);
+- udelay(50);
+- WREG32(SRBM_SOFT_RESET, 0);
+-
+ WREG32(DMA_SEM_INCOMPLETE_TIMER_CNTL, 0);
+ WREG32(DMA_SEM_WAIT_FAIL_TIMER_CNTL, 0);
+
+diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
+index 60c47f829122..2d6b55d8461e 100644
+--- a/drivers/gpu/drm/radeon/radeon.h
++++ b/drivers/gpu/drm/radeon/radeon.h
+@@ -304,6 +304,9 @@ int radeon_atom_get_leakage_vddc_based_on_leakage_params(struct radeon_device *r
+ u16 *vddc, u16 *vddci,
+ u16 virtual_voltage_id,
+ u16 vbios_voltage_id);
++int radeon_atom_get_voltage_evv(struct radeon_device *rdev,
++ u16 virtual_voltage_id,
++ u16 *voltage);
+ int radeon_atom_round_to_true_voltage(struct radeon_device *rdev,
+ u8 voltage_type,
+ u16 nominal_voltage,
+diff --git a/drivers/gpu/drm/radeon/radeon_atombios.c b/drivers/gpu/drm/radeon/radeon_atombios.c
+index 173f378428a9..be6705eeb649 100644
+--- a/drivers/gpu/drm/radeon/radeon_atombios.c
++++ b/drivers/gpu/drm/radeon/radeon_atombios.c
+@@ -447,6 +447,13 @@ static bool radeon_atom_apply_quirks(struct drm_device *dev,
+ }
+ }
+
++ /* Fujitsu D3003-S2 board lists DVI-I as DVI-I and VGA */
++ if ((dev->pdev->device == 0x9805) &&
++ (dev->pdev->subsystem_vendor == 0x1734) &&
++ (dev->pdev->subsystem_device == 0x11bd)) {
++ if (*connector_type == DRM_MODE_CONNECTOR_VGA)
++ return false;
++ }
+
+ return true;
+ }
+@@ -1963,7 +1970,7 @@ static const char *thermal_controller_names[] = {
+ "adm1032",
+ "adm1030",
+ "max6649",
+- "lm64",
++ "lm63", /* lm64 */
+ "f75375",
+ "asc7xxx",
+ };
+@@ -1974,7 +1981,7 @@ static const char *pp_lib_thermal_controller_names[] = {
+ "adm1032",
+ "adm1030",
+ "max6649",
+- "lm64",
++ "lm63", /* lm64 */
+ "f75375",
+ "RV6xx",
+ "RV770",
+@@ -2281,19 +2288,31 @@ static void radeon_atombios_add_pplib_thermal_controller(struct radeon_device *r
+ (controller->ucFanParameters &
+ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
+ rdev->pm.int_thermal_type = THERMAL_TYPE_KV;
+- } else if ((controller->ucType ==
+- ATOM_PP_THERMALCONTROLLER_EXTERNAL_GPIO) ||
+- (controller->ucType ==
+- ATOM_PP_THERMALCONTROLLER_ADT7473_WITH_INTERNAL) ||
+- (controller->ucType ==
+- ATOM_PP_THERMALCONTROLLER_EMC2103_WITH_INTERNAL)) {
+- DRM_INFO("Special thermal controller config\n");
++ } else if (controller->ucType ==
++ ATOM_PP_THERMALCONTROLLER_EXTERNAL_GPIO) {
++ DRM_INFO("External GPIO thermal controller %s fan control\n",
++ (controller->ucFanParameters &
++ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_EXTERNAL_GPIO;
++ } else if (controller->ucType ==
++ ATOM_PP_THERMALCONTROLLER_ADT7473_WITH_INTERNAL) {
++ DRM_INFO("ADT7473 with internal thermal controller %s fan control\n",
++ (controller->ucFanParameters &
++ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_ADT7473_WITH_INTERNAL;
++ } else if (controller->ucType ==
++ ATOM_PP_THERMALCONTROLLER_EMC2103_WITH_INTERNAL) {
++ DRM_INFO("EMC2103 with internal thermal controller %s fan control\n",
++ (controller->ucFanParameters &
++ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_EMC2103_WITH_INTERNAL;
+ } else if (controller->ucType < ARRAY_SIZE(pp_lib_thermal_controller_names)) {
+ DRM_INFO("Possible %s thermal controller at 0x%02x %s fan control\n",
+ pp_lib_thermal_controller_names[controller->ucType],
+ controller->ucI2cAddress >> 1,
+ (controller->ucFanParameters &
+ ATOM_PP_FANPARAMETERS_NOFAN) ? "without" : "with");
++ rdev->pm.int_thermal_type = THERMAL_TYPE_EXTERNAL;
+ i2c_bus = radeon_lookup_i2c_gpio(rdev, controller->ucI2cLine);
+ rdev->pm.i2c_bus = radeon_i2c_lookup(rdev, &i2c_bus);
+ if (rdev->pm.i2c_bus) {
+@@ -3236,6 +3255,41 @@ int radeon_atom_get_leakage_vddc_based_on_leakage_params(struct radeon_device *r
+ return 0;
+ }
+
++union get_voltage_info {
++ struct _GET_VOLTAGE_INFO_INPUT_PARAMETER_V1_2 in;
++ struct _GET_EVV_VOLTAGE_INFO_OUTPUT_PARAMETER_V1_2 evv_out;
++};
++
++int radeon_atom_get_voltage_evv(struct radeon_device *rdev,
++ u16 virtual_voltage_id,
++ u16 *voltage)
++{
++ int index = GetIndexIntoMasterTable(COMMAND, GetVoltageInfo);
++ u32 entry_id;
++ u32 count = rdev->pm.dpm.dyn_state.vddc_dependency_on_sclk.count;
++ union get_voltage_info args;
++
++ for (entry_id = 0; entry_id < count; entry_id++) {
++ if (rdev->pm.dpm.dyn_state.vddc_dependency_on_sclk.entries[entry_id].v ==
++ virtual_voltage_id)
++ break;
++ }
++
++ if (entry_id >= count)
++ return -EINVAL;
++
++ args.in.ucVoltageType = VOLTAGE_TYPE_VDDC;
++ args.in.ucVoltageMode = ATOM_GET_VOLTAGE_EVV_VOLTAGE;
++ args.in.ulSCLKFreq =
++ cpu_to_le32(rdev->pm.dpm.dyn_state.vddc_dependency_on_sclk.entries[entry_id].clk);
++
++ atom_execute_table(rdev->mode_info.atom_context, index, (uint32_t *)&args);
++
++ *voltage = le16_to_cpu(args.evv_out.usVoltageLevel);
++
++ return 0;
++}
++
+ int radeon_atom_get_voltage_gpio_settings(struct radeon_device *rdev,
+ u16 voltage_level, u8 voltage_type,
+ u32 *gpio_value, u32 *gpio_mask)
+diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
+index ae763f60c8a0..8f7d56f342f1 100644
+--- a/drivers/gpu/drm/radeon/radeon_cs.c
++++ b/drivers/gpu/drm/radeon/radeon_cs.c
+@@ -132,7 +132,8 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
+ * the buffers used for read only, which doubles the range
+ * to 0 to 31. 32 is reserved for the kernel driver.
+ */
+- priority = (r->flags & 0xf) * 2 + !!r->write_domain;
++ priority = (r->flags & RADEON_RELOC_PRIO_MASK) * 2
++ + !!r->write_domain;
+
+ /* the first reloc of an UVD job is the msg and that must be in
+ VRAM, also but everything into VRAM on AGP cards to avoid
+diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
+index 697add2cd4e3..52a0cfd0276a 100644
+--- a/drivers/gpu/drm/radeon/radeon_device.c
++++ b/drivers/gpu/drm/radeon/radeon_device.c
+@@ -1350,7 +1350,7 @@ int radeon_device_init(struct radeon_device *rdev,
+
+ r = radeon_init(rdev);
+ if (r)
+- return r;
++ goto failed;
+
+ r = radeon_ib_ring_tests(rdev);
+ if (r)
+@@ -1370,7 +1370,7 @@ int radeon_device_init(struct radeon_device *rdev,
+ radeon_agp_disable(rdev);
+ r = radeon_init(rdev);
+ if (r)
+- return r;
++ goto failed;
+ }
+
+ if ((radeon_testing & 1)) {
+@@ -1392,6 +1392,11 @@ int radeon_device_init(struct radeon_device *rdev,
+ DRM_INFO("radeon: acceleration disabled, skipping benchmarks\n");
+ }
+ return 0;
++
++failed:
++ if (runtime)
++ vga_switcheroo_fini_domain_pm_ops(rdev->dev);
++ return r;
+ }
+
+ static void radeon_debugfs_remove_files(struct radeon_device *rdev);
+@@ -1412,6 +1417,8 @@ void radeon_device_fini(struct radeon_device *rdev)
+ radeon_bo_evict_vram(rdev);
+ radeon_fini(rdev);
+ vga_switcheroo_unregister_client(rdev->pdev);
++ if (rdev->flags & RADEON_IS_PX)
++ vga_switcheroo_fini_domain_pm_ops(rdev->dev);
+ vga_client_register(rdev->pdev, NULL, NULL, NULL);
+ if (rdev->rio_mem)
+ pci_iounmap(rdev->pdev, rdev->rio_mem);
+@@ -1637,7 +1644,6 @@ int radeon_gpu_reset(struct radeon_device *rdev)
+ radeon_save_bios_scratch_regs(rdev);
+ /* block TTM */
+ resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
+- radeon_pm_suspend(rdev);
+ radeon_suspend(rdev);
+
+ for (i = 0; i < RADEON_NUM_RINGS; ++i) {
+@@ -1683,9 +1689,24 @@ retry:
+ }
+ }
+
+- radeon_pm_resume(rdev);
++ if ((rdev->pm.pm_method == PM_METHOD_DPM) && rdev->pm.dpm_enabled) {
++ /* do dpm late init */
++ r = radeon_pm_late_init(rdev);
++ if (r) {
++ rdev->pm.dpm_enabled = false;
++ DRM_ERROR("radeon_pm_late_init failed, disabling dpm\n");
++ }
++ } else {
++ /* resume old pm late */
++ radeon_pm_resume(rdev);
++ }
++
+ drm_helper_resume_force_mode(rdev->ddev);
+
++ /* set the power state here in case we are a PX system or headless */
++ if ((rdev->pm.pm_method == PM_METHOD_DPM) && rdev->pm.dpm_enabled)
++ radeon_pm_compute_clocks(rdev);
++
+ ttm_bo_unlock_delayed_workqueue(&rdev->mman.bdev, resched);
+ if (r) {
+ /* bad news, how to tell it to userspace ? */
+diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
+index e9e361084249..a089abb76363 100644
+--- a/drivers/gpu/drm/radeon/radeon_drv.c
++++ b/drivers/gpu/drm/radeon/radeon_drv.c
+@@ -429,6 +429,7 @@ static int radeon_pmops_runtime_suspend(struct device *dev)
+ ret = radeon_suspend_kms(drm_dev, false, false);
+ pci_save_state(pdev);
+ pci_disable_device(pdev);
++ pci_ignore_hotplug(pdev);
+ pci_set_power_state(pdev, PCI_D3cold);
+ drm_dev->switch_power_state = DRM_SWITCH_POWER_DYNAMIC_OFF;
+
+diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c
+index d25ae6acfd5a..c1a206dd859d 100644
+--- a/drivers/gpu/drm/radeon/radeon_kms.c
++++ b/drivers/gpu/drm/radeon/radeon_kms.c
+@@ -254,7 +254,14 @@ static int radeon_info_ioctl(struct drm_device *dev, void *data, struct drm_file
+ }
+ break;
+ case RADEON_INFO_ACCEL_WORKING2:
+- *value = rdev->accel_working;
++ if (rdev->family == CHIP_HAWAII) {
++ if (rdev->accel_working)
++ *value = 2;
++ else
++ *value = 0;
++ } else {
++ *value = rdev->accel_working;
++ }
+ break;
+ case RADEON_INFO_TILING_CONFIG:
+ if (rdev->family >= CHIP_BONAIRE)
+diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
+index e447e390d09a..50d6ff9d7656 100644
+--- a/drivers/gpu/drm/radeon/radeon_pm.c
++++ b/drivers/gpu/drm/radeon/radeon_pm.c
+@@ -460,10 +460,6 @@ static ssize_t radeon_get_dpm_state(struct device *dev,
+ struct radeon_device *rdev = ddev->dev_private;
+ enum radeon_pm_state_type pm = rdev->pm.dpm.user_state;
+
+- if ((rdev->flags & RADEON_IS_PX) &&
+- (ddev->switch_power_state != DRM_SWITCH_POWER_ON))
+- return snprintf(buf, PAGE_SIZE, "off\n");
+-
+ return snprintf(buf, PAGE_SIZE, "%s\n",
+ (pm == POWER_STATE_TYPE_BATTERY) ? "battery" :
+ (pm == POWER_STATE_TYPE_BALANCED) ? "balanced" : "performance");
+@@ -477,11 +473,6 @@ static ssize_t radeon_set_dpm_state(struct device *dev,
+ struct drm_device *ddev = dev_get_drvdata(dev);
+ struct radeon_device *rdev = ddev->dev_private;
+
+- /* Can't set dpm state when the card is off */
+- if ((rdev->flags & RADEON_IS_PX) &&
+- (ddev->switch_power_state != DRM_SWITCH_POWER_ON))
+- return -EINVAL;
+-
+ mutex_lock(&rdev->pm.mutex);
+ if (strncmp("battery", buf, strlen("battery")) == 0)
+ rdev->pm.dpm.user_state = POWER_STATE_TYPE_BATTERY;
+@@ -495,7 +486,12 @@ static ssize_t radeon_set_dpm_state(struct device *dev,
+ goto fail;
+ }
+ mutex_unlock(&rdev->pm.mutex);
+- radeon_pm_compute_clocks(rdev);
++
++ /* Can't set dpm state when the card is off */
++ if (!(rdev->flags & RADEON_IS_PX) ||
++ (ddev->switch_power_state == DRM_SWITCH_POWER_ON))
++ radeon_pm_compute_clocks(rdev);
++
+ fail:
+ return count;
+ }
+@@ -1303,10 +1299,6 @@ int radeon_pm_init(struct radeon_device *rdev)
+ case CHIP_RS780:
+ case CHIP_RS880:
+ case CHIP_RV770:
+- case CHIP_BARTS:
+- case CHIP_TURKS:
+- case CHIP_CAICOS:
+- case CHIP_CAYMAN:
+ /* DPM requires the RLC, RV770+ dGPU requires SMC */
+ if (!rdev->rlc_fw)
+ rdev->pm.pm_method = PM_METHOD_PROFILE;
+@@ -1330,6 +1322,10 @@ int radeon_pm_init(struct radeon_device *rdev)
+ case CHIP_PALM:
+ case CHIP_SUMO:
+ case CHIP_SUMO2:
++ case CHIP_BARTS:
++ case CHIP_TURKS:
++ case CHIP_CAICOS:
++ case CHIP_CAYMAN:
+ case CHIP_ARUBA:
+ case CHIP_TAHITI:
+ case CHIP_PITCAIRN:
+diff --git a/drivers/gpu/drm/radeon/radeon_semaphore.c b/drivers/gpu/drm/radeon/radeon_semaphore.c
+index dbd6bcde92de..e6101c18c457 100644
+--- a/drivers/gpu/drm/radeon/radeon_semaphore.c
++++ b/drivers/gpu/drm/radeon/radeon_semaphore.c
+@@ -34,7 +34,7 @@
+ int radeon_semaphore_create(struct radeon_device *rdev,
+ struct radeon_semaphore **semaphore)
+ {
+- uint32_t *cpu_addr;
++ uint64_t *cpu_addr;
+ int i, r;
+
+ *semaphore = kmalloc(sizeof(struct radeon_semaphore), GFP_KERNEL);
+diff --git a/drivers/gpu/drm/radeon/rv770.c b/drivers/gpu/drm/radeon/rv770.c
+index da8703d8d455..11cd3d887428 100644
+--- a/drivers/gpu/drm/radeon/rv770.c
++++ b/drivers/gpu/drm/radeon/rv770.c
+@@ -1178,7 +1178,6 @@ static void rv770_gpu_init(struct radeon_device *rdev)
+ u32 hdp_host_path_cntl;
+ u32 sq_dyn_gpr_size_simd_ab_0;
+ u32 gb_tiling_config = 0;
+- u32 cc_rb_backend_disable = 0;
+ u32 cc_gc_shader_pipe_config = 0;
+ u32 mc_arb_ramcfg;
+ u32 db_debug4, tmp;
+@@ -1312,21 +1311,7 @@ static void rv770_gpu_init(struct radeon_device *rdev)
+ WREG32(SPI_CONFIG_CNTL, 0);
+ }
+
+- cc_rb_backend_disable = RREG32(CC_RB_BACKEND_DISABLE) & 0x00ff0000;
+- tmp = R7XX_MAX_BACKENDS - r600_count_pipe_bits(cc_rb_backend_disable >> 16);
+- if (tmp < rdev->config.rv770.max_backends) {
+- rdev->config.rv770.max_backends = tmp;
+- }
+-
+ cc_gc_shader_pipe_config = RREG32(CC_GC_SHADER_PIPE_CONFIG) & 0xffffff00;
+- tmp = R7XX_MAX_PIPES - r600_count_pipe_bits((cc_gc_shader_pipe_config >> 8) & R7XX_MAX_PIPES_MASK);
+- if (tmp < rdev->config.rv770.max_pipes) {
+- rdev->config.rv770.max_pipes = tmp;
+- }
+- tmp = R7XX_MAX_SIMDS - r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R7XX_MAX_SIMDS_MASK);
+- if (tmp < rdev->config.rv770.max_simds) {
+- rdev->config.rv770.max_simds = tmp;
+- }
+ tmp = rdev->config.rv770.max_simds -
+ r600_count_pipe_bits((cc_gc_shader_pipe_config >> 16) & R7XX_MAX_SIMDS_MASK);
+ rdev->config.rv770.active_simds = tmp;
+@@ -1349,6 +1334,14 @@ static void rv770_gpu_init(struct radeon_device *rdev)
+ rdev->config.rv770.tiling_npipes = rdev->config.rv770.max_tile_pipes;
+
+ disabled_rb_mask = (RREG32(CC_RB_BACKEND_DISABLE) >> 16) & R7XX_MAX_BACKENDS_MASK;
++ tmp = 0;
++ for (i = 0; i < rdev->config.rv770.max_backends; i++)
++ tmp |= (1 << i);
++ /* if all the backends are disabled, fix it up here */
++ if ((disabled_rb_mask & tmp) == tmp) {
++ for (i = 0; i < rdev->config.rv770.max_backends; i++)
++ disabled_rb_mask &= ~(1 << i);
++ }
+ tmp = (gb_tiling_config & PIPE_TILING__MASK) >> PIPE_TILING__SHIFT;
+ tmp = r6xx_remap_render_backend(rdev, tmp, rdev->config.rv770.max_backends,
+ R7XX_MAX_BACKENDS, disabled_rb_mask);
+diff --git a/drivers/gpu/drm/radeon/si.c b/drivers/gpu/drm/radeon/si.c
+index 9e854fd016da..6c17d3b0be8b 100644
+--- a/drivers/gpu/drm/radeon/si.c
++++ b/drivers/gpu/drm/radeon/si.c
+@@ -2901,7 +2901,7 @@ static void si_gpu_init(struct radeon_device *rdev)
+ u32 sx_debug_1;
+ u32 hdp_host_path_cntl;
+ u32 tmp;
+- int i, j, k;
++ int i, j;
+
+ switch (rdev->family) {
+ case CHIP_TAHITI:
+@@ -3099,12 +3099,11 @@ static void si_gpu_init(struct radeon_device *rdev)
+ rdev->config.si.max_sh_per_se,
+ rdev->config.si.max_cu_per_sh);
+
++ rdev->config.si.active_cus = 0;
+ for (i = 0; i < rdev->config.si.max_shader_engines; i++) {
+ for (j = 0; j < rdev->config.si.max_sh_per_se; j++) {
+- for (k = 0; k < rdev->config.si.max_cu_per_sh; k++) {
+- rdev->config.si.active_cus +=
+- hweight32(si_get_cu_active_bitmap(rdev, i, j));
+- }
++ rdev->config.si.active_cus +=
++ hweight32(si_get_cu_active_bitmap(rdev, i, j));
+ }
+ }
+
+@@ -4815,7 +4814,7 @@ void si_vm_flush(struct radeon_device *rdev, int ridx, struct radeon_vm *vm)
+
+ /* write new base address */
+ radeon_ring_write(ring, PACKET3(PACKET3_WRITE_DATA, 3));
+- radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(0) |
++ radeon_ring_write(ring, (WRITE_DATA_ENGINE_SEL(1) |
+ WRITE_DATA_DST_SEL(0)));
+
+ if (vm->id < 8) {
+diff --git a/drivers/gpu/drm/tegra/dc.c b/drivers/gpu/drm/tegra/dc.c
+index ef40381f3909..48c3bc460eef 100644
+--- a/drivers/gpu/drm/tegra/dc.c
++++ b/drivers/gpu/drm/tegra/dc.c
+@@ -1303,6 +1303,7 @@ static const struct of_device_id tegra_dc_of_match[] = {
+ /* sentinel */
+ }
+ };
++MODULE_DEVICE_TABLE(of, tegra_dc_of_match);
+
+ static int tegra_dc_parse_dt(struct tegra_dc *dc)
+ {
+diff --git a/drivers/gpu/drm/tegra/dpaux.c b/drivers/gpu/drm/tegra/dpaux.c
+index 3f132e356e9c..708f783ead47 100644
+--- a/drivers/gpu/drm/tegra/dpaux.c
++++ b/drivers/gpu/drm/tegra/dpaux.c
+@@ -382,6 +382,7 @@ static const struct of_device_id tegra_dpaux_of_match[] = {
+ { .compatible = "nvidia,tegra124-dpaux", },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_dpaux_of_match);
+
+ struct platform_driver tegra_dpaux_driver = {
+ .driver = {
+diff --git a/drivers/gpu/drm/tegra/dsi.c b/drivers/gpu/drm/tegra/dsi.c
+index bd56f2affa78..97c409f10456 100644
+--- a/drivers/gpu/drm/tegra/dsi.c
++++ b/drivers/gpu/drm/tegra/dsi.c
+@@ -982,6 +982,7 @@ static const struct of_device_id tegra_dsi_of_match[] = {
+ { .compatible = "nvidia,tegra114-dsi", },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_dsi_of_match);
+
+ struct platform_driver tegra_dsi_driver = {
+ .driver = {
+diff --git a/drivers/gpu/drm/tegra/gr2d.c b/drivers/gpu/drm/tegra/gr2d.c
+index 7c53941f2a9e..02cd3e37a6ec 100644
+--- a/drivers/gpu/drm/tegra/gr2d.c
++++ b/drivers/gpu/drm/tegra/gr2d.c
+@@ -121,6 +121,7 @@ static const struct of_device_id gr2d_match[] = {
+ { .compatible = "nvidia,tegra20-gr2d" },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, gr2d_match);
+
+ static const u32 gr2d_addr_regs[] = {
+ GR2D_UA_BASE_ADDR,
+diff --git a/drivers/gpu/drm/tegra/gr3d.c b/drivers/gpu/drm/tegra/gr3d.c
+index 30f5ba9bd6d0..2bea2b2d204e 100644
+--- a/drivers/gpu/drm/tegra/gr3d.c
++++ b/drivers/gpu/drm/tegra/gr3d.c
+@@ -130,6 +130,7 @@ static const struct of_device_id tegra_gr3d_match[] = {
+ { .compatible = "nvidia,tegra20-gr3d" },
+ { }
+ };
++MODULE_DEVICE_TABLE(of, tegra_gr3d_match);
+
+ static const u32 gr3d_addr_regs[] = {
+ GR3D_IDX_ATTRIBUTE( 0),
+diff --git a/drivers/gpu/drm/tegra/hdmi.c b/drivers/gpu/drm/tegra/hdmi.c
+index ba067bb767e3..ffe26547328d 100644
+--- a/drivers/gpu/drm/tegra/hdmi.c
++++ b/drivers/gpu/drm/tegra/hdmi.c
+@@ -1450,6 +1450,7 @@ static const struct of_device_id tegra_hdmi_of_match[] = {
+ { .compatible = "nvidia,tegra20-hdmi", .data = &tegra20_hdmi_config },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_hdmi_of_match);
+
+ static int tegra_hdmi_probe(struct platform_device *pdev)
+ {
+diff --git a/drivers/gpu/drm/tegra/sor.c b/drivers/gpu/drm/tegra/sor.c
+index 27c979b50111..061a5c501124 100644
+--- a/drivers/gpu/drm/tegra/sor.c
++++ b/drivers/gpu/drm/tegra/sor.c
+@@ -1455,6 +1455,7 @@ static const struct of_device_id tegra_sor_of_match[] = {
+ { .compatible = "nvidia,tegra124-sor", },
+ { },
+ };
++MODULE_DEVICE_TABLE(of, tegra_sor_of_match);
+
+ struct platform_driver tegra_sor_driver = {
+ .driver = {
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_drv.c b/drivers/gpu/drm/tilcdc/tilcdc_drv.c
+index b20b69488dc9..006a30e90390 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_drv.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_drv.c
+@@ -122,6 +122,7 @@ static int tilcdc_unload(struct drm_device *dev)
+ struct tilcdc_drm_private *priv = dev->dev_private;
+ struct tilcdc_module *mod, *cur;
+
++ drm_fbdev_cma_fini(priv->fbdev);
+ drm_kms_helper_poll_fini(dev);
+ drm_mode_config_cleanup(dev);
+ drm_vblank_cleanup(dev);
+@@ -628,10 +629,10 @@ static int __init tilcdc_drm_init(void)
+ static void __exit tilcdc_drm_fini(void)
+ {
+ DBG("fini");
+- tilcdc_tfp410_fini();
+- tilcdc_slave_fini();
+- tilcdc_panel_fini();
+ platform_driver_unregister(&tilcdc_platform_driver);
++ tilcdc_panel_fini();
++ tilcdc_slave_fini();
++ tilcdc_tfp410_fini();
+ }
+
+ late_initcall(tilcdc_drm_init);
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_panel.c b/drivers/gpu/drm/tilcdc/tilcdc_panel.c
+index 86c67329b605..b085dcc54fb5 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_panel.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_panel.c
+@@ -151,6 +151,7 @@ struct panel_connector {
+ static void panel_connector_destroy(struct drm_connector *connector)
+ {
+ struct panel_connector *panel_connector = to_panel_connector(connector);
++ drm_sysfs_connector_remove(connector);
+ drm_connector_cleanup(connector);
+ kfree(panel_connector);
+ }
+@@ -285,10 +286,8 @@ static void panel_destroy(struct tilcdc_module *mod)
+ {
+ struct panel_module *panel_mod = to_panel_module(mod);
+
+- if (panel_mod->timings) {
++ if (panel_mod->timings)
+ display_timings_release(panel_mod->timings);
+- kfree(panel_mod->timings);
+- }
+
+ tilcdc_module_cleanup(mod);
+ kfree(panel_mod->info);
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_slave.c b/drivers/gpu/drm/tilcdc/tilcdc_slave.c
+index 595068ba2d5e..2f83ffb7f37e 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_slave.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_slave.c
+@@ -166,6 +166,7 @@ struct slave_connector {
+ static void slave_connector_destroy(struct drm_connector *connector)
+ {
+ struct slave_connector *slave_connector = to_slave_connector(connector);
++ drm_sysfs_connector_remove(connector);
+ drm_connector_cleanup(connector);
+ kfree(slave_connector);
+ }
+diff --git a/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c b/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c
+index c38b56b268ac..ce75ac8de4f8 100644
+--- a/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c
++++ b/drivers/gpu/drm/tilcdc/tilcdc_tfp410.c
+@@ -167,6 +167,7 @@ struct tfp410_connector {
+ static void tfp410_connector_destroy(struct drm_connector *connector)
+ {
+ struct tfp410_connector *tfp410_connector = to_tfp410_connector(connector);
++ drm_sysfs_connector_remove(connector);
+ drm_connector_cleanup(connector);
+ kfree(tfp410_connector);
+ }
+diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
+index 4ab9f7171c4f..a13a10025ec7 100644
+--- a/drivers/gpu/drm/ttm/ttm_bo.c
++++ b/drivers/gpu/drm/ttm/ttm_bo.c
+@@ -784,7 +784,7 @@ static int ttm_bo_mem_force_space(struct ttm_buffer_object *bo,
+ int ret;
+
+ do {
+- ret = (*man->func->get_node)(man, bo, placement, mem);
++ ret = (*man->func->get_node)(man, bo, placement, 0, mem);
+ if (unlikely(ret != 0))
+ return ret;
+ if (mem->mm_node)
+@@ -897,7 +897,8 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
+
+ if (man->has_type && man->use_type) {
+ type_found = true;
+- ret = (*man->func->get_node)(man, bo, placement, mem);
++ ret = (*man->func->get_node)(man, bo, placement,
++ cur_flags, mem);
+ if (unlikely(ret))
+ return ret;
+ }
+@@ -937,7 +938,6 @@ int ttm_bo_mem_space(struct ttm_buffer_object *bo,
+ ttm_flag_masked(&cur_flags, placement->busy_placement[i],
+ ~TTM_PL_MASK_MEMTYPE);
+
+-
+ if (mem_type == TTM_PL_SYSTEM) {
+ mem->mem_type = mem_type;
+ mem->placement = cur_flags;
+diff --git a/drivers/gpu/drm/ttm/ttm_bo_manager.c b/drivers/gpu/drm/ttm/ttm_bo_manager.c
+index bd850c9f4bca..9e103a4875c8 100644
+--- a/drivers/gpu/drm/ttm/ttm_bo_manager.c
++++ b/drivers/gpu/drm/ttm/ttm_bo_manager.c
+@@ -50,6 +50,7 @@ struct ttm_range_manager {
+ static int ttm_bo_man_get_node(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct ttm_range_manager *rman = (struct ttm_range_manager *) man->priv;
+@@ -67,7 +68,7 @@ static int ttm_bo_man_get_node(struct ttm_mem_type_manager *man,
+ if (!node)
+ return -ENOMEM;
+
+- if (bo->mem.placement & TTM_PL_FLAG_TOPDOWN)
++ if (flags & TTM_PL_FLAG_TOPDOWN)
+ aflags = DRM_MM_CREATE_TOP;
+
+ spin_lock(&rman->lock);
+diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
+index 863bef9f9234..cf4bad2c1d59 100644
+--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
++++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
+@@ -297,8 +297,10 @@ static void ttm_pool_update_free_locked(struct ttm_page_pool *pool,
+ *
+ * @pool: to free the pages from
+ * @free_all: If set to true will free all pages in pool
++ * @gfp: GFP flags.
+ **/
+-static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
++static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free,
++ gfp_t gfp)
+ {
+ unsigned long irq_flags;
+ struct page *p;
+@@ -309,8 +311,7 @@ static int ttm_page_pool_free(struct ttm_page_pool *pool, unsigned nr_free)
+ if (NUM_PAGES_TO_ALLOC < nr_free)
+ npages_to_free = NUM_PAGES_TO_ALLOC;
+
+- pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
+- GFP_KERNEL);
++ pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
+ if (!pages_to_free) {
+ pr_err("Failed to allocate memory for pool free operation\n");
+ return 0;
+@@ -382,32 +383,35 @@ out:
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+- * ttm_page_pool_free() does memory allocation using GFP_KERNEL. that means
+- * this can deadlock when called a sc->gfp_mask that is not equal to
+- * GFP_KERNEL.
++ * We need to pass sc->gfp_mask to ttm_page_pool_free().
+ *
+ * This code is crying out for a shrinker per pool....
+ */
+ static unsigned long
+ ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ {
+- static atomic_t start_pool = ATOMIC_INIT(0);
++ static DEFINE_MUTEX(lock);
++ static unsigned start_pool;
+ unsigned i;
+- unsigned pool_offset = atomic_add_return(1, &start_pool);
++ unsigned pool_offset;
+ struct ttm_page_pool *pool;
+ int shrink_pages = sc->nr_to_scan;
+ unsigned long freed = 0;
+
+- pool_offset = pool_offset % NUM_POOLS;
++ if (!mutex_trylock(&lock))
++ return SHRINK_STOP;
++ pool_offset = ++start_pool % NUM_POOLS;
+ /* select start pool in round robin fashion */
+ for (i = 0; i < NUM_POOLS; ++i) {
+ unsigned nr_free = shrink_pages;
+ if (shrink_pages == 0)
+ break;
+ pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
+- shrink_pages = ttm_page_pool_free(pool, nr_free);
++ shrink_pages = ttm_page_pool_free(pool, nr_free,
++ sc->gfp_mask);
+ freed += nr_free - shrink_pages;
+ }
++ mutex_unlock(&lock);
+ return freed;
+ }
+
+@@ -706,7 +710,7 @@ static void ttm_put_pages(struct page **pages, unsigned npages, int flags,
+ }
+ spin_unlock_irqrestore(&pool->lock, irq_flags);
+ if (npages)
+- ttm_page_pool_free(pool, npages);
++ ttm_page_pool_free(pool, npages, GFP_KERNEL);
+ }
+
+ /*
+@@ -846,7 +850,8 @@ void ttm_page_alloc_fini(void)
+ ttm_pool_mm_shrink_fini(_manager);
+
+ for (i = 0; i < NUM_POOLS; ++i)
+- ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES);
++ ttm_page_pool_free(&_manager->pools[i], FREE_ALL_PAGES,
++ GFP_KERNEL);
+
+ kobject_put(&_manager->kobj);
+ _manager = NULL;
+diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+index fb8259f69839..ca65df144765 100644
+--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
++++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+@@ -411,8 +411,10 @@ static void ttm_dma_page_put(struct dma_pool *pool, struct dma_page *d_page)
+ *
+ * @pool: to free the pages from
+ * @nr_free: If set to true will free all pages in pool
++ * @gfp: GFP flags.
+ **/
+-static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
++static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free,
++ gfp_t gfp)
+ {
+ unsigned long irq_flags;
+ struct dma_page *dma_p, *tmp;
+@@ -430,8 +432,7 @@ static unsigned ttm_dma_page_pool_free(struct dma_pool *pool, unsigned nr_free)
+ npages_to_free, nr_free);
+ }
+ #endif
+- pages_to_free = kmalloc(npages_to_free * sizeof(struct page *),
+- GFP_KERNEL);
++ pages_to_free = kmalloc(npages_to_free * sizeof(struct page *), gfp);
+
+ if (!pages_to_free) {
+ pr_err("%s: Failed to allocate memory for pool free operation\n",
+@@ -530,7 +531,7 @@ static void ttm_dma_free_pool(struct device *dev, enum pool_type type)
+ if (pool->type != type)
+ continue;
+ /* Takes a spinlock.. */
+- ttm_dma_page_pool_free(pool, FREE_ALL_PAGES);
++ ttm_dma_page_pool_free(pool, FREE_ALL_PAGES, GFP_KERNEL);
+ WARN_ON(((pool->npages_in_use + pool->npages_free) != 0));
+ /* This code path is called after _all_ references to the
+ * struct device has been dropped - so nobody should be
+@@ -983,7 +984,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
+
+ /* shrink pool if necessary (only on !is_cached pools)*/
+ if (npages)
+- ttm_dma_page_pool_free(pool, npages);
++ ttm_dma_page_pool_free(pool, npages, GFP_KERNEL);
+ ttm->state = tt_unpopulated;
+ }
+ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
+@@ -993,10 +994,7 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+- * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
+- * needs to be paid to sc->gfp_mask to determine if this can be done or not.
+- * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context woul dbe really
+- * bad.
++ * We need to pass sc->gfp_mask to ttm_dma_page_pool_free().
+ *
+ * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
+ * shrinkers
+@@ -1004,9 +1002,9 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
+ static unsigned long
+ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ {
+- static atomic_t start_pool = ATOMIC_INIT(0);
++ static unsigned start_pool;
+ unsigned idx = 0;
+- unsigned pool_offset = atomic_add_return(1, &start_pool);
++ unsigned pool_offset;
+ unsigned shrink_pages = sc->nr_to_scan;
+ struct device_pools *p;
+ unsigned long freed = 0;
+@@ -1014,8 +1012,11 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ if (list_empty(&_manager->pools))
+ return SHRINK_STOP;
+
+- mutex_lock(&_manager->lock);
+- pool_offset = pool_offset % _manager->npools;
++ if (!mutex_trylock(&_manager->lock))
++ return SHRINK_STOP;
++ if (!_manager->npools)
++ goto out;
++ pool_offset = ++start_pool % _manager->npools;
+ list_for_each_entry(p, &_manager->pools, pools) {
+ unsigned nr_free;
+
+@@ -1027,13 +1028,15 @@ ttm_dma_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+ if (++idx < pool_offset)
+ continue;
+ nr_free = shrink_pages;
+- shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
++ shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free,
++ sc->gfp_mask);
+ freed += nr_free - shrink_pages;
+
+ pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
+ p->pool->dev_name, p->pool->name, current->pid,
+ nr_free, shrink_pages);
+ }
++out:
+ mutex_unlock(&_manager->lock);
+ return freed;
+ }
+@@ -1044,7 +1047,8 @@ ttm_dma_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+ struct device_pools *p;
+ unsigned long count = 0;
+
+- mutex_lock(&_manager->lock);
++ if (!mutex_trylock(&_manager->lock))
++ return 0;
+ list_for_each_entry(p, &_manager->pools, pools)
+ count += p->pool->npages_free;
+ mutex_unlock(&_manager->lock);
+diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c b/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c
+index 6ccd993e26bf..6eae14d2a3f7 100644
+--- a/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c
++++ b/drivers/gpu/drm/vmwgfx/vmwgfx_fifo.c
+@@ -180,8 +180,9 @@ void vmw_fifo_release(struct vmw_private *dev_priv, struct vmw_fifo_state *fifo)
+
+ mutex_lock(&dev_priv->hw_mutex);
+
++ vmw_write(dev_priv, SVGA_REG_SYNC, SVGA_SYNC_GENERIC);
+ while (vmw_read(dev_priv, SVGA_REG_BUSY) != 0)
+- vmw_write(dev_priv, SVGA_REG_SYNC, SVGA_SYNC_GENERIC);
++ ;
+
+ dev_priv->last_read_seqno = ioread32(fifo_mem + SVGA_FIFO_FENCE);
+
+diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
+index b1273e8e9a69..26f8bdde3529 100644
+--- a/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
++++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
+@@ -47,6 +47,7 @@ struct vmwgfx_gmrid_man {
+ static int vmw_gmrid_man_get_node(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem)
+ {
+ struct vmwgfx_gmrid_man *gman =
+diff --git a/drivers/gpu/vga/vga_switcheroo.c b/drivers/gpu/vga/vga_switcheroo.c
+index 6866448083b2..37ac7b5dbd06 100644
+--- a/drivers/gpu/vga/vga_switcheroo.c
++++ b/drivers/gpu/vga/vga_switcheroo.c
+@@ -660,6 +660,12 @@ int vga_switcheroo_init_domain_pm_ops(struct device *dev, struct dev_pm_domain *
+ }
+ EXPORT_SYMBOL(vga_switcheroo_init_domain_pm_ops);
+
++void vga_switcheroo_fini_domain_pm_ops(struct device *dev)
++{
++ dev->pm_domain = NULL;
++}
++EXPORT_SYMBOL(vga_switcheroo_fini_domain_pm_ops);
++
+ static int vga_switcheroo_runtime_resume_hdmi_audio(struct device *dev)
+ {
+ struct pci_dev *pdev = to_pci_dev(dev);
+diff --git a/drivers/hid/hid-logitech-dj.c b/drivers/hid/hid-logitech-dj.c
+index b7ba82960c79..9bf8637747a5 100644
+--- a/drivers/hid/hid-logitech-dj.c
++++ b/drivers/hid/hid-logitech-dj.c
+@@ -656,7 +656,6 @@ static int logi_dj_raw_event(struct hid_device *hdev,
+ struct dj_receiver_dev *djrcv_dev = hid_get_drvdata(hdev);
+ struct dj_report *dj_report = (struct dj_report *) data;
+ unsigned long flags;
+- bool report_processed = false;
+
+ dbg_hid("%s, size:%d\n", __func__, size);
+
+@@ -683,34 +682,42 @@ static int logi_dj_raw_event(struct hid_device *hdev,
+ * device (via hid_input_report() ) and return 1 so hid-core does not do
+ * anything else with it.
+ */
++
++ /* case 1) */
++ if (data[0] != REPORT_ID_DJ_SHORT)
++ return false;
++
+ if ((dj_report->device_index < DJ_DEVICE_INDEX_MIN) ||
+ (dj_report->device_index > DJ_DEVICE_INDEX_MAX)) {
+- dev_err(&hdev->dev, "%s: invalid device index:%d\n",
++ /*
++ * Device index is wrong, bail out.
++ * This driver can ignore safely the receiver notifications,
++ * so ignore those reports too.
++ */
++ if (dj_report->device_index != DJ_RECEIVER_INDEX)
++ dev_err(&hdev->dev, "%s: invalid device index:%d\n",
+ __func__, dj_report->device_index);
+ return false;
+ }
+
+ spin_lock_irqsave(&djrcv_dev->lock, flags);
+- if (dj_report->report_id == REPORT_ID_DJ_SHORT) {
+- switch (dj_report->report_type) {
+- case REPORT_TYPE_NOTIF_DEVICE_PAIRED:
+- case REPORT_TYPE_NOTIF_DEVICE_UNPAIRED:
+- logi_dj_recv_queue_notification(djrcv_dev, dj_report);
+- break;
+- case REPORT_TYPE_NOTIF_CONNECTION_STATUS:
+- if (dj_report->report_params[CONNECTION_STATUS_PARAM_STATUS] ==
+- STATUS_LINKLOSS) {
+- logi_dj_recv_forward_null_report(djrcv_dev, dj_report);
+- }
+- break;
+- default:
+- logi_dj_recv_forward_report(djrcv_dev, dj_report);
++ switch (dj_report->report_type) {
++ case REPORT_TYPE_NOTIF_DEVICE_PAIRED:
++ case REPORT_TYPE_NOTIF_DEVICE_UNPAIRED:
++ logi_dj_recv_queue_notification(djrcv_dev, dj_report);
++ break;
++ case REPORT_TYPE_NOTIF_CONNECTION_STATUS:
++ if (dj_report->report_params[CONNECTION_STATUS_PARAM_STATUS] ==
++ STATUS_LINKLOSS) {
++ logi_dj_recv_forward_null_report(djrcv_dev, dj_report);
+ }
+- report_processed = true;
++ break;
++ default:
++ logi_dj_recv_forward_report(djrcv_dev, dj_report);
+ }
+ spin_unlock_irqrestore(&djrcv_dev->lock, flags);
+
+- return report_processed;
++ return true;
+ }
+
+ static int logi_dj_probe(struct hid_device *hdev,
+diff --git a/drivers/hid/hid-logitech-dj.h b/drivers/hid/hid-logitech-dj.h
+index 4a4000340ce1..daeb0aa4bee9 100644
+--- a/drivers/hid/hid-logitech-dj.h
++++ b/drivers/hid/hid-logitech-dj.h
+@@ -27,6 +27,7 @@
+
+ #define DJ_MAX_PAIRED_DEVICES 6
+ #define DJ_MAX_NUMBER_NOTIFICATIONS 8
++#define DJ_RECEIVER_INDEX 0
+ #define DJ_DEVICE_INDEX_MIN 1
+ #define DJ_DEVICE_INDEX_MAX 6
+
+diff --git a/drivers/hid/hid-magicmouse.c b/drivers/hid/hid-magicmouse.c
+index ecc2cbf300cc..29a74c1efcb8 100644
+--- a/drivers/hid/hid-magicmouse.c
++++ b/drivers/hid/hid-magicmouse.c
+@@ -290,6 +290,11 @@ static int magicmouse_raw_event(struct hid_device *hdev,
+ if (size < 4 || ((size - 4) % 9) != 0)
+ return 0;
+ npoints = (size - 4) / 9;
++ if (npoints > 15) {
++ hid_warn(hdev, "invalid size value (%d) for TRACKPAD_REPORT_ID\n",
++ size);
++ return 0;
++ }
+ msc->ntouches = 0;
+ for (ii = 0; ii < npoints; ii++)
+ magicmouse_emit_touch(msc, ii, data + ii * 9 + 4);
+@@ -307,6 +312,11 @@ static int magicmouse_raw_event(struct hid_device *hdev,
+ if (size < 6 || ((size - 6) % 8) != 0)
+ return 0;
+ npoints = (size - 6) / 8;
++ if (npoints > 15) {
++ hid_warn(hdev, "invalid size value (%d) for MOUSE_REPORT_ID\n",
++ size);
++ return 0;
++ }
+ msc->ntouches = 0;
+ for (ii = 0; ii < npoints; ii++)
+ magicmouse_emit_touch(msc, ii, data + ii * 8 + 6);
+diff --git a/drivers/hid/hid-picolcd_core.c b/drivers/hid/hid-picolcd_core.c
+index acbb021065ec..020df3c2e8b4 100644
+--- a/drivers/hid/hid-picolcd_core.c
++++ b/drivers/hid/hid-picolcd_core.c
+@@ -350,6 +350,12 @@ static int picolcd_raw_event(struct hid_device *hdev,
+ if (!data)
+ return 1;
+
++ if (size > 64) {
++ hid_warn(hdev, "invalid size value (%d) for picolcd raw event\n",
++ size);
++ return 0;
++ }
++
+ if (report->id == REPORT_KEY_STATE) {
+ if (data->input_keys)
+ ret = picolcd_raw_keypad(data, report, raw_data+1, size-1);
+diff --git a/drivers/hwmon/ds1621.c b/drivers/hwmon/ds1621.c
+index fc6f5d54e7f7..8890870309e4 100644
+--- a/drivers/hwmon/ds1621.c
++++ b/drivers/hwmon/ds1621.c
+@@ -309,6 +309,7 @@ static ssize_t set_convrate(struct device *dev, struct device_attribute *da,
+ data->conf |= (resol << DS1621_REG_CONFIG_RESOL_SHIFT);
+ i2c_smbus_write_byte_data(client, DS1621_REG_CONF, data->conf);
+ data->update_interval = ds1721_convrates[resol];
++ data->zbits = 7 - resol;
+ mutex_unlock(&data->update_lock);
+
+ return count;
+diff --git a/drivers/i2c/busses/i2c-at91.c b/drivers/i2c/busses/i2c-at91.c
+index 83c989382be9..e96edab2e30b 100644
+--- a/drivers/i2c/busses/i2c-at91.c
++++ b/drivers/i2c/busses/i2c-at91.c
+@@ -101,6 +101,7 @@ struct at91_twi_dev {
+ unsigned twi_cwgr_reg;
+ struct at91_twi_pdata *pdata;
+ bool use_dma;
++ bool recv_len_abort;
+ struct at91_twi_dma dma;
+ };
+
+@@ -267,12 +268,24 @@ static void at91_twi_read_next_byte(struct at91_twi_dev *dev)
+ *dev->buf = at91_twi_read(dev, AT91_TWI_RHR) & 0xff;
+ --dev->buf_len;
+
++ /* return if aborting, we only needed to read RHR to clear RXRDY*/
++ if (dev->recv_len_abort)
++ return;
++
+ /* handle I2C_SMBUS_BLOCK_DATA */
+ if (unlikely(dev->msg->flags & I2C_M_RECV_LEN)) {
+- dev->msg->flags &= ~I2C_M_RECV_LEN;
+- dev->buf_len += *dev->buf;
+- dev->msg->len = dev->buf_len + 1;
+- dev_dbg(dev->dev, "received block length %d\n", dev->buf_len);
++ /* ensure length byte is a valid value */
++ if (*dev->buf <= I2C_SMBUS_BLOCK_MAX && *dev->buf > 0) {
++ dev->msg->flags &= ~I2C_M_RECV_LEN;
++ dev->buf_len += *dev->buf;
++ dev->msg->len = dev->buf_len + 1;
++ dev_dbg(dev->dev, "received block length %d\n",
++ dev->buf_len);
++ } else {
++ /* abort and send the stop by reading one more byte */
++ dev->recv_len_abort = true;
++ dev->buf_len = 1;
++ }
+ }
+
+ /* send stop if second but last byte has been read */
+@@ -421,8 +434,8 @@ static int at91_do_twi_transfer(struct at91_twi_dev *dev)
+ }
+ }
+
+- ret = wait_for_completion_interruptible_timeout(&dev->cmd_complete,
+- dev->adapter.timeout);
++ ret = wait_for_completion_io_timeout(&dev->cmd_complete,
++ dev->adapter.timeout);
+ if (ret == 0) {
+ dev_err(dev->dev, "controller timed out\n");
+ at91_init_twi_bus(dev);
+@@ -444,6 +457,12 @@ static int at91_do_twi_transfer(struct at91_twi_dev *dev)
+ ret = -EIO;
+ goto error;
+ }
++ if (dev->recv_len_abort) {
++ dev_err(dev->dev, "invalid smbus block length recvd\n");
++ ret = -EPROTO;
++ goto error;
++ }
++
+ dev_dbg(dev->dev, "transfer complete\n");
+
+ return 0;
+@@ -500,6 +519,7 @@ static int at91_twi_xfer(struct i2c_adapter *adap, struct i2c_msg *msg, int num)
+ dev->buf_len = m_start->len;
+ dev->buf = m_start->buf;
+ dev->msg = m_start;
++ dev->recv_len_abort = false;
+
+ ret = at91_do_twi_transfer(dev);
+
+diff --git a/drivers/i2c/busses/i2c-ismt.c b/drivers/i2c/busses/i2c-ismt.c
+index 984492553e95..d9ee43c80cde 100644
+--- a/drivers/i2c/busses/i2c-ismt.c
++++ b/drivers/i2c/busses/i2c-ismt.c
+@@ -497,7 +497,7 @@ static int ismt_access(struct i2c_adapter *adap, u16 addr,
+ desc->wr_len_cmd = dma_size;
+ desc->control |= ISMT_DESC_BLK;
+ priv->dma_buffer[0] = command;
+- memcpy(&priv->dma_buffer[1], &data->block[1], dma_size);
++ memcpy(&priv->dma_buffer[1], &data->block[1], dma_size - 1);
+ } else {
+ /* Block Read */
+ dev_dbg(dev, "I2C_SMBUS_BLOCK_DATA: READ\n");
+@@ -525,7 +525,7 @@ static int ismt_access(struct i2c_adapter *adap, u16 addr,
+ desc->wr_len_cmd = dma_size;
+ desc->control |= ISMT_DESC_I2C;
+ priv->dma_buffer[0] = command;
+- memcpy(&priv->dma_buffer[1], &data->block[1], dma_size);
++ memcpy(&priv->dma_buffer[1], &data->block[1], dma_size - 1);
+ } else {
+ /* i2c Block Read */
+ dev_dbg(dev, "I2C_SMBUS_I2C_BLOCK_DATA: READ\n");
+diff --git a/drivers/i2c/busses/i2c-mv64xxx.c b/drivers/i2c/busses/i2c-mv64xxx.c
+index 9f4b775e2e39..e21e206d94e7 100644
+--- a/drivers/i2c/busses/i2c-mv64xxx.c
++++ b/drivers/i2c/busses/i2c-mv64xxx.c
+@@ -746,8 +746,7 @@ mv64xxx_of_config(struct mv64xxx_i2c_data *drv_data,
+ }
+ tclk = clk_get_rate(drv_data->clk);
+
+- rc = of_property_read_u32(np, "clock-frequency", &bus_freq);
+- if (rc)
++ if (of_property_read_u32(np, "clock-frequency", &bus_freq))
+ bus_freq = 100000; /* 100kHz by default */
+
+ if (!mv64xxx_find_baud_factors(bus_freq, tclk,
+diff --git a/drivers/i2c/busses/i2c-rcar.c b/drivers/i2c/busses/i2c-rcar.c
+index 899405923678..772d76ad036f 100644
+--- a/drivers/i2c/busses/i2c-rcar.c
++++ b/drivers/i2c/busses/i2c-rcar.c
+@@ -34,6 +34,7 @@
+ #include <linux/platform_device.h>
+ #include <linux/pm_runtime.h>
+ #include <linux/slab.h>
++#include <linux/spinlock.h>
+
+ /* register offsets */
+ #define ICSCR 0x00 /* slave ctrl */
+@@ -75,8 +76,8 @@
+ #define RCAR_IRQ_RECV (MNR | MAL | MST | MAT | MDR)
+ #define RCAR_IRQ_STOP (MST)
+
+-#define RCAR_IRQ_ACK_SEND (~(MAT | MDE))
+-#define RCAR_IRQ_ACK_RECV (~(MAT | MDR))
++#define RCAR_IRQ_ACK_SEND (~(MAT | MDE) & 0xFF)
++#define RCAR_IRQ_ACK_RECV (~(MAT | MDR) & 0xFF)
+
+ #define ID_LAST_MSG (1 << 0)
+ #define ID_IOERROR (1 << 1)
+@@ -95,6 +96,7 @@ struct rcar_i2c_priv {
+ struct i2c_msg *msg;
+ struct clk *clk;
+
++ spinlock_t lock;
+ wait_queue_head_t wait;
+
+ int pos;
+@@ -365,20 +367,20 @@ static irqreturn_t rcar_i2c_irq(int irq, void *ptr)
+ struct rcar_i2c_priv *priv = ptr;
+ u32 msr;
+
++ /*-------------- spin lock -----------------*/
++ spin_lock(&priv->lock);
++
+ msr = rcar_i2c_read(priv, ICMSR);
+
++ /* Only handle interrupts that are currently enabled */
++ msr &= rcar_i2c_read(priv, ICMIER);
++
+ /* Arbitration lost */
+ if (msr & MAL) {
+ rcar_i2c_flags_set(priv, (ID_DONE | ID_ARBLOST));
+ goto out;
+ }
+
+- /* Stop */
+- if (msr & MST) {
+- rcar_i2c_flags_set(priv, ID_DONE);
+- goto out;
+- }
+-
+ /* Nack */
+ if (msr & MNR) {
+ /* go to stop phase */
+@@ -388,6 +390,12 @@ static irqreturn_t rcar_i2c_irq(int irq, void *ptr)
+ goto out;
+ }
+
++ /* Stop */
++ if (msr & MST) {
++ rcar_i2c_flags_set(priv, ID_DONE);
++ goto out;
++ }
++
+ if (rcar_i2c_is_recv(priv))
+ rcar_i2c_flags_set(priv, rcar_i2c_irq_recv(priv, msr));
+ else
+@@ -400,6 +408,9 @@ out:
+ wake_up(&priv->wait);
+ }
+
++ spin_unlock(&priv->lock);
++ /*-------------- spin unlock -----------------*/
++
+ return IRQ_HANDLED;
+ }
+
+@@ -409,14 +420,21 @@ static int rcar_i2c_master_xfer(struct i2c_adapter *adap,
+ {
+ struct rcar_i2c_priv *priv = i2c_get_adapdata(adap);
+ struct device *dev = rcar_i2c_priv_to_dev(priv);
++ unsigned long flags;
+ int i, ret, timeout;
+
+ pm_runtime_get_sync(dev);
+
++ /*-------------- spin lock -----------------*/
++ spin_lock_irqsave(&priv->lock, flags);
++
+ rcar_i2c_init(priv);
+ /* start clock */
+ rcar_i2c_write(priv, ICCCR, priv->icccr);
+
++ spin_unlock_irqrestore(&priv->lock, flags);
++ /*-------------- spin unlock -----------------*/
++
+ ret = rcar_i2c_bus_barrier(priv);
+ if (ret < 0)
+ goto out;
+@@ -428,6 +446,9 @@ static int rcar_i2c_master_xfer(struct i2c_adapter *adap,
+ break;
+ }
+
++ /*-------------- spin lock -----------------*/
++ spin_lock_irqsave(&priv->lock, flags);
++
+ /* init each data */
+ priv->msg = &msgs[i];
+ priv->pos = 0;
+@@ -437,6 +458,9 @@ static int rcar_i2c_master_xfer(struct i2c_adapter *adap,
+
+ ret = rcar_i2c_prepare_msg(priv);
+
++ spin_unlock_irqrestore(&priv->lock, flags);
++ /*-------------- spin unlock -----------------*/
++
+ if (ret < 0)
+ break;
+
+@@ -540,6 +564,7 @@ static int rcar_i2c_probe(struct platform_device *pdev)
+
+ irq = platform_get_irq(pdev, 0);
+ init_waitqueue_head(&priv->wait);
++ spin_lock_init(&priv->lock);
+
+ adap = &priv->adap;
+ adap->nr = pdev->id;
+diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
+index 69e11853e8bf..93cfc837200b 100644
+--- a/drivers/i2c/busses/i2c-rk3x.c
++++ b/drivers/i2c/busses/i2c-rk3x.c
+@@ -323,6 +323,10 @@ static void rk3x_i2c_handle_read(struct rk3x_i2c *i2c, unsigned int ipd)
+ /* ack interrupt */
+ i2c_writel(i2c, REG_INT_MBRF, REG_IPD);
+
++ /* Can only handle a maximum of 32 bytes at a time */
++ if (len > 32)
++ len = 32;
++
+ /* read the data from receive buffer */
+ for (i = 0; i < len; ++i) {
+ if (i % 4 == 0)
+@@ -429,12 +433,11 @@ static void rk3x_i2c_set_scl_rate(struct rk3x_i2c *i2c, unsigned long scl_rate)
+ unsigned long i2c_rate = clk_get_rate(i2c->clk);
+ unsigned int div;
+
+- /* SCL rate = (clk rate) / (8 * DIV) */
+- div = DIV_ROUND_UP(i2c_rate, scl_rate * 8);
+-
+- /* The lower and upper half of the CLKDIV reg describe the length of
+- * SCL low & high periods. */
+- div = DIV_ROUND_UP(div, 2);
++ /* set DIV = DIVH = DIVL
++ * SCL rate = (clk rate) / (8 * (DIVH + 1 + DIVL + 1))
++ * = (clk rate) / (16 * (DIV + 1))
++ */
++ div = DIV_ROUND_UP(i2c_rate, scl_rate * 16) - 1;
+
+ i2c_writel(i2c, (div << 16) | (div & 0xffff), REG_CLKDIV);
+ }
+diff --git a/drivers/iio/accel/bma180.c b/drivers/iio/accel/bma180.c
+index a077cc86421b..19100fddd2ed 100644
+--- a/drivers/iio/accel/bma180.c
++++ b/drivers/iio/accel/bma180.c
+@@ -571,7 +571,7 @@ static int bma180_probe(struct i2c_client *client,
+ trig->ops = &bma180_trigger_ops;
+ iio_trigger_set_drvdata(trig, indio_dev);
+ data->trig = trig;
+- indio_dev->trig = trig;
++ indio_dev->trig = iio_trigger_get(trig);
+
+ ret = iio_trigger_register(trig);
+ if (ret)
+diff --git a/drivers/iio/adc/ad_sigma_delta.c b/drivers/iio/adc/ad_sigma_delta.c
+index 9a4e0e32a771..eb799a43aef0 100644
+--- a/drivers/iio/adc/ad_sigma_delta.c
++++ b/drivers/iio/adc/ad_sigma_delta.c
+@@ -472,7 +472,7 @@ static int ad_sd_probe_trigger(struct iio_dev *indio_dev)
+ goto error_free_irq;
+
+ /* select default trigger */
+- indio_dev->trig = sigma_delta->trig;
++ indio_dev->trig = iio_trigger_get(sigma_delta->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/adc/at91_adc.c b/drivers/iio/adc/at91_adc.c
+index 2b6a9ce9927c..f508bd6b46e3 100644
+--- a/drivers/iio/adc/at91_adc.c
++++ b/drivers/iio/adc/at91_adc.c
+@@ -196,6 +196,7 @@ struct at91_adc_state {
+ bool done;
+ int irq;
+ u16 last_value;
++ int chnb;
+ struct mutex lock;
+ u8 num_channels;
+ void __iomem *reg_base;
+@@ -274,7 +275,7 @@ void handle_adc_eoc_trigger(int irq, struct iio_dev *idev)
+ disable_irq_nosync(irq);
+ iio_trigger_poll(idev->trig, iio_get_time_ns());
+ } else {
+- st->last_value = at91_adc_readl(st, AT91_ADC_LCDR);
++ st->last_value = at91_adc_readl(st, AT91_ADC_CHAN(st, st->chnb));
+ st->done = true;
+ wake_up_interruptible(&st->wq_data_avail);
+ }
+@@ -351,7 +352,7 @@ static irqreturn_t at91_adc_rl_interrupt(int irq, void *private)
+ unsigned int reg;
+
+ status &= at91_adc_readl(st, AT91_ADC_IMR);
+- if (status & st->registers->drdy_mask)
++ if (status & GENMASK(st->num_channels - 1, 0))
+ handle_adc_eoc_trigger(irq, idev);
+
+ if (status & AT91RL_ADC_IER_PEN) {
+@@ -418,7 +419,7 @@ static irqreturn_t at91_adc_9x5_interrupt(int irq, void *private)
+ AT91_ADC_IER_YRDY |
+ AT91_ADC_IER_PRDY;
+
+- if (status & st->registers->drdy_mask)
++ if (status & GENMASK(st->num_channels - 1, 0))
+ handle_adc_eoc_trigger(irq, idev);
+
+ if (status & AT91_ADC_IER_PEN) {
+@@ -689,9 +690,10 @@ static int at91_adc_read_raw(struct iio_dev *idev,
+ case IIO_CHAN_INFO_RAW:
+ mutex_lock(&st->lock);
+
++ st->chnb = chan->channel;
+ at91_adc_writel(st, AT91_ADC_CHER,
+ AT91_ADC_CH(chan->channel));
+- at91_adc_writel(st, AT91_ADC_IER, st->registers->drdy_mask);
++ at91_adc_writel(st, AT91_ADC_IER, BIT(chan->channel));
+ at91_adc_writel(st, AT91_ADC_CR, AT91_ADC_START);
+
+ ret = wait_event_interruptible_timeout(st->wq_data_avail,
+@@ -708,7 +710,7 @@ static int at91_adc_read_raw(struct iio_dev *idev,
+
+ at91_adc_writel(st, AT91_ADC_CHDR,
+ AT91_ADC_CH(chan->channel));
+- at91_adc_writel(st, AT91_ADC_IDR, st->registers->drdy_mask);
++ at91_adc_writel(st, AT91_ADC_IDR, BIT(chan->channel));
+
+ st->last_value = 0;
+ st->done = false;
+diff --git a/drivers/iio/adc/xilinx-xadc-core.c b/drivers/iio/adc/xilinx-xadc-core.c
+index ab52be29141b..41d3a5efd62c 100644
+--- a/drivers/iio/adc/xilinx-xadc-core.c
++++ b/drivers/iio/adc/xilinx-xadc-core.c
+@@ -1126,7 +1126,7 @@ static int xadc_parse_dt(struct iio_dev *indio_dev, struct device_node *np,
+ chan->address = XADC_REG_VPVN;
+ } else {
+ chan->scan_index = 15 + reg;
+- chan->scan_index = XADC_REG_VAUX(reg - 1);
++ chan->address = XADC_REG_VAUX(reg - 1);
+ }
+ num_channels++;
+ chan++;
+diff --git a/drivers/iio/common/hid-sensors/hid-sensor-trigger.c b/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
+index a3109a6f4d86..92068cdbf8c7 100644
+--- a/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
++++ b/drivers/iio/common/hid-sensors/hid-sensor-trigger.c
+@@ -122,7 +122,8 @@ int hid_sensor_setup_trigger(struct iio_dev *indio_dev, const char *name,
+ dev_err(&indio_dev->dev, "Trigger Register Failed\n");
+ goto error_free_trig;
+ }
+- indio_dev->trig = attrb->trigger = trig;
++ attrb->trigger = trig;
++ indio_dev->trig = iio_trigger_get(trig);
+
+ return ret;
+
+diff --git a/drivers/iio/common/st_sensors/st_sensors_trigger.c b/drivers/iio/common/st_sensors/st_sensors_trigger.c
+index 8fc3a97eb266..8d8ca6f1e16a 100644
+--- a/drivers/iio/common/st_sensors/st_sensors_trigger.c
++++ b/drivers/iio/common/st_sensors/st_sensors_trigger.c
+@@ -49,7 +49,7 @@ int st_sensors_allocate_trigger(struct iio_dev *indio_dev,
+ dev_err(&indio_dev->dev, "failed to register iio trigger.\n");
+ goto iio_trigger_register_error;
+ }
+- indio_dev->trig = sdata->trig;
++ indio_dev->trig = iio_trigger_get(sdata->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/gyro/itg3200_buffer.c b/drivers/iio/gyro/itg3200_buffer.c
+index e3b3c5084070..eef50e91f17c 100644
+--- a/drivers/iio/gyro/itg3200_buffer.c
++++ b/drivers/iio/gyro/itg3200_buffer.c
+@@ -132,7 +132,7 @@ int itg3200_probe_trigger(struct iio_dev *indio_dev)
+ goto error_free_irq;
+
+ /* select default trigger */
+- indio_dev->trig = st->trig;
++ indio_dev->trig = iio_trigger_get(st->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c b/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
+index 03b9372c1212..926fccea8de0 100644
+--- a/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
++++ b/drivers/iio/imu/inv_mpu6050/inv_mpu_trigger.c
+@@ -135,7 +135,7 @@ int inv_mpu6050_probe_trigger(struct iio_dev *indio_dev)
+ ret = iio_trigger_register(st->trig);
+ if (ret)
+ goto error_free_irq;
+- indio_dev->trig = st->trig;
++ indio_dev->trig = iio_trigger_get(st->trig);
+
+ return 0;
+
+diff --git a/drivers/iio/inkern.c b/drivers/iio/inkern.c
+index c7497009d60a..f0846108d006 100644
+--- a/drivers/iio/inkern.c
++++ b/drivers/iio/inkern.c
+@@ -178,7 +178,7 @@ static struct iio_channel *of_iio_channel_get_by_name(struct device_node *np,
+ index = of_property_match_string(np, "io-channel-names",
+ name);
+ chan = of_iio_channel_get(np, index);
+- if (!IS_ERR(chan))
++ if (!IS_ERR(chan) || PTR_ERR(chan) == -EPROBE_DEFER)
+ break;
+ else if (name && index >= 0) {
+ pr_err("ERROR: could not get IIO channel %s:%s(%i)\n",
+diff --git a/drivers/iio/magnetometer/st_magn_core.c b/drivers/iio/magnetometer/st_magn_core.c
+index 240a21dd0c61..4d55151893af 100644
+--- a/drivers/iio/magnetometer/st_magn_core.c
++++ b/drivers/iio/magnetometer/st_magn_core.c
+@@ -42,7 +42,8 @@
+ #define ST_MAGN_FS_AVL_5600MG 5600
+ #define ST_MAGN_FS_AVL_8000MG 8000
+ #define ST_MAGN_FS_AVL_8100MG 8100
+-#define ST_MAGN_FS_AVL_10000MG 10000
++#define ST_MAGN_FS_AVL_12000MG 12000
++#define ST_MAGN_FS_AVL_16000MG 16000
+
+ /* CUSTOM VALUES FOR SENSOR 1 */
+ #define ST_MAGN_1_WAI_EXP 0x3c
+@@ -69,20 +70,20 @@
+ #define ST_MAGN_1_FS_AVL_4700_VAL 0x05
+ #define ST_MAGN_1_FS_AVL_5600_VAL 0x06
+ #define ST_MAGN_1_FS_AVL_8100_VAL 0x07
+-#define ST_MAGN_1_FS_AVL_1300_GAIN_XY 1100
+-#define ST_MAGN_1_FS_AVL_1900_GAIN_XY 855
+-#define ST_MAGN_1_FS_AVL_2500_GAIN_XY 670
+-#define ST_MAGN_1_FS_AVL_4000_GAIN_XY 450
+-#define ST_MAGN_1_FS_AVL_4700_GAIN_XY 400
+-#define ST_MAGN_1_FS_AVL_5600_GAIN_XY 330
+-#define ST_MAGN_1_FS_AVL_8100_GAIN_XY 230
+-#define ST_MAGN_1_FS_AVL_1300_GAIN_Z 980
+-#define ST_MAGN_1_FS_AVL_1900_GAIN_Z 760
+-#define ST_MAGN_1_FS_AVL_2500_GAIN_Z 600
+-#define ST_MAGN_1_FS_AVL_4000_GAIN_Z 400
+-#define ST_MAGN_1_FS_AVL_4700_GAIN_Z 355
+-#define ST_MAGN_1_FS_AVL_5600_GAIN_Z 295
+-#define ST_MAGN_1_FS_AVL_8100_GAIN_Z 205
++#define ST_MAGN_1_FS_AVL_1300_GAIN_XY 909
++#define ST_MAGN_1_FS_AVL_1900_GAIN_XY 1169
++#define ST_MAGN_1_FS_AVL_2500_GAIN_XY 1492
++#define ST_MAGN_1_FS_AVL_4000_GAIN_XY 2222
++#define ST_MAGN_1_FS_AVL_4700_GAIN_XY 2500
++#define ST_MAGN_1_FS_AVL_5600_GAIN_XY 3030
++#define ST_MAGN_1_FS_AVL_8100_GAIN_XY 4347
++#define ST_MAGN_1_FS_AVL_1300_GAIN_Z 1020
++#define ST_MAGN_1_FS_AVL_1900_GAIN_Z 1315
++#define ST_MAGN_1_FS_AVL_2500_GAIN_Z 1666
++#define ST_MAGN_1_FS_AVL_4000_GAIN_Z 2500
++#define ST_MAGN_1_FS_AVL_4700_GAIN_Z 2816
++#define ST_MAGN_1_FS_AVL_5600_GAIN_Z 3389
++#define ST_MAGN_1_FS_AVL_8100_GAIN_Z 4878
+ #define ST_MAGN_1_MULTIREAD_BIT false
+
+ /* CUSTOM VALUES FOR SENSOR 2 */
+@@ -105,10 +106,12 @@
+ #define ST_MAGN_2_FS_MASK 0x60
+ #define ST_MAGN_2_FS_AVL_4000_VAL 0x00
+ #define ST_MAGN_2_FS_AVL_8000_VAL 0x01
+-#define ST_MAGN_2_FS_AVL_10000_VAL 0x02
+-#define ST_MAGN_2_FS_AVL_4000_GAIN 430
+-#define ST_MAGN_2_FS_AVL_8000_GAIN 230
+-#define ST_MAGN_2_FS_AVL_10000_GAIN 230
++#define ST_MAGN_2_FS_AVL_12000_VAL 0x02
++#define ST_MAGN_2_FS_AVL_16000_VAL 0x03
++#define ST_MAGN_2_FS_AVL_4000_GAIN 146
++#define ST_MAGN_2_FS_AVL_8000_GAIN 292
++#define ST_MAGN_2_FS_AVL_12000_GAIN 438
++#define ST_MAGN_2_FS_AVL_16000_GAIN 584
+ #define ST_MAGN_2_MULTIREAD_BIT false
+ #define ST_MAGN_2_OUT_X_L_ADDR 0x28
+ #define ST_MAGN_2_OUT_Y_L_ADDR 0x2a
+@@ -266,9 +269,14 @@ static const struct st_sensors st_magn_sensors[] = {
+ .gain = ST_MAGN_2_FS_AVL_8000_GAIN,
+ },
+ [2] = {
+- .num = ST_MAGN_FS_AVL_10000MG,
+- .value = ST_MAGN_2_FS_AVL_10000_VAL,
+- .gain = ST_MAGN_2_FS_AVL_10000_GAIN,
++ .num = ST_MAGN_FS_AVL_12000MG,
++ .value = ST_MAGN_2_FS_AVL_12000_VAL,
++ .gain = ST_MAGN_2_FS_AVL_12000_GAIN,
++ },
++ [3] = {
++ .num = ST_MAGN_FS_AVL_16000MG,
++ .value = ST_MAGN_2_FS_AVL_16000_VAL,
++ .gain = ST_MAGN_2_FS_AVL_16000_GAIN,
+ },
+ },
+ },
+diff --git a/drivers/infiniband/core/uverbs_marshall.c b/drivers/infiniband/core/uverbs_marshall.c
+index e7bee46868d1..abd97247443e 100644
+--- a/drivers/infiniband/core/uverbs_marshall.c
++++ b/drivers/infiniband/core/uverbs_marshall.c
+@@ -140,5 +140,9 @@ void ib_copy_path_rec_from_user(struct ib_sa_path_rec *dst,
+ dst->packet_life_time = src->packet_life_time;
+ dst->preference = src->preference;
+ dst->packet_life_time_selector = src->packet_life_time_selector;
++
++ memset(dst->smac, 0, sizeof(dst->smac));
++ memset(dst->dmac, 0, sizeof(dst->dmac));
++ dst->vlan_id = 0xffff;
+ }
+ EXPORT_SYMBOL(ib_copy_path_rec_from_user);
+diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
+index 0f7027e7db13..91eeb5edff80 100644
+--- a/drivers/infiniband/hw/mlx4/main.c
++++ b/drivers/infiniband/hw/mlx4/main.c
+@@ -1678,6 +1678,7 @@ static void mlx4_ib_get_dev_addr(struct net_device *dev,
+ struct inet6_dev *in6_dev;
+ union ib_gid *pgid;
+ struct inet6_ifaddr *ifp;
++ union ib_gid default_gid;
+ #endif
+ union ib_gid gid;
+
+@@ -1698,12 +1699,15 @@ static void mlx4_ib_get_dev_addr(struct net_device *dev,
+ in_dev_put(in_dev);
+ }
+ #if IS_ENABLED(CONFIG_IPV6)
++ mlx4_make_default_gid(dev, &default_gid);
+ /* IPv6 gids */
+ in6_dev = in6_dev_get(dev);
+ if (in6_dev) {
+ read_lock_bh(&in6_dev->lock);
+ list_for_each_entry(ifp, &in6_dev->addr_list, if_list) {
+ pgid = (union ib_gid *)&ifp->addr;
++ if (!memcmp(pgid, &default_gid, sizeof(*pgid)))
++ continue;
+ update_gid_table(ibdev, port, pgid, 0, 0);
+ }
+ read_unlock_bh(&in6_dev->lock);
+@@ -1788,31 +1792,34 @@ static void mlx4_ib_scan_netdevs(struct mlx4_ib_dev *ibdev,
+ port_state = (netif_running(curr_netdev) && netif_carrier_ok(curr_netdev)) ?
+ IB_PORT_ACTIVE : IB_PORT_DOWN;
+ mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- } else {
+- reset_gid_table(ibdev, port);
+- }
+- /* if using bonding/team and a slave port is down, we don't the bond IP
+- * based gids in the table since flows that select port by gid may get
+- * the down port.
+- */
+- if (curr_master && (port_state == IB_PORT_DOWN)) {
+- reset_gid_table(ibdev, port);
+- mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- }
+- /* if bonding is used it is possible that we add it to masters
+- * only after IP address is assigned to the net bonding
+- * interface.
+- */
+- if (curr_master && (old_master != curr_master)) {
+- reset_gid_table(ibdev, port);
+- mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- mlx4_ib_get_dev_addr(curr_master, ibdev, port);
+- }
++ /* if using bonding/team and a slave port is down, we
++ * don't the bond IP based gids in the table since
++ * flows that select port by gid may get the down port.
++ */
++ if (curr_master && (port_state == IB_PORT_DOWN)) {
++ reset_gid_table(ibdev, port);
++ mlx4_ib_set_default_gid(ibdev,
++ curr_netdev, port);
++ }
++ /* if bonding is used it is possible that we add it to
++ * masters only after IP address is assigned to the
++ * net bonding interface.
++ */
++ if (curr_master && (old_master != curr_master)) {
++ reset_gid_table(ibdev, port);
++ mlx4_ib_set_default_gid(ibdev,
++ curr_netdev, port);
++ mlx4_ib_get_dev_addr(curr_master, ibdev, port);
++ }
+
+- if (!curr_master && (old_master != curr_master)) {
++ if (!curr_master && (old_master != curr_master)) {
++ reset_gid_table(ibdev, port);
++ mlx4_ib_set_default_gid(ibdev,
++ curr_netdev, port);
++ mlx4_ib_get_dev_addr(curr_netdev, ibdev, port);
++ }
++ } else {
+ reset_gid_table(ibdev, port);
+- mlx4_ib_set_default_gid(ibdev, curr_netdev, port);
+- mlx4_ib_get_dev_addr(curr_netdev, ibdev, port);
+ }
+ }
+
+diff --git a/drivers/infiniband/hw/qib/qib_debugfs.c b/drivers/infiniband/hw/qib/qib_debugfs.c
+index 799a0c3bffc4..6abd3ed3cd51 100644
+--- a/drivers/infiniband/hw/qib/qib_debugfs.c
++++ b/drivers/infiniband/hw/qib/qib_debugfs.c
+@@ -193,6 +193,7 @@ static void *_qp_stats_seq_start(struct seq_file *s, loff_t *pos)
+ struct qib_qp_iter *iter;
+ loff_t n = *pos;
+
++ rcu_read_lock();
+ iter = qib_qp_iter_init(s->private);
+ if (!iter)
+ return NULL;
+@@ -224,7 +225,7 @@ static void *_qp_stats_seq_next(struct seq_file *s, void *iter_ptr,
+
+ static void _qp_stats_seq_stop(struct seq_file *s, void *iter_ptr)
+ {
+- /* nothing for now */
++ rcu_read_unlock();
+ }
+
+ static int _qp_stats_seq_show(struct seq_file *s, void *iter_ptr)
+diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c
+index 7fcc150d603c..6ddc0264aad2 100644
+--- a/drivers/infiniband/hw/qib/qib_qp.c
++++ b/drivers/infiniband/hw/qib/qib_qp.c
+@@ -1325,7 +1325,6 @@ int qib_qp_iter_next(struct qib_qp_iter *iter)
+ struct qib_qp *pqp = iter->qp;
+ struct qib_qp *qp;
+
+- rcu_read_lock();
+ for (; n < dev->qp_table_size; n++) {
+ if (pqp)
+ qp = rcu_dereference(pqp->next);
+@@ -1333,18 +1332,11 @@ int qib_qp_iter_next(struct qib_qp_iter *iter)
+ qp = rcu_dereference(dev->qp_table[n]);
+ pqp = qp;
+ if (qp) {
+- if (iter->qp)
+- atomic_dec(&iter->qp->refcount);
+- atomic_inc(&qp->refcount);
+- rcu_read_unlock();
+ iter->qp = qp;
+ iter->n = n;
+ return 0;
+ }
+ }
+- rcu_read_unlock();
+- if (iter->qp)
+- atomic_dec(&iter->qp->refcount);
+ return ret;
+ }
+
+diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
+index d4c7928a0f36..9959cd1faad9 100644
+--- a/drivers/infiniband/ulp/isert/ib_isert.c
++++ b/drivers/infiniband/ulp/isert/ib_isert.c
+@@ -586,7 +586,6 @@ isert_connect_request(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
+ init_completion(&isert_conn->conn_wait);
+ init_completion(&isert_conn->conn_wait_comp_err);
+ kref_init(&isert_conn->conn_kref);
+- kref_get(&isert_conn->conn_kref);
+ mutex_init(&isert_conn->conn_mutex);
+ spin_lock_init(&isert_conn->conn_lock);
+ INIT_LIST_HEAD(&isert_conn->conn_fr_pool);
+@@ -746,7 +745,9 @@ isert_connect_release(struct isert_conn *isert_conn)
+ static void
+ isert_connected_handler(struct rdma_cm_id *cma_id)
+ {
+- return;
++ struct isert_conn *isert_conn = cma_id->context;
++
++ kref_get(&isert_conn->conn_kref);
+ }
+
+ static void
+@@ -798,7 +799,6 @@ isert_disconnect_work(struct work_struct *work)
+
+ wake_up:
+ complete(&isert_conn->conn_wait);
+- isert_put_conn(isert_conn);
+ }
+
+ static void
+@@ -3234,6 +3234,7 @@ static void isert_wait_conn(struct iscsi_conn *conn)
+ wait_for_completion(&isert_conn->conn_wait_comp_err);
+
+ wait_for_completion(&isert_conn->conn_wait);
++ isert_put_conn(isert_conn);
+ }
+
+ static void isert_free_conn(struct iscsi_conn *conn)
+diff --git a/drivers/input/keyboard/atkbd.c b/drivers/input/keyboard/atkbd.c
+index 2dd1d0dd4f7d..6f5d79569136 100644
+--- a/drivers/input/keyboard/atkbd.c
++++ b/drivers/input/keyboard/atkbd.c
+@@ -1791,14 +1791,6 @@ static const struct dmi_system_id atkbd_dmi_quirk_table[] __initconst = {
+ {
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "LG Electronics"),
+- DMI_MATCH(DMI_PRODUCT_NAME, "LW25-B7HV"),
+- },
+- .callback = atkbd_deactivate_fixup,
+- },
+- {
+- .matches = {
+- DMI_MATCH(DMI_SYS_VENDOR, "LG Electronics"),
+- DMI_MATCH(DMI_PRODUCT_NAME, "P1-J273B"),
+ },
+ .callback = atkbd_deactivate_fixup,
+ },
+diff --git a/drivers/input/mouse/elantech.c b/drivers/input/mouse/elantech.c
+index ee2a04d90d20..0ec186d256fb 100644
+--- a/drivers/input/mouse/elantech.c
++++ b/drivers/input/mouse/elantech.c
+@@ -1253,6 +1253,13 @@ static bool elantech_is_signature_valid(const unsigned char *param)
+ if (param[1] == 0)
+ return true;
+
++ /*
++ * Some models have a revision higher than 20. Meaning param[2] may
++ * be 10 or 20, skip the rates check for these.
++ */
++ if (param[0] == 0x46 && (param[1] & 0xef) == 0x0f && param[2] < 40)
++ return true;
++
+ for (i = 0; i < ARRAY_SIZE(rates); i++)
+ if (param[2] == rates[i])
+ return false;
+diff --git a/drivers/input/mouse/synaptics.c b/drivers/input/mouse/synaptics.c
+index ef9e0b8a9aa7..a50a2a7a43f7 100644
+--- a/drivers/input/mouse/synaptics.c
++++ b/drivers/input/mouse/synaptics.c
+@@ -626,10 +626,61 @@ static int synaptics_parse_hw_state(const unsigned char buf[],
+ ((buf[0] & 0x04) >> 1) |
+ ((buf[3] & 0x04) >> 2));
+
++ if ((SYN_CAP_ADV_GESTURE(priv->ext_cap_0c) ||
++ SYN_CAP_IMAGE_SENSOR(priv->ext_cap_0c)) &&
++ hw->w == 2) {
++ synaptics_parse_agm(buf, priv, hw);
++ return 1;
++ }
++
++ hw->x = (((buf[3] & 0x10) << 8) |
++ ((buf[1] & 0x0f) << 8) |
++ buf[4]);
++ hw->y = (((buf[3] & 0x20) << 7) |
++ ((buf[1] & 0xf0) << 4) |
++ buf[5]);
++ hw->z = buf[2];
++
+ hw->left = (buf[0] & 0x01) ? 1 : 0;
+ hw->right = (buf[0] & 0x02) ? 1 : 0;
+
+- if (SYN_CAP_CLICKPAD(priv->ext_cap_0c)) {
++ if (SYN_CAP_FORCEPAD(priv->ext_cap_0c)) {
++ /*
++ * ForcePads, like Clickpads, use middle button
++ * bits to report primary button clicks.
++ * Unfortunately they report primary button not
++ * only when user presses on the pad above certain
++ * threshold, but also when there are more than one
++ * finger on the touchpad, which interferes with
++ * our multi-finger gestures.
++ */
++ if (hw->z == 0) {
++ /* No contacts */
++ priv->press = priv->report_press = false;
++ } else if (hw->w >= 4 && ((buf[0] ^ buf[3]) & 0x01)) {
++ /*
++ * Single-finger touch with pressure above
++ * the threshold. If pressure stays long
++ * enough, we'll start reporting primary
++ * button. We rely on the device continuing
++ * sending data even if finger does not
++ * move.
++ */
++ if (!priv->press) {
++ priv->press_start = jiffies;
++ priv->press = true;
++ } else if (time_after(jiffies,
++ priv->press_start +
++ msecs_to_jiffies(50))) {
++ priv->report_press = true;
++ }
++ } else {
++ priv->press = false;
++ }
++
++ hw->left = priv->report_press;
++
++ } else if (SYN_CAP_CLICKPAD(priv->ext_cap_0c)) {
+ /*
+ * Clickpad's button is transmitted as middle button,
+ * however, since it is primary button, we will report
+@@ -648,21 +699,6 @@ static int synaptics_parse_hw_state(const unsigned char buf[],
+ hw->down = ((buf[0] ^ buf[3]) & 0x02) ? 1 : 0;
+ }
+
+- if ((SYN_CAP_ADV_GESTURE(priv->ext_cap_0c) ||
+- SYN_CAP_IMAGE_SENSOR(priv->ext_cap_0c)) &&
+- hw->w == 2) {
+- synaptics_parse_agm(buf, priv, hw);
+- return 1;
+- }
+-
+- hw->x = (((buf[3] & 0x10) << 8) |
+- ((buf[1] & 0x0f) << 8) |
+- buf[4]);
+- hw->y = (((buf[3] & 0x20) << 7) |
+- ((buf[1] & 0xf0) << 4) |
+- buf[5]);
+- hw->z = buf[2];
+-
+ if (SYN_CAP_MULTI_BUTTON_NO(priv->ext_cap) &&
+ ((buf[0] ^ buf[3]) & 0x02)) {
+ switch (SYN_CAP_MULTI_BUTTON_NO(priv->ext_cap) & ~0x01) {
+diff --git a/drivers/input/mouse/synaptics.h b/drivers/input/mouse/synaptics.h
+index e594af0b264b..fb2e076738ae 100644
+--- a/drivers/input/mouse/synaptics.h
++++ b/drivers/input/mouse/synaptics.h
+@@ -78,6 +78,11 @@
+ * 2 0x08 image sensor image sensor tracks 5 fingers, but only
+ * reports 2.
+ * 2 0x20 report min query 0x0f gives min coord reported
++ * 2 0x80 forcepad forcepad is a variant of clickpad that
++ * does not have physical buttons but rather
++ * uses pressure above a certain threshold to
++ * report primary clicks. Forcepads also have
++ * clickpad bit set.
+ */
+ #define SYN_CAP_CLICKPAD(ex0c) ((ex0c) & 0x100000) /* 1-button ClickPad */
+ #define SYN_CAP_CLICKPAD2BTN(ex0c) ((ex0c) & 0x000100) /* 2-button ClickPad */
+@@ -86,6 +91,7 @@
+ #define SYN_CAP_ADV_GESTURE(ex0c) ((ex0c) & 0x080000)
+ #define SYN_CAP_REDUCED_FILTERING(ex0c) ((ex0c) & 0x000400)
+ #define SYN_CAP_IMAGE_SENSOR(ex0c) ((ex0c) & 0x000800)
++#define SYN_CAP_FORCEPAD(ex0c) ((ex0c) & 0x008000)
+
+ /* synaptics modes query bits */
+ #define SYN_MODE_ABSOLUTE(m) ((m) & (1 << 7))
+@@ -177,6 +183,11 @@ struct synaptics_data {
+ */
+ struct synaptics_hw_state agm;
+ bool agm_pending; /* new AGM packet received */
++
++ /* ForcePad handling */
++ unsigned long press_start;
++ bool press;
++ bool report_press;
+ };
+
+ void synaptics_module_init(void);
+diff --git a/drivers/input/serio/i8042-x86ia64io.h b/drivers/input/serio/i8042-x86ia64io.h
+index 136b7b204f56..713e3ddb43bd 100644
+--- a/drivers/input/serio/i8042-x86ia64io.h
++++ b/drivers/input/serio/i8042-x86ia64io.h
+@@ -465,6 +465,13 @@ static const struct dmi_system_id __initconst i8042_dmi_nomux_table[] = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP Pavilion dv4 Notebook PC"),
+ },
+ },
++ {
++ /* Avatar AVIU-145A6 */
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "Intel"),
++ DMI_MATCH(DMI_PRODUCT_NAME, "IC4I"),
++ },
++ },
+ { }
+ };
+
+@@ -608,6 +615,14 @@ static const struct dmi_system_id __initconst i8042_dmi_notimeout_table[] = {
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP Pavilion dv4 Notebook PC"),
+ },
+ },
++ {
++ /* Fujitsu U574 laptop */
++ /* https://bugzilla.kernel.org/show_bug.cgi?id=69731 */
++ .matches = {
++ DMI_MATCH(DMI_SYS_VENDOR, "FUJITSU"),
++ DMI_MATCH(DMI_PRODUCT_NAME, "LIFEBOOK U574"),
++ },
++ },
+ { }
+ };
+
+diff --git a/drivers/input/serio/serport.c b/drivers/input/serio/serport.c
+index 0cb7ef59071b..69175b825346 100644
+--- a/drivers/input/serio/serport.c
++++ b/drivers/input/serio/serport.c
+@@ -21,6 +21,7 @@
+ #include <linux/init.h>
+ #include <linux/serio.h>
+ #include <linux/tty.h>
++#include <linux/compat.h>
+
+ MODULE_AUTHOR("Vojtech Pavlik <vojtech@ucw.cz>");
+ MODULE_DESCRIPTION("Input device TTY line discipline");
+@@ -198,28 +199,55 @@ static ssize_t serport_ldisc_read(struct tty_struct * tty, struct file * file, u
+ return 0;
+ }
+
++static void serport_set_type(struct tty_struct *tty, unsigned long type)
++{
++ struct serport *serport = tty->disc_data;
++
++ serport->id.proto = type & 0x000000ff;
++ serport->id.id = (type & 0x0000ff00) >> 8;
++ serport->id.extra = (type & 0x00ff0000) >> 16;
++}
++
+ /*
+ * serport_ldisc_ioctl() allows to set the port protocol, and device ID
+ */
+
+-static int serport_ldisc_ioctl(struct tty_struct * tty, struct file * file, unsigned int cmd, unsigned long arg)
++static int serport_ldisc_ioctl(struct tty_struct *tty, struct file *file,
++ unsigned int cmd, unsigned long arg)
+ {
+- struct serport *serport = (struct serport*) tty->disc_data;
+- unsigned long type;
+-
+ if (cmd == SPIOCSTYPE) {
++ unsigned long type;
++
+ if (get_user(type, (unsigned long __user *) arg))
+ return -EFAULT;
+
+- serport->id.proto = type & 0x000000ff;
+- serport->id.id = (type & 0x0000ff00) >> 8;
+- serport->id.extra = (type & 0x00ff0000) >> 16;
++ serport_set_type(tty, type);
++ return 0;
++ }
++
++ return -EINVAL;
++}
++
++#ifdef CONFIG_COMPAT
++#define COMPAT_SPIOCSTYPE _IOW('q', 0x01, compat_ulong_t)
++static long serport_ldisc_compat_ioctl(struct tty_struct *tty,
++ struct file *file,
++ unsigned int cmd, unsigned long arg)
++{
++ if (cmd == COMPAT_SPIOCSTYPE) {
++ void __user *uarg = compat_ptr(arg);
++ compat_ulong_t compat_type;
++
++ if (get_user(compat_type, (compat_ulong_t __user *)uarg))
++ return -EFAULT;
+
++ serport_set_type(tty, compat_type);
+ return 0;
+ }
+
+ return -EINVAL;
+ }
++#endif
+
+ static void serport_ldisc_write_wakeup(struct tty_struct * tty)
+ {
+@@ -243,6 +271,9 @@ static struct tty_ldisc_ops serport_ldisc = {
+ .close = serport_ldisc_close,
+ .read = serport_ldisc_read,
+ .ioctl = serport_ldisc_ioctl,
++#ifdef CONFIG_COMPAT
++ .compat_ioctl = serport_ldisc_compat_ioctl,
++#endif
+ .receive_buf = serport_ldisc_receive,
+ .write_wakeup = serport_ldisc_write_wakeup
+ };
+diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
+index 1599354e974d..9a35baf1caed 100644
+--- a/drivers/iommu/arm-smmu.c
++++ b/drivers/iommu/arm-smmu.c
+@@ -830,8 +830,11 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain)
+ reg |= TTBCR_EAE |
+ (TTBCR_SH_IS << TTBCR_SH0_SHIFT) |
+ (TTBCR_RGN_WBWA << TTBCR_ORGN0_SHIFT) |
+- (TTBCR_RGN_WBWA << TTBCR_IRGN0_SHIFT) |
+- (TTBCR_SL0_LVL_1 << TTBCR_SL0_SHIFT);
++ (TTBCR_RGN_WBWA << TTBCR_IRGN0_SHIFT);
++
++ if (!stage1)
++ reg |= (TTBCR_SL0_LVL_1 << TTBCR_SL0_SHIFT);
++
+ writel_relaxed(reg, cb_base + ARM_SMMU_CB_TTBCR);
+
+ /* MAIR0 (stage-1 only) */
+diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
+index 9a4f05e5b23f..55f1515d54c9 100644
+--- a/drivers/iommu/dmar.c
++++ b/drivers/iommu/dmar.c
+@@ -677,8 +677,7 @@ static int __init dmar_acpi_dev_scope_init(void)
+ andd->object_name);
+ continue;
+ }
+- acpi_bus_get_device(h, &adev);
+- if (!adev) {
++ if (acpi_bus_get_device(h, &adev)) {
+ pr_err("Failed to get device for ACPI object %s\n",
+ andd->object_name);
+ continue;
+diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
+index af47648301a9..87f94d597e6e 100644
+--- a/drivers/iommu/fsl_pamu_domain.c
++++ b/drivers/iommu/fsl_pamu_domain.c
+@@ -1048,7 +1048,7 @@ static int fsl_pamu_add_device(struct device *dev)
+ struct iommu_group *group = ERR_PTR(-ENODEV);
+ struct pci_dev *pdev;
+ const u32 *prop;
+- int ret, len;
++ int ret = 0, len;
+
+ /*
+ * For platform devices we allocate a separate group for
+@@ -1071,7 +1071,13 @@ static int fsl_pamu_add_device(struct device *dev)
+ if (IS_ERR(group))
+ return PTR_ERR(group);
+
+- ret = iommu_group_add_device(group, dev);
++ /*
++ * Check if device has already been added to an iommu group.
++ * Group could have already been created for a PCI device in
++ * the iommu_group_get_for_dev path.
++ */
++ if (!dev->iommu_group)
++ ret = iommu_group_add_device(group, dev);
+
+ iommu_group_put(group);
+ return ret;
+diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
+index 2c63326638b6..c892e48655c2 100644
+--- a/drivers/md/dm-cache-target.c
++++ b/drivers/md/dm-cache-target.c
+@@ -873,8 +873,8 @@ static void migration_success_pre_commit(struct dm_cache_migration *mg)
+ struct cache *cache = mg->cache;
+
+ if (mg->writeback) {
+- cell_defer(cache, mg->old_ocell, false);
+ clear_dirty(cache, mg->old_oblock, mg->cblock);
++ cell_defer(cache, mg->old_ocell, false);
+ cleanup_migration(mg);
+ return;
+
+@@ -929,13 +929,13 @@ static void migration_success_post_commit(struct dm_cache_migration *mg)
+ }
+
+ } else {
++ clear_dirty(cache, mg->new_oblock, mg->cblock);
+ if (mg->requeue_holder)
+ cell_defer(cache, mg->new_ocell, true);
+ else {
+ bio_endio(mg->new_ocell->holder, 0);
+ cell_defer(cache, mg->new_ocell, false);
+ }
+- clear_dirty(cache, mg->new_oblock, mg->cblock);
+ cleanup_migration(mg);
+ }
+ }
+diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
+index 4cba2d808afb..3e6ef4b1fb46 100644
+--- a/drivers/md/dm-crypt.c
++++ b/drivers/md/dm-crypt.c
+@@ -1681,6 +1681,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+ unsigned int key_size, opt_params;
+ unsigned long long tmpll;
+ int ret;
++ size_t iv_size_padding;
+ struct dm_arg_set as;
+ const char *opt_string;
+ char dummy;
+@@ -1717,12 +1718,23 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+
+ cc->dmreq_start = sizeof(struct ablkcipher_request);
+ cc->dmreq_start += crypto_ablkcipher_reqsize(any_tfm(cc));
+- cc->dmreq_start = ALIGN(cc->dmreq_start, crypto_tfm_ctx_alignment());
+- cc->dmreq_start += crypto_ablkcipher_alignmask(any_tfm(cc)) &
+- ~(crypto_tfm_ctx_alignment() - 1);
++ cc->dmreq_start = ALIGN(cc->dmreq_start, __alignof__(struct dm_crypt_request));
++
++ if (crypto_ablkcipher_alignmask(any_tfm(cc)) < CRYPTO_MINALIGN) {
++ /* Allocate the padding exactly */
++ iv_size_padding = -(cc->dmreq_start + sizeof(struct dm_crypt_request))
++ & crypto_ablkcipher_alignmask(any_tfm(cc));
++ } else {
++ /*
++ * If the cipher requires greater alignment than kmalloc
++ * alignment, we don't know the exact position of the
++ * initialization vector. We must assume worst case.
++ */
++ iv_size_padding = crypto_ablkcipher_alignmask(any_tfm(cc));
++ }
+
+ cc->req_pool = mempool_create_kmalloc_pool(MIN_IOS, cc->dmreq_start +
+- sizeof(struct dm_crypt_request) + cc->iv_size);
++ sizeof(struct dm_crypt_request) + iv_size_padding + cc->iv_size);
+ if (!cc->req_pool) {
+ ti->error = "Cannot allocate crypt request mempool";
+ goto bad;
+diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
+index d7690f86fdb9..55de4f6f7eaf 100644
+--- a/drivers/md/raid1.c
++++ b/drivers/md/raid1.c
+@@ -540,11 +540,7 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
+ has_nonrot_disk = 0;
+ choose_next_idle = 0;
+
+- if (conf->mddev->recovery_cp < MaxSector &&
+- (this_sector + sectors >= conf->next_resync))
+- choose_first = 1;
+- else
+- choose_first = 0;
++ choose_first = (conf->mddev->recovery_cp < this_sector + sectors);
+
+ for (disk = 0 ; disk < conf->raid_disks * 2 ; disk++) {
+ sector_t dist;
+@@ -831,7 +827,7 @@ static void flush_pending_writes(struct r1conf *conf)
+ * there is no normal IO happeing. It must arrange to call
+ * lower_barrier when the particular background IO completes.
+ */
+-static void raise_barrier(struct r1conf *conf)
++static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
+ {
+ spin_lock_irq(&conf->resync_lock);
+
+@@ -841,6 +837,7 @@ static void raise_barrier(struct r1conf *conf)
+
+ /* block any new IO from starting */
+ conf->barrier++;
++ conf->next_resync = sector_nr;
+
+ /* For these conditions we must wait:
+ * A: while the array is in frozen state
+@@ -849,14 +846,17 @@ static void raise_barrier(struct r1conf *conf)
+ * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
+ * next resync will reach to the window which normal bios are
+ * handling.
++ * D: while there are any active requests in the current window.
+ */
+ wait_event_lock_irq(conf->wait_barrier,
+ !conf->array_frozen &&
+ conf->barrier < RESYNC_DEPTH &&
++ conf->current_window_requests == 0 &&
+ (conf->start_next_window >=
+ conf->next_resync + RESYNC_SECTORS),
+ conf->resync_lock);
+
++ conf->nr_pending++;
+ spin_unlock_irq(&conf->resync_lock);
+ }
+
+@@ -866,6 +866,7 @@ static void lower_barrier(struct r1conf *conf)
+ BUG_ON(conf->barrier <= 0);
+ spin_lock_irqsave(&conf->resync_lock, flags);
+ conf->barrier--;
++ conf->nr_pending--;
+ spin_unlock_irqrestore(&conf->resync_lock, flags);
+ wake_up(&conf->wait_barrier);
+ }
+@@ -877,12 +878,10 @@ static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+ if (conf->array_frozen || !bio)
+ wait = true;
+ else if (conf->barrier && bio_data_dir(bio) == WRITE) {
+- if (conf->next_resync < RESYNC_WINDOW_SECTORS)
+- wait = true;
+- else if ((conf->next_resync - RESYNC_WINDOW_SECTORS
+- >= bio_end_sector(bio)) ||
+- (conf->next_resync + NEXT_NORMALIO_DISTANCE
+- <= bio->bi_iter.bi_sector))
++ if ((conf->mddev->curr_resync_completed
++ >= bio_end_sector(bio)) ||
++ (conf->next_resync + NEXT_NORMALIO_DISTANCE
++ <= bio->bi_iter.bi_sector))
+ wait = false;
+ else
+ wait = true;
+@@ -919,8 +918,8 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+ }
+
+ if (bio && bio_data_dir(bio) == WRITE) {
+- if (conf->next_resync + NEXT_NORMALIO_DISTANCE
+- <= bio->bi_iter.bi_sector) {
++ if (bio->bi_iter.bi_sector >=
++ conf->mddev->curr_resync_completed) {
+ if (conf->start_next_window == MaxSector)
+ conf->start_next_window =
+ conf->next_resync +
+@@ -1186,6 +1185,7 @@ read_again:
+ atomic_read(&bitmap->behind_writes) == 0);
+ }
+ r1_bio->read_disk = rdisk;
++ r1_bio->start_next_window = 0;
+
+ read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+ bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
+@@ -1548,8 +1548,13 @@ static void close_sync(struct r1conf *conf)
+ mempool_destroy(conf->r1buf_pool);
+ conf->r1buf_pool = NULL;
+
++ spin_lock_irq(&conf->resync_lock);
+ conf->next_resync = 0;
+ conf->start_next_window = MaxSector;
++ conf->current_window_requests +=
++ conf->next_window_requests;
++ conf->next_window_requests = 0;
++ spin_unlock_irq(&conf->resync_lock);
+ }
+
+ static int raid1_spare_active(struct mddev *mddev)
+@@ -2150,7 +2155,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
+ d--;
+ rdev = conf->mirrors[d].rdev;
+ if (rdev &&
+- test_bit(In_sync, &rdev->flags))
++ !test_bit(Faulty, &rdev->flags))
+ r1_sync_page_io(rdev, sect, s,
+ conf->tmppage, WRITE);
+ }
+@@ -2162,7 +2167,7 @@ static void fix_read_error(struct r1conf *conf, int read_disk,
+ d--;
+ rdev = conf->mirrors[d].rdev;
+ if (rdev &&
+- test_bit(In_sync, &rdev->flags)) {
++ !test_bit(Faulty, &rdev->flags)) {
+ if (r1_sync_page_io(rdev, sect, s,
+ conf->tmppage, READ)) {
+ atomic_add(s, &rdev->corrected_errors);
+@@ -2541,9 +2546,8 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int *skipp
+
+ bitmap_cond_end_sync(mddev->bitmap, sector_nr);
+ r1_bio = mempool_alloc(conf->r1buf_pool, GFP_NOIO);
+- raise_barrier(conf);
+
+- conf->next_resync = sector_nr;
++ raise_barrier(conf, sector_nr);
+
+ rcu_read_lock();
+ /*
+diff --git a/drivers/media/dvb-core/dvb-usb-ids.h b/drivers/media/dvb-core/dvb-usb-ids.h
+index 11d2bea23b02..26674e12133b 100644
+--- a/drivers/media/dvb-core/dvb-usb-ids.h
++++ b/drivers/media/dvb-core/dvb-usb-ids.h
+@@ -279,6 +279,8 @@
+ #define USB_PID_PCTV_400E 0x020f
+ #define USB_PID_PCTV_450E 0x0222
+ #define USB_PID_PCTV_452E 0x021f
++#define USB_PID_PCTV_78E 0x025a
++#define USB_PID_PCTV_79E 0x0262
+ #define USB_PID_REALTEK_RTL2831U 0x2831
+ #define USB_PID_REALTEK_RTL2832U 0x2832
+ #define USB_PID_TECHNOTREND_CONNECT_S2_3600 0x3007
+diff --git a/drivers/media/dvb-frontends/af9033.c b/drivers/media/dvb-frontends/af9033.c
+index be4bec2a9640..5c90ea683a7e 100644
+--- a/drivers/media/dvb-frontends/af9033.c
++++ b/drivers/media/dvb-frontends/af9033.c
+@@ -314,6 +314,19 @@ static int af9033_init(struct dvb_frontend *fe)
+ goto err;
+ }
+
++ /* feed clock to RF tuner */
++ switch (state->cfg.tuner) {
++ case AF9033_TUNER_IT9135_38:
++ case AF9033_TUNER_IT9135_51:
++ case AF9033_TUNER_IT9135_52:
++ case AF9033_TUNER_IT9135_60:
++ case AF9033_TUNER_IT9135_61:
++ case AF9033_TUNER_IT9135_62:
++ ret = af9033_wr_reg(state, 0x80fba8, 0x00);
++ if (ret < 0)
++ goto err;
++ }
++
+ /* settings for TS interface */
+ if (state->cfg.ts_mode == AF9033_TS_MODE_USB) {
+ ret = af9033_wr_reg_mask(state, 0x80f9a5, 0x00, 0x01);
+diff --git a/drivers/media/dvb-frontends/af9033_priv.h b/drivers/media/dvb-frontends/af9033_priv.h
+index fc2ad581e302..ded7b67d7526 100644
+--- a/drivers/media/dvb-frontends/af9033_priv.h
++++ b/drivers/media/dvb-frontends/af9033_priv.h
+@@ -1418,7 +1418,7 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x800068, 0x0a },
+ { 0x80006a, 0x03 },
+ { 0x800070, 0x0a },
+- { 0x800071, 0x05 },
++ { 0x800071, 0x0a },
+ { 0x800072, 0x02 },
+ { 0x800075, 0x8c },
+ { 0x800076, 0x8c },
+@@ -1484,7 +1484,6 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x800104, 0x02 },
+ { 0x800105, 0xbe },
+ { 0x800106, 0x00 },
+- { 0x800109, 0x02 },
+ { 0x800115, 0x0a },
+ { 0x800116, 0x03 },
+ { 0x80011a, 0xbe },
+@@ -1510,7 +1509,6 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x80014b, 0x8c },
+ { 0x80014d, 0xac },
+ { 0x80014e, 0xc6 },
+- { 0x80014f, 0x03 },
+ { 0x800151, 0x1e },
+ { 0x800153, 0xbc },
+ { 0x800178, 0x09 },
+@@ -1522,9 +1520,10 @@ static const struct reg_val tuner_init_it9135_60[] = {
+ { 0x80018d, 0x5f },
+ { 0x80018f, 0xa0 },
+ { 0x800190, 0x5a },
+- { 0x80ed02, 0xff },
+- { 0x80ee42, 0xff },
+- { 0x80ee82, 0xff },
++ { 0x800191, 0x00 },
++ { 0x80ed02, 0x40 },
++ { 0x80ee42, 0x40 },
++ { 0x80ee82, 0x40 },
+ { 0x80f000, 0x0f },
+ { 0x80f01f, 0x8c },
+ { 0x80f020, 0x00 },
+@@ -1699,7 +1698,6 @@ static const struct reg_val tuner_init_it9135_61[] = {
+ { 0x800104, 0x02 },
+ { 0x800105, 0xc8 },
+ { 0x800106, 0x00 },
+- { 0x800109, 0x02 },
+ { 0x800115, 0x0a },
+ { 0x800116, 0x03 },
+ { 0x80011a, 0xc6 },
+@@ -1725,7 +1723,6 @@ static const struct reg_val tuner_init_it9135_61[] = {
+ { 0x80014b, 0x8c },
+ { 0x80014d, 0xa8 },
+ { 0x80014e, 0xc6 },
+- { 0x80014f, 0x03 },
+ { 0x800151, 0x28 },
+ { 0x800153, 0xcc },
+ { 0x800178, 0x09 },
+@@ -1737,9 +1734,10 @@ static const struct reg_val tuner_init_it9135_61[] = {
+ { 0x80018d, 0x5f },
+ { 0x80018f, 0xfb },
+ { 0x800190, 0x5c },
+- { 0x80ed02, 0xff },
+- { 0x80ee42, 0xff },
+- { 0x80ee82, 0xff },
++ { 0x800191, 0x00 },
++ { 0x80ed02, 0x40 },
++ { 0x80ee42, 0x40 },
++ { 0x80ee82, 0x40 },
+ { 0x80f000, 0x0f },
+ { 0x80f01f, 0x8c },
+ { 0x80f020, 0x00 },
+diff --git a/drivers/media/i2c/adv7604.c b/drivers/media/i2c/adv7604.c
+index 1778d320272e..67403b94f0a2 100644
+--- a/drivers/media/i2c/adv7604.c
++++ b/drivers/media/i2c/adv7604.c
+@@ -2325,7 +2325,7 @@ static int adv7604_log_status(struct v4l2_subdev *sd)
+ v4l2_info(sd, "HDCP keys read: %s%s\n",
+ (hdmi_read(sd, 0x04) & 0x20) ? "yes" : "no",
+ (hdmi_read(sd, 0x04) & 0x10) ? "ERROR" : "");
+- if (!is_hdmi(sd)) {
++ if (is_hdmi(sd)) {
+ bool audio_pll_locked = hdmi_read(sd, 0x04) & 0x01;
+ bool audio_sample_packet_detect = hdmi_read(sd, 0x18) & 0x01;
+ bool audio_mute = io_read(sd, 0x65) & 0x40;
+diff --git a/drivers/media/pci/cx18/cx18-driver.c b/drivers/media/pci/cx18/cx18-driver.c
+index 716bdc57fac6..83f5074706f9 100644
+--- a/drivers/media/pci/cx18/cx18-driver.c
++++ b/drivers/media/pci/cx18/cx18-driver.c
+@@ -1091,6 +1091,7 @@ static int cx18_probe(struct pci_dev *pci_dev,
+ setup.addr = ADDR_UNSET;
+ setup.type = cx->options.tuner;
+ setup.mode_mask = T_ANALOG_TV; /* matches TV tuners */
++ setup.config = NULL;
+ if (cx->options.radio > 0)
+ setup.mode_mask |= T_RADIO;
+ setup.tuner_callback = (setup.type == TUNER_XC2028) ?
+diff --git a/drivers/media/tuners/tuner_it913x.c b/drivers/media/tuners/tuner_it913x.c
+index 6f30d7e535b8..3d83c425bccf 100644
+--- a/drivers/media/tuners/tuner_it913x.c
++++ b/drivers/media/tuners/tuner_it913x.c
+@@ -396,6 +396,7 @@ struct dvb_frontend *it913x_attach(struct dvb_frontend *fe,
+ struct i2c_adapter *i2c_adap, u8 i2c_addr, u8 config)
+ {
+ struct it913x_state *state = NULL;
++ int ret;
+
+ /* allocate memory for the internal state */
+ state = kzalloc(sizeof(struct it913x_state), GFP_KERNEL);
+@@ -425,6 +426,11 @@ struct dvb_frontend *it913x_attach(struct dvb_frontend *fe,
+ state->tuner_type = config;
+ state->firmware_ver = 1;
+
++ /* tuner RF initial */
++ ret = it913x_wr_reg(state, PRO_DMOD, 0xec4c, 0x68);
++ if (ret < 0)
++ goto error;
++
+ fe->tuner_priv = state;
+ memcpy(&fe->ops.tuner_ops, &it913x_tuner_ops,
+ sizeof(struct dvb_tuner_ops));
+diff --git a/drivers/media/usb/dvb-usb-v2/af9035.c b/drivers/media/usb/dvb-usb-v2/af9035.c
+index 7b9b75f60774..04d8e951de0d 100644
+--- a/drivers/media/usb/dvb-usb-v2/af9035.c
++++ b/drivers/media/usb/dvb-usb-v2/af9035.c
+@@ -1555,6 +1555,10 @@ static const struct usb_device_id af9035_id_table[] = {
+ &af9035_props, "Leadtek WinFast DTV Dongle Dual", NULL) },
+ { DVB_USB_DEVICE(USB_VID_HAUPPAUGE, 0xf900,
+ &af9035_props, "Hauppauge WinTV-MiniStick 2", NULL) },
++ { DVB_USB_DEVICE(USB_VID_PCTV, USB_PID_PCTV_78E,
++ &af9035_props, "PCTV 78e", RC_MAP_IT913X_V1) },
++ { DVB_USB_DEVICE(USB_VID_PCTV, USB_PID_PCTV_79E,
++ &af9035_props, "PCTV 79e", RC_MAP_IT913X_V2) },
+ { }
+ };
+ MODULE_DEVICE_TABLE(usb, af9035_id_table);
+diff --git a/drivers/media/usb/em28xx/em28xx-video.c b/drivers/media/usb/em28xx/em28xx-video.c
+index f6b49c98e2c9..408c072ce228 100644
+--- a/drivers/media/usb/em28xx/em28xx-video.c
++++ b/drivers/media/usb/em28xx/em28xx-video.c
+@@ -1344,7 +1344,7 @@ static int vidioc_s_fmt_vid_cap(struct file *file, void *priv,
+ struct em28xx *dev = video_drvdata(file);
+ struct em28xx_v4l2 *v4l2 = dev->v4l2;
+
+- if (v4l2->streaming_users > 0)
++ if (vb2_is_busy(&v4l2->vb_vidq))
+ return -EBUSY;
+
+ vidioc_try_fmt_vid_cap(file, priv, f);
+diff --git a/drivers/media/v4l2-core/videobuf2-core.c b/drivers/media/v4l2-core/videobuf2-core.c
+index 1d67e95311d6..dcdceae30ab0 100644
+--- a/drivers/media/v4l2-core/videobuf2-core.c
++++ b/drivers/media/v4l2-core/videobuf2-core.c
+@@ -1126,7 +1126,7 @@ EXPORT_SYMBOL_GPL(vb2_plane_vaddr);
+ */
+ void *vb2_plane_cookie(struct vb2_buffer *vb, unsigned int plane_no)
+ {
+- if (plane_no > vb->num_planes || !vb->planes[plane_no].mem_priv)
++ if (plane_no >= vb->num_planes || !vb->planes[plane_no].mem_priv)
+ return NULL;
+
+ return call_ptr_memop(vb, cookie, vb->planes[plane_no].mem_priv);
+@@ -1161,13 +1161,10 @@ void vb2_buffer_done(struct vb2_buffer *vb, enum vb2_buffer_state state)
+ if (WARN_ON(vb->state != VB2_BUF_STATE_ACTIVE))
+ return;
+
+- if (!q->start_streaming_called) {
+- if (WARN_ON(state != VB2_BUF_STATE_QUEUED))
+- state = VB2_BUF_STATE_QUEUED;
+- } else if (WARN_ON(state != VB2_BUF_STATE_DONE &&
+- state != VB2_BUF_STATE_ERROR)) {
+- state = VB2_BUF_STATE_ERROR;
+- }
++ if (WARN_ON(state != VB2_BUF_STATE_DONE &&
++ state != VB2_BUF_STATE_ERROR &&
++ state != VB2_BUF_STATE_QUEUED))
++ state = VB2_BUF_STATE_ERROR;
+
+ #ifdef CONFIG_VIDEO_ADV_DEBUG
+ /*
+@@ -1774,6 +1771,12 @@ static int vb2_start_streaming(struct vb2_queue *q)
+ /* Must be zero now */
+ WARN_ON(atomic_read(&q->owned_by_drv_count));
+ }
++ /*
++ * If done_list is not empty, then start_streaming() didn't call
++ * vb2_buffer_done(vb, VB2_BUF_STATE_QUEUED) but STATE_ERROR or
++ * STATE_DONE.
++ */
++ WARN_ON(!list_empty(&q->done_list));
+ return ret;
+ }
+
+diff --git a/drivers/media/v4l2-core/videobuf2-dma-sg.c b/drivers/media/v4l2-core/videobuf2-dma-sg.c
+index adefc31bb853..9b163a440f89 100644
+--- a/drivers/media/v4l2-core/videobuf2-dma-sg.c
++++ b/drivers/media/v4l2-core/videobuf2-dma-sg.c
+@@ -113,7 +113,7 @@ static void *vb2_dma_sg_alloc(void *alloc_ctx, unsigned long size, gfp_t gfp_fla
+ goto fail_pages_alloc;
+
+ ret = sg_alloc_table_from_pages(&buf->sg_table, buf->pages,
+- buf->num_pages, 0, size, gfp_flags);
++ buf->num_pages, 0, size, GFP_KERNEL);
+ if (ret)
+ goto fail_table_alloc;
+
+diff --git a/drivers/mmc/host/mmci.c b/drivers/mmc/host/mmci.c
+index 249ab80cbb45..d3f05ad33f09 100644
+--- a/drivers/mmc/host/mmci.c
++++ b/drivers/mmc/host/mmci.c
+@@ -65,6 +65,7 @@ static unsigned int fmax = 515633;
+ * @pwrreg_clkgate: MMCIPOWER register must be used to gate the clock
+ * @busy_detect: true if busy detection on dat0 is supported
+ * @pwrreg_nopower: bits in MMCIPOWER don't controls ext. power supply
++ * @reversed_irq_handling: handle data irq before cmd irq.
+ */
+ struct variant_data {
+ unsigned int clkreg;
+@@ -80,6 +81,7 @@ struct variant_data {
+ bool pwrreg_clkgate;
+ bool busy_detect;
+ bool pwrreg_nopower;
++ bool reversed_irq_handling;
+ };
+
+ static struct variant_data variant_arm = {
+@@ -87,6 +89,7 @@ static struct variant_data variant_arm = {
+ .fifohalfsize = 8 * 4,
+ .datalength_bits = 16,
+ .pwrreg_powerup = MCI_PWR_UP,
++ .reversed_irq_handling = true,
+ };
+
+ static struct variant_data variant_arm_extended_fifo = {
+@@ -1163,8 +1166,13 @@ static irqreturn_t mmci_irq(int irq, void *dev_id)
+
+ dev_dbg(mmc_dev(host->mmc), "irq0 (data+cmd) %08x\n", status);
+
+- mmci_cmd_irq(host, host->cmd, status);
+- mmci_data_irq(host, host->data, status);
++ if (host->variant->reversed_irq_handling) {
++ mmci_data_irq(host, host->data, status);
++ mmci_cmd_irq(host, host->cmd, status);
++ } else {
++ mmci_cmd_irq(host, host->cmd, status);
++ mmci_data_irq(host, host->data, status);
++ }
+
+ /* Don't poll for busy completion in irq context. */
+ if (host->busy_status)
+diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
+index c9127562bd22..21978cc019e7 100644
+--- a/drivers/net/ethernet/ibm/ibmveth.c
++++ b/drivers/net/ethernet/ibm/ibmveth.c
+@@ -292,6 +292,18 @@ failure:
+ atomic_add(buffers_added, &(pool->available));
+ }
+
++/*
++ * The final 8 bytes of the buffer list is a counter of frames dropped
++ * because there was not a buffer in the buffer list capable of holding
++ * the frame.
++ */
++static void ibmveth_update_rx_no_buffer(struct ibmveth_adapter *adapter)
++{
++ __be64 *p = adapter->buffer_list_addr + 4096 - 8;
++
++ adapter->rx_no_buffer = be64_to_cpup(p);
++}
++
+ /* replenish routine */
+ static void ibmveth_replenish_task(struct ibmveth_adapter *adapter)
+ {
+@@ -307,8 +319,7 @@ static void ibmveth_replenish_task(struct ibmveth_adapter *adapter)
+ ibmveth_replenish_buffer_pool(adapter, pool);
+ }
+
+- adapter->rx_no_buffer = *(u64 *)(((char*)adapter->buffer_list_addr) +
+- 4096 - 8);
++ ibmveth_update_rx_no_buffer(adapter);
+ }
+
+ /* empty and free ana buffer pool - also used to do cleanup in error paths */
+@@ -698,8 +709,7 @@ static int ibmveth_close(struct net_device *netdev)
+
+ free_irq(netdev->irq, netdev);
+
+- adapter->rx_no_buffer = *(u64 *)(((char *)adapter->buffer_list_addr) +
+- 4096 - 8);
++ ibmveth_update_rx_no_buffer(adapter);
+
+ ibmveth_cleanup(adapter);
+
+diff --git a/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c b/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c
+index bb86eb2ffc95..f0484b1b617e 100644
+--- a/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c
++++ b/drivers/net/wireless/ath/ath9k/htc_drv_txrx.c
+@@ -978,7 +978,7 @@ static bool ath9k_rx_prepare(struct ath9k_htc_priv *priv,
+ struct ath_hw *ah = common->ah;
+ struct ath_htc_rx_status *rxstatus;
+ struct ath_rx_status rx_stats;
+- bool decrypt_error;
++ bool decrypt_error = false;
+
+ if (skb->len < HTC_RX_FRAME_HEADER_SIZE) {
+ ath_err(common, "Corrupted RX frame, dropping (len: %d)\n",
+diff --git a/drivers/net/wireless/ath/carl9170/carl9170.h b/drivers/net/wireless/ath/carl9170/carl9170.h
+index 8596aba34f96..237d0cda1bcb 100644
+--- a/drivers/net/wireless/ath/carl9170/carl9170.h
++++ b/drivers/net/wireless/ath/carl9170/carl9170.h
+@@ -256,6 +256,7 @@ struct ar9170 {
+ atomic_t rx_work_urbs;
+ atomic_t rx_pool_urbs;
+ kernel_ulong_t features;
++ bool usb_ep_cmd_is_bulk;
+
+ /* firmware settings */
+ struct completion fw_load_wait;
+diff --git a/drivers/net/wireless/ath/carl9170/usb.c b/drivers/net/wireless/ath/carl9170/usb.c
+index f35c7f30f9a6..c9f93310c0d6 100644
+--- a/drivers/net/wireless/ath/carl9170/usb.c
++++ b/drivers/net/wireless/ath/carl9170/usb.c
+@@ -621,9 +621,16 @@ int __carl9170_exec_cmd(struct ar9170 *ar, struct carl9170_cmd *cmd,
+ goto err_free;
+ }
+
+- usb_fill_int_urb(urb, ar->udev, usb_sndintpipe(ar->udev,
+- AR9170_USB_EP_CMD), cmd, cmd->hdr.len + 4,
+- carl9170_usb_cmd_complete, ar, 1);
++ if (ar->usb_ep_cmd_is_bulk)
++ usb_fill_bulk_urb(urb, ar->udev,
++ usb_sndbulkpipe(ar->udev, AR9170_USB_EP_CMD),
++ cmd, cmd->hdr.len + 4,
++ carl9170_usb_cmd_complete, ar);
++ else
++ usb_fill_int_urb(urb, ar->udev,
++ usb_sndintpipe(ar->udev, AR9170_USB_EP_CMD),
++ cmd, cmd->hdr.len + 4,
++ carl9170_usb_cmd_complete, ar, 1);
+
+ if (free_buf)
+ urb->transfer_flags |= URB_FREE_BUFFER;
+@@ -1032,9 +1039,10 @@ static void carl9170_usb_firmware_step2(const struct firmware *fw,
+ static int carl9170_usb_probe(struct usb_interface *intf,
+ const struct usb_device_id *id)
+ {
++ struct usb_endpoint_descriptor *ep;
+ struct ar9170 *ar;
+ struct usb_device *udev;
+- int err;
++ int i, err;
+
+ err = usb_reset_device(interface_to_usbdev(intf));
+ if (err)
+@@ -1050,6 +1058,21 @@ static int carl9170_usb_probe(struct usb_interface *intf,
+ ar->intf = intf;
+ ar->features = id->driver_info;
+
++ /* We need to remember the type of endpoint 4 because it differs
++ * between high- and full-speed configuration. The high-speed
++ * configuration specifies it as interrupt and the full-speed
++ * configuration as bulk endpoint. This information is required
++ * later when sending urbs to that endpoint.
++ */
++ for (i = 0; i < intf->cur_altsetting->desc.bNumEndpoints; ++i) {
++ ep = &intf->cur_altsetting->endpoint[i].desc;
++
++ if (usb_endpoint_num(ep) == AR9170_USB_EP_CMD &&
++ usb_endpoint_dir_out(ep) &&
++ usb_endpoint_type(ep) == USB_ENDPOINT_XFER_BULK)
++ ar->usb_ep_cmd_is_bulk = true;
++ }
++
+ usb_set_intfdata(intf, ar);
+ SET_IEEE80211_DEV(ar->hw, &intf->dev);
+
+diff --git a/drivers/net/wireless/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/brcm80211/brcmfmac/fweh.c
+index fad77dd2a3a5..3f9cb894d001 100644
+--- a/drivers/net/wireless/brcm80211/brcmfmac/fweh.c
++++ b/drivers/net/wireless/brcm80211/brcmfmac/fweh.c
+@@ -185,7 +185,13 @@ static void brcmf_fweh_handle_if_event(struct brcmf_pub *drvr,
+ ifevent->action, ifevent->ifidx, ifevent->bssidx,
+ ifevent->flags, ifevent->role);
+
+- if (ifevent->flags & BRCMF_E_IF_FLAG_NOIF) {
++ /* The P2P Device interface event must not be ignored
++ * contrary to what firmware tells us. The only way to
++ * distinguish the P2P Device is by looking at the ifidx
++ * and bssidx received.
++ */
++ if (!(ifevent->ifidx == 0 && ifevent->bssidx == 1) &&
++ (ifevent->flags & BRCMF_E_IF_FLAG_NOIF)) {
+ brcmf_dbg(EVENT, "event can be ignored\n");
+ return;
+ }
+@@ -210,12 +216,12 @@ static void brcmf_fweh_handle_if_event(struct brcmf_pub *drvr,
+ return;
+ }
+
+- if (ifevent->action == BRCMF_E_IF_CHANGE)
++ if (ifp && ifevent->action == BRCMF_E_IF_CHANGE)
+ brcmf_fws_reset_interface(ifp);
+
+ err = brcmf_fweh_call_event_handler(ifp, emsg->event_code, emsg, data);
+
+- if (ifevent->action == BRCMF_E_IF_DEL) {
++ if (ifp && ifevent->action == BRCMF_E_IF_DEL) {
+ brcmf_fws_del_interface(ifp);
+ brcmf_del_if(drvr, ifevent->bssidx);
+ }
+diff --git a/drivers/net/wireless/brcm80211/brcmfmac/fweh.h b/drivers/net/wireless/brcm80211/brcmfmac/fweh.h
+index 51b53a73d074..d26b47698f68 100644
+--- a/drivers/net/wireless/brcm80211/brcmfmac/fweh.h
++++ b/drivers/net/wireless/brcm80211/brcmfmac/fweh.h
+@@ -167,6 +167,8 @@ enum brcmf_fweh_event_code {
+ #define BRCMF_E_IF_ROLE_STA 0
+ #define BRCMF_E_IF_ROLE_AP 1
+ #define BRCMF_E_IF_ROLE_WDS 2
++#define BRCMF_E_IF_ROLE_P2P_GO 3
++#define BRCMF_E_IF_ROLE_P2P_CLIENT 4
+
+ /**
+ * definitions for event packet validation.
+diff --git a/drivers/net/wireless/iwlwifi/dvm/rxon.c b/drivers/net/wireless/iwlwifi/dvm/rxon.c
+index 6dc5dd3ced44..ed50de6362ed 100644
+--- a/drivers/net/wireless/iwlwifi/dvm/rxon.c
++++ b/drivers/net/wireless/iwlwifi/dvm/rxon.c
+@@ -1068,6 +1068,13 @@ int iwlagn_commit_rxon(struct iwl_priv *priv, struct iwl_rxon_context *ctx)
+ /* recalculate basic rates */
+ iwl_calc_basic_rates(priv, ctx);
+
++ /*
++ * force CTS-to-self frames protection if RTS-CTS is not preferred
++ * one aggregation protection method
++ */
++ if (!priv->hw_params.use_rts_for_aggregation)
++ ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
++
+ if ((ctx->vif && ctx->vif->bss_conf.use_short_slot) ||
+ !(ctx->staging.flags & RXON_FLG_BAND_24G_MSK))
+ ctx->staging.flags |= RXON_FLG_SHORT_SLOT_MSK;
+@@ -1473,6 +1480,11 @@ void iwlagn_bss_info_changed(struct ieee80211_hw *hw,
+ else
+ ctx->staging.flags &= ~RXON_FLG_TGG_PROTECT_MSK;
+
++ if (bss_conf->use_cts_prot)
++ ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
++ else
++ ctx->staging.flags &= ~RXON_FLG_SELF_CTS_EN;
++
+ memcpy(ctx->staging.bssid_addr, bss_conf->bssid, ETH_ALEN);
+
+ if (vif->type == NL80211_IFTYPE_AP ||
+diff --git a/drivers/net/wireless/iwlwifi/iwl-config.h b/drivers/net/wireless/iwlwifi/iwl-config.h
+index b7047905f41a..6ac1bedd2876 100644
+--- a/drivers/net/wireless/iwlwifi/iwl-config.h
++++ b/drivers/net/wireless/iwlwifi/iwl-config.h
+@@ -120,6 +120,8 @@ enum iwl_led_mode {
+ #define IWL_LONG_WD_TIMEOUT 10000
+ #define IWL_MAX_WD_TIMEOUT 120000
+
++#define IWL_DEFAULT_MAX_TX_POWER 22
++
+ /* Antenna presence definitions */
+ #define ANT_NONE 0x0
+ #define ANT_A BIT(0)
+diff --git a/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c b/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c
+index 85eee79c495c..0c75fc140bf6 100644
+--- a/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c
++++ b/drivers/net/wireless/iwlwifi/iwl-nvm-parse.c
+@@ -143,8 +143,6 @@ static const u8 iwl_nvm_channels_family_8000[] = {
+ #define LAST_2GHZ_HT_PLUS 9
+ #define LAST_5GHZ_HT 161
+
+-#define DEFAULT_MAX_TX_POWER 16
+-
+ /* rate data (static) */
+ static struct ieee80211_rate iwl_cfg80211_rates[] = {
+ { .bitrate = 1 * 10, .hw_value = 0, .hw_value_short = 0, },
+@@ -279,7 +277,7 @@ static int iwl_init_channel_map(struct device *dev, const struct iwl_cfg *cfg,
+ * Default value - highest tx power value. max_power
+ * is not used in mvm, and is used for backwards compatibility
+ */
+- channel->max_power = DEFAULT_MAX_TX_POWER;
++ channel->max_power = IWL_DEFAULT_MAX_TX_POWER;
+ is_5ghz = channel->band == IEEE80211_BAND_5GHZ;
+ IWL_DEBUG_EEPROM(dev,
+ "Ch. %d [%sGHz] %s%s%s%s%s%s(0x%02x %ddBm): Ad-Hoc %ssupported\n",
+diff --git a/drivers/net/wireless/iwlwifi/mvm/fw-api.h b/drivers/net/wireless/iwlwifi/mvm/fw-api.h
+index 309a9b9a94fe..67363080f83d 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/fw-api.h
++++ b/drivers/net/wireless/iwlwifi/mvm/fw-api.h
+@@ -1487,14 +1487,14 @@ enum iwl_sf_scenario {
+
+ /**
+ * Smart Fifo configuration command.
+- * @state: smart fifo state, types listed in iwl_sf_sate.
++ * @state: smart fifo state, types listed in enum %iwl_sf_sate.
+ * @watermark: Minimum allowed availabe free space in RXF for transient state.
+ * @long_delay_timeouts: aging and idle timer values for each scenario
+ * in long delay state.
+ * @full_on_timeouts: timer values for each scenario in full on state.
+ */
+ struct iwl_sf_cfg_cmd {
+- enum iwl_sf_state state;
++ __le32 state;
+ __le32 watermark[SF_TRANSIENT_STATES_NUMBER];
+ __le32 long_delay_timeouts[SF_NUM_SCENARIO][SF_NUM_TIMEOUT_TYPES];
+ __le32 full_on_timeouts[SF_NUM_SCENARIO][SF_NUM_TIMEOUT_TYPES];
+diff --git a/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c b/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
+index 8b79081d4885..db84533eff5d 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
++++ b/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
+@@ -720,11 +720,6 @@ static int iwl_mvm_mac_ctxt_cmd_sta(struct iwl_mvm *mvm,
+ !force_assoc_off) {
+ u32 dtim_offs;
+
+- /* Allow beacons to pass through as long as we are not
+- * associated, or we do not have dtim period information.
+- */
+- cmd.filter_flags |= cpu_to_le32(MAC_FILTER_IN_BEACON);
+-
+ /*
+ * The DTIM count counts down, so when it is N that means N
+ * more beacon intervals happen until the DTIM TBTT. Therefore
+@@ -758,6 +753,11 @@ static int iwl_mvm_mac_ctxt_cmd_sta(struct iwl_mvm *mvm,
+ ctxt_sta->is_assoc = cpu_to_le32(1);
+ } else {
+ ctxt_sta->is_assoc = cpu_to_le32(0);
++
++ /* Allow beacons to pass through as long as we are not
++ * associated, or we do not have dtim period information.
++ */
++ cmd.filter_flags |= cpu_to_le32(MAC_FILTER_IN_BEACON);
+ }
+
+ ctxt_sta->bi = cpu_to_le32(vif->bss_conf.beacon_int);
+diff --git a/drivers/net/wireless/iwlwifi/mvm/sf.c b/drivers/net/wireless/iwlwifi/mvm/sf.c
+index 7edfd15efc9d..e843b67f2201 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/sf.c
++++ b/drivers/net/wireless/iwlwifi/mvm/sf.c
+@@ -172,7 +172,7 @@ static int iwl_mvm_sf_config(struct iwl_mvm *mvm, u8 sta_id,
+ enum iwl_sf_state new_state)
+ {
+ struct iwl_sf_cfg_cmd sf_cmd = {
+- .state = new_state,
++ .state = cpu_to_le32(new_state),
+ };
+ struct ieee80211_sta *sta;
+ int ret = 0;
+diff --git a/drivers/net/wireless/iwlwifi/mvm/tx.c b/drivers/net/wireless/iwlwifi/mvm/tx.c
+index 3846a6c41eb1..f2465f60122e 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/tx.c
++++ b/drivers/net/wireless/iwlwifi/mvm/tx.c
+@@ -169,10 +169,14 @@ static void iwl_mvm_set_tx_cmd_rate(struct iwl_mvm *mvm,
+
+ /*
+ * for data packets, rate info comes from the table inside the fw. This
+- * table is controlled by LINK_QUALITY commands
++ * table is controlled by LINK_QUALITY commands. Exclude ctrl port
++ * frames like EAPOLs which should be treated as mgmt frames. This
++ * avoids them being sent initially in high rates which increases the
++ * chances for completion of the 4-Way handshake.
+ */
+
+- if (ieee80211_is_data(fc) && sta) {
++ if (ieee80211_is_data(fc) && sta &&
++ !(info->control.flags & IEEE80211_TX_CTRL_PORT_CTRL_PROTO)) {
+ tx_cmd->initial_rate_index = 0;
+ tx_cmd->tx_flags |= cpu_to_le32(TX_CMD_FLG_STA_RATE);
+ return;
+diff --git a/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c b/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c
+index 361435f8608a..1ac6383e7947 100644
+--- a/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c
++++ b/drivers/net/wireless/rtlwifi/rtl8192cu/sw.c
+@@ -317,6 +317,7 @@ static struct usb_device_id rtl8192c_usb_ids[] = {
+ {RTL_USB_DEVICE(0x0bda, 0x5088, rtl92cu_hal_cfg)}, /*Thinkware-CC&C*/
+ {RTL_USB_DEVICE(0x0df6, 0x0052, rtl92cu_hal_cfg)}, /*Sitecom - Edimax*/
+ {RTL_USB_DEVICE(0x0df6, 0x005c, rtl92cu_hal_cfg)}, /*Sitecom - Edimax*/
++ {RTL_USB_DEVICE(0x0df6, 0x0070, rtl92cu_hal_cfg)}, /*Sitecom - 150N */
+ {RTL_USB_DEVICE(0x0df6, 0x0077, rtl92cu_hal_cfg)}, /*Sitecom-WLA2100V2*/
+ {RTL_USB_DEVICE(0x0eb0, 0x9071, rtl92cu_hal_cfg)}, /*NO Brand - Etop*/
+ {RTL_USB_DEVICE(0x4856, 0x0091, rtl92cu_hal_cfg)}, /*NetweeN - Feixun*/
+diff --git a/drivers/nfc/microread/microread.c b/drivers/nfc/microread/microread.c
+index f868333271aa..963a4a5dc88e 100644
+--- a/drivers/nfc/microread/microread.c
++++ b/drivers/nfc/microread/microread.c
+@@ -501,9 +501,13 @@ static void microread_target_discovered(struct nfc_hci_dev *hdev, u8 gate,
+ targets->sens_res =
+ be16_to_cpu(*(u16 *)&skb->data[MICROREAD_EMCF_A_ATQA]);
+ targets->sel_res = skb->data[MICROREAD_EMCF_A_SAK];
+- memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A_UID],
+- skb->data[MICROREAD_EMCF_A_LEN]);
+ targets->nfcid1_len = skb->data[MICROREAD_EMCF_A_LEN];
++ if (targets->nfcid1_len > sizeof(targets->nfcid1)) {
++ r = -EINVAL;
++ goto exit_free;
++ }
++ memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A_UID],
++ targets->nfcid1_len);
+ break;
+ case MICROREAD_GATE_ID_MREAD_ISO_A_3:
+ targets->supported_protocols =
+@@ -511,9 +515,13 @@ static void microread_target_discovered(struct nfc_hci_dev *hdev, u8 gate,
+ targets->sens_res =
+ be16_to_cpu(*(u16 *)&skb->data[MICROREAD_EMCF_A3_ATQA]);
+ targets->sel_res = skb->data[MICROREAD_EMCF_A3_SAK];
+- memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A3_UID],
+- skb->data[MICROREAD_EMCF_A3_LEN]);
+ targets->nfcid1_len = skb->data[MICROREAD_EMCF_A3_LEN];
++ if (targets->nfcid1_len > sizeof(targets->nfcid1)) {
++ r = -EINVAL;
++ goto exit_free;
++ }
++ memcpy(targets->nfcid1, &skb->data[MICROREAD_EMCF_A3_UID],
++ targets->nfcid1_len);
+ break;
+ case MICROREAD_GATE_ID_MREAD_ISO_B:
+ targets->supported_protocols = NFC_PROTO_ISO14443_B_MASK;
+diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
+index 9aa012e6ea0a..379ad4fa9665 100644
+--- a/drivers/of/fdt.c
++++ b/drivers/of/fdt.c
+@@ -453,7 +453,7 @@ static int __init __reserved_mem_reserve_reg(unsigned long node,
+ base = dt_mem_next_cell(dt_root_addr_cells, &prop);
+ size = dt_mem_next_cell(dt_root_size_cells, &prop);
+
+- if (base && size &&
++ if (size &&
+ early_init_dt_reserve_memory_arch(base, size, nomap) == 0)
+ pr_debug("Reserved memory: reserved region for node '%s': base %pa, size %ld MiB\n",
+ uname, &base, (unsigned long)size / SZ_1M);
+diff --git a/drivers/of/irq.c b/drivers/of/irq.c
+index 3e06a699352d..1471e0a223a5 100644
+--- a/drivers/of/irq.c
++++ b/drivers/of/irq.c
+@@ -301,16 +301,17 @@ int of_irq_parse_one(struct device_node *device, int index, struct of_phandle_ar
+ /* Get the reg property (if any) */
+ addr = of_get_property(device, "reg", NULL);
+
++ /* Try the new-style interrupts-extended first */
++ res = of_parse_phandle_with_args(device, "interrupts-extended",
++ "#interrupt-cells", index, out_irq);
++ if (!res)
++ return of_irq_parse_raw(addr, out_irq);
++
+ /* Get the interrupts property */
+ intspec = of_get_property(device, "interrupts", &intlen);
+- if (intspec == NULL) {
+- /* Try the new-style interrupts-extended */
+- res = of_parse_phandle_with_args(device, "interrupts-extended",
+- "#interrupt-cells", index, out_irq);
+- if (res)
+- return -EINVAL;
+- return of_irq_parse_raw(addr, out_irq);
+- }
++ if (intspec == NULL)
++ return -EINVAL;
++
+ intlen /= sizeof(*intspec);
+
+ pr_debug(" intspec=%d intlen=%d\n", be32_to_cpup(intspec), intlen);
+diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
+index 602d153c7055..c074b262a492 100644
+--- a/drivers/pci/hotplug/acpiphp_glue.c
++++ b/drivers/pci/hotplug/acpiphp_glue.c
+@@ -573,19 +573,15 @@ static void disable_slot(struct acpiphp_slot *slot)
+ slot->flags &= (~SLOT_ENABLED);
+ }
+
+-static bool acpiphp_no_hotplug(struct acpi_device *adev)
+-{
+- return adev && adev->flags.no_hotplug;
+-}
+-
+ static bool slot_no_hotplug(struct acpiphp_slot *slot)
+ {
+- struct acpiphp_func *func;
++ struct pci_bus *bus = slot->bus;
++ struct pci_dev *dev;
+
+- list_for_each_entry(func, &slot->funcs, sibling)
+- if (acpiphp_no_hotplug(func_to_acpi_device(func)))
++ list_for_each_entry(dev, &bus->devices, bus_list) {
++ if (PCI_SLOT(dev->devfn) == slot->device && dev->ignore_hotplug)
+ return true;
+-
++ }
+ return false;
+ }
+
+@@ -658,7 +654,7 @@ static void trim_stale_devices(struct pci_dev *dev)
+
+ status = acpi_evaluate_integer(adev->handle, "_STA", NULL, &sta);
+ alive = (ACPI_SUCCESS(status) && device_status_valid(sta))
+- || acpiphp_no_hotplug(adev);
++ || dev->ignore_hotplug;
+ }
+ if (!alive)
+ alive = pci_device_is_present(dev);
+diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
+index 056841651a80..fa6a320b4d58 100644
+--- a/drivers/pci/hotplug/pciehp_hpc.c
++++ b/drivers/pci/hotplug/pciehp_hpc.c
+@@ -508,6 +508,8 @@ static irqreturn_t pcie_isr(int irq, void *dev_id)
+ {
+ struct controller *ctrl = (struct controller *)dev_id;
+ struct pci_dev *pdev = ctrl_dev(ctrl);
++ struct pci_bus *subordinate = pdev->subordinate;
++ struct pci_dev *dev;
+ struct slot *slot = ctrl->slot;
+ u16 detected, intr_loc;
+
+@@ -541,6 +543,16 @@ static irqreturn_t pcie_isr(int irq, void *dev_id)
+ wake_up(&ctrl->queue);
+ }
+
++ if (subordinate) {
++ list_for_each_entry(dev, &subordinate->devices, bus_list) {
++ if (dev->ignore_hotplug) {
++ ctrl_dbg(ctrl, "ignoring hotplug event %#06x (%s requested no hotplug)\n",
++ intr_loc, pci_name(dev));
++ return IRQ_HANDLED;
++ }
++ }
++ }
++
+ if (!(intr_loc & ~PCI_EXP_SLTSTA_CC))
+ return IRQ_HANDLED;
+
+diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
+index e3cf8a2e6292..4170113cde61 100644
+--- a/drivers/pci/probe.c
++++ b/drivers/pci/probe.c
+@@ -775,7 +775,7 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max, int pass)
+ /* Check if setup is sensible at all */
+ if (!pass &&
+ (primary != bus->number || secondary <= bus->number ||
+- secondary > subordinate || subordinate > bus->busn_res.end)) {
++ secondary > subordinate)) {
+ dev_info(&dev->dev, "bridge configuration invalid ([bus %02x-%02x]), reconfiguring\n",
+ secondary, subordinate);
+ broken = 1;
+@@ -838,23 +838,18 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max, int pass)
+ goto out;
+ }
+
+- if (max >= bus->busn_res.end) {
+- dev_warn(&dev->dev, "can't allocate child bus %02x from %pR\n",
+- max, &bus->busn_res);
+- goto out;
+- }
+-
+ /* Clear errors */
+ pci_write_config_word(dev, PCI_STATUS, 0xffff);
+
+- /* The bus will already exist if we are rescanning */
++ /* Prevent assigning a bus number that already exists.
++ * This can happen when a bridge is hot-plugged, so in
++ * this case we only re-scan this bus. */
+ child = pci_find_bus(pci_domain_nr(bus), max+1);
+ if (!child) {
+ child = pci_add_new_bus(bus, dev, max+1);
+ if (!child)
+ goto out;
+- pci_bus_insert_busn_res(child, max+1,
+- bus->busn_res.end);
++ pci_bus_insert_busn_res(child, max+1, 0xff);
+ }
+ max++;
+ buses = (buses & 0xff000000)
+@@ -913,11 +908,6 @@ int pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max, int pass)
+ /*
+ * Set the subordinate bus number to its real value.
+ */
+- if (max > bus->busn_res.end) {
+- dev_warn(&dev->dev, "max busn %02x is outside %pR\n",
+- max, &bus->busn_res);
+- max = bus->busn_res.end;
+- }
+ pci_bus_update_busn_res_end(child, max);
+ pci_write_config_byte(dev, PCI_SUBORDINATE_BUS, max);
+ }
+diff --git a/drivers/phy/phy-twl4030-usb.c b/drivers/phy/phy-twl4030-usb.c
+index 2e0e9b3774c8..ef4f3350faa1 100644
+--- a/drivers/phy/phy-twl4030-usb.c
++++ b/drivers/phy/phy-twl4030-usb.c
+@@ -34,6 +34,7 @@
+ #include <linux/delay.h>
+ #include <linux/usb/otg.h>
+ #include <linux/phy/phy.h>
++#include <linux/pm_runtime.h>
+ #include <linux/usb/musb-omap.h>
+ #include <linux/usb/ulpi.h>
+ #include <linux/i2c/twl.h>
+@@ -422,37 +423,55 @@ static void twl4030_phy_power(struct twl4030_usb *twl, int on)
+ }
+ }
+
+-static int twl4030_phy_power_off(struct phy *phy)
++static int twl4030_usb_runtime_suspend(struct device *dev)
+ {
+- struct twl4030_usb *twl = phy_get_drvdata(phy);
++ struct twl4030_usb *twl = dev_get_drvdata(dev);
+
++ dev_dbg(twl->dev, "%s\n", __func__);
+ if (twl->asleep)
+ return 0;
+
+ twl4030_phy_power(twl, 0);
+ twl->asleep = 1;
+- dev_dbg(twl->dev, "%s\n", __func__);
++
+ return 0;
+ }
+
+-static void __twl4030_phy_power_on(struct twl4030_usb *twl)
++static int twl4030_usb_runtime_resume(struct device *dev)
+ {
++ struct twl4030_usb *twl = dev_get_drvdata(dev);
++
++ dev_dbg(twl->dev, "%s\n", __func__);
++ if (!twl->asleep)
++ return 0;
++
+ twl4030_phy_power(twl, 1);
+- twl4030_i2c_access(twl, 1);
+- twl4030_usb_set_mode(twl, twl->usb_mode);
+- if (twl->usb_mode == T2_USB_MODE_ULPI)
+- twl4030_i2c_access(twl, 0);
++ twl->asleep = 0;
++
++ return 0;
++}
++
++static int twl4030_phy_power_off(struct phy *phy)
++{
++ struct twl4030_usb *twl = phy_get_drvdata(phy);
++
++ dev_dbg(twl->dev, "%s\n", __func__);
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++
++ return 0;
+ }
+
+ static int twl4030_phy_power_on(struct phy *phy)
+ {
+ struct twl4030_usb *twl = phy_get_drvdata(phy);
+
+- if (!twl->asleep)
+- return 0;
+- __twl4030_phy_power_on(twl);
+- twl->asleep = 0;
+ dev_dbg(twl->dev, "%s\n", __func__);
++ pm_runtime_get_sync(twl->dev);
++ twl4030_i2c_access(twl, 1);
++ twl4030_usb_set_mode(twl, twl->usb_mode);
++ if (twl->usb_mode == T2_USB_MODE_ULPI)
++ twl4030_i2c_access(twl, 0);
+
+ /*
+ * XXX When VBUS gets driven after musb goes to A mode,
+@@ -558,9 +577,27 @@ static irqreturn_t twl4030_usb_irq(int irq, void *_twl)
+ * USB_LINK_VBUS state. musb_hdrc won't care until it
+ * starts to handle softconnect right.
+ */
++ if ((status == OMAP_MUSB_VBUS_VALID) ||
++ (status == OMAP_MUSB_ID_GROUND)) {
++ if (twl->asleep)
++ pm_runtime_get_sync(twl->dev);
++ } else {
++ if (!twl->asleep) {
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++ }
++ }
+ omap_musb_mailbox(status);
+ }
+- sysfs_notify(&twl->dev->kobj, NULL, "vbus");
++
++ /* don't schedule during sleep - irq works right then */
++ if (status == OMAP_MUSB_ID_GROUND && !twl->asleep) {
++ cancel_delayed_work(&twl->id_workaround_work);
++ schedule_delayed_work(&twl->id_workaround_work, HZ);
++ }
++
++ if (irq)
++ sysfs_notify(&twl->dev->kobj, NULL, "vbus");
+
+ return IRQ_HANDLED;
+ }
+@@ -569,29 +606,8 @@ static void twl4030_id_workaround_work(struct work_struct *work)
+ {
+ struct twl4030_usb *twl = container_of(work, struct twl4030_usb,
+ id_workaround_work.work);
+- enum omap_musb_vbus_id_status status;
+- bool status_changed = false;
+-
+- status = twl4030_usb_linkstat(twl);
+-
+- spin_lock_irq(&twl->lock);
+- if (status >= 0 && status != twl->linkstat) {
+- twl->linkstat = status;
+- status_changed = true;
+- }
+- spin_unlock_irq(&twl->lock);
+-
+- if (status_changed) {
+- dev_dbg(twl->dev, "handle missing status change to %d\n",
+- status);
+- omap_musb_mailbox(status);
+- }
+
+- /* don't schedule during sleep - irq works right then */
+- if (status == OMAP_MUSB_ID_GROUND && !twl->asleep) {
+- cancel_delayed_work(&twl->id_workaround_work);
+- schedule_delayed_work(&twl->id_workaround_work, HZ);
+- }
++ twl4030_usb_irq(0, twl);
+ }
+
+ static int twl4030_phy_init(struct phy *phy)
+@@ -599,22 +615,17 @@ static int twl4030_phy_init(struct phy *phy)
+ struct twl4030_usb *twl = phy_get_drvdata(phy);
+ enum omap_musb_vbus_id_status status;
+
+- /*
+- * Start in sleep state, we'll get called through set_suspend()
+- * callback when musb is runtime resumed and it's time to start.
+- */
+- __twl4030_phy_power(twl, 0);
+- twl->asleep = 1;
+-
++ pm_runtime_get_sync(twl->dev);
+ status = twl4030_usb_linkstat(twl);
+ twl->linkstat = status;
+
+- if (status == OMAP_MUSB_ID_GROUND || status == OMAP_MUSB_VBUS_VALID) {
++ if (status == OMAP_MUSB_ID_GROUND || status == OMAP_MUSB_VBUS_VALID)
+ omap_musb_mailbox(twl->linkstat);
+- twl4030_phy_power_on(phy);
+- }
+
+ sysfs_notify(&twl->dev->kobj, NULL, "vbus");
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++
+ return 0;
+ }
+
+@@ -650,6 +661,11 @@ static const struct phy_ops ops = {
+ .owner = THIS_MODULE,
+ };
+
++static const struct dev_pm_ops twl4030_usb_pm_ops = {
++ SET_RUNTIME_PM_OPS(twl4030_usb_runtime_suspend,
++ twl4030_usb_runtime_resume, NULL)
++};
++
+ static int twl4030_usb_probe(struct platform_device *pdev)
+ {
+ struct twl4030_usb_data *pdata = dev_get_platdata(&pdev->dev);
+@@ -726,6 +742,11 @@ static int twl4030_usb_probe(struct platform_device *pdev)
+
+ ATOMIC_INIT_NOTIFIER_HEAD(&twl->phy.notifier);
+
++ pm_runtime_use_autosuspend(&pdev->dev);
++ pm_runtime_set_autosuspend_delay(&pdev->dev, 2000);
++ pm_runtime_enable(&pdev->dev);
++ pm_runtime_get_sync(&pdev->dev);
++
+ /* Our job is to use irqs and status from the power module
+ * to keep the transceiver disabled when nothing's connected.
+ *
+@@ -744,6 +765,9 @@ static int twl4030_usb_probe(struct platform_device *pdev)
+ return status;
+ }
+
++ pm_runtime_mark_last_busy(&pdev->dev);
++ pm_runtime_put_autosuspend(twl->dev);
++
+ dev_info(&pdev->dev, "Initialized TWL4030 USB module\n");
+ return 0;
+ }
+@@ -753,6 +777,7 @@ static int twl4030_usb_remove(struct platform_device *pdev)
+ struct twl4030_usb *twl = platform_get_drvdata(pdev);
+ int val;
+
++ pm_runtime_get_sync(twl->dev);
+ cancel_delayed_work(&twl->id_workaround_work);
+ device_remove_file(twl->dev, &dev_attr_vbus);
+
+@@ -772,9 +797,8 @@ static int twl4030_usb_remove(struct platform_device *pdev)
+
+ /* disable complete OTG block */
+ twl4030_usb_clear_bits(twl, POWER_CTRL, POWER_CTRL_OTG_ENAB);
+-
+- if (!twl->asleep)
+- twl4030_phy_power(twl, 0);
++ pm_runtime_mark_last_busy(twl->dev);
++ pm_runtime_put(twl->dev);
+
+ return 0;
+ }
+@@ -792,6 +816,7 @@ static struct platform_driver twl4030_usb_driver = {
+ .remove = twl4030_usb_remove,
+ .driver = {
+ .name = "twl4030_usb",
++ .pm = &twl4030_usb_pm_ops,
+ .owner = THIS_MODULE,
+ .of_match_table = of_match_ptr(twl4030_usb_id_table),
+ },
+diff --git a/drivers/pwm/core.c b/drivers/pwm/core.c
+index 4b66bf09ee55..d2c35920ff08 100644
+--- a/drivers/pwm/core.c
++++ b/drivers/pwm/core.c
+@@ -606,6 +606,8 @@ struct pwm_device *pwm_get(struct device *dev, const char *con_id)
+ unsigned int best = 0;
+ struct pwm_lookup *p;
+ unsigned int match;
++ unsigned int period;
++ enum pwm_polarity polarity;
+
+ /* look up via DT first */
+ if (IS_ENABLED(CONFIG_OF) && dev && dev->of_node)
+@@ -653,6 +655,8 @@ struct pwm_device *pwm_get(struct device *dev, const char *con_id)
+ if (match > best) {
+ chip = pwmchip_find_by_name(p->provider);
+ index = p->index;
++ period = p->period;
++ polarity = p->polarity;
+
+ if (match != 3)
+ best = match;
+@@ -668,8 +672,8 @@ struct pwm_device *pwm_get(struct device *dev, const char *con_id)
+ if (IS_ERR(pwm))
+ return pwm;
+
+- pwm_set_period(pwm, p->period);
+- pwm_set_polarity(pwm, p->polarity);
++ pwm_set_period(pwm, period);
++ pwm_set_polarity(pwm, polarity);
+
+
+ return pwm;
+diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
+index 3d1bc67bac9d..874bc950b9f6 100644
+--- a/drivers/scsi/libiscsi.c
++++ b/drivers/scsi/libiscsi.c
+@@ -717,11 +717,21 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
+ return NULL;
+ }
+
++ if (data_size > ISCSI_DEF_MAX_RECV_SEG_LEN) {
++ iscsi_conn_printk(KERN_ERR, conn, "Invalid buffer len of %u for login task. Max len is %u\n", data_size, ISCSI_DEF_MAX_RECV_SEG_LEN);
++ return NULL;
++ }
++
+ task = conn->login_task;
+ } else {
+ if (session->state != ISCSI_STATE_LOGGED_IN)
+ return NULL;
+
++ if (data_size != 0) {
++ iscsi_conn_printk(KERN_ERR, conn, "Can not send data buffer of len %u for op 0x%x\n", data_size, opcode);
++ return NULL;
++ }
++
+ BUG_ON(conn->c_stage == ISCSI_CONN_INITIAL_STAGE);
+ BUG_ON(conn->c_stage == ISCSI_CONN_STOPPED);
+
+diff --git a/drivers/spi/spi-dw-pci.c b/drivers/spi/spi-dw-pci.c
+index 3f3dc1226edf..e14960470d8d 100644
+--- a/drivers/spi/spi-dw-pci.c
++++ b/drivers/spi/spi-dw-pci.c
+@@ -62,6 +62,8 @@ static int spi_pci_probe(struct pci_dev *pdev,
+ if (ret)
+ return ret;
+
++ dws->regs = pcim_iomap_table(pdev)[pci_bar];
++
+ dws->bus_num = 0;
+ dws->num_cs = 4;
+ dws->irq = pdev->irq;
+diff --git a/drivers/spi/spi-dw.c b/drivers/spi/spi-dw.c
+index 29f33143b795..0dd0623319b0 100644
+--- a/drivers/spi/spi-dw.c
++++ b/drivers/spi/spi-dw.c
+@@ -271,7 +271,7 @@ static void giveback(struct dw_spi *dws)
+ transfer_list);
+
+ if (!last_transfer->cs_change)
+- spi_chip_sel(dws, dws->cur_msg->spi, 0);
++ spi_chip_sel(dws, msg->spi, 0);
+
+ spi_finalize_current_message(dws->master);
+ }
+@@ -547,8 +547,7 @@ static int dw_spi_setup(struct spi_device *spi)
+ /* Only alloc on first setup */
+ chip = spi_get_ctldata(spi);
+ if (!chip) {
+- chip = devm_kzalloc(&spi->dev, sizeof(struct chip_data),
+- GFP_KERNEL);
++ chip = kzalloc(sizeof(struct chip_data), GFP_KERNEL);
+ if (!chip)
+ return -ENOMEM;
+ spi_set_ctldata(spi, chip);
+@@ -606,6 +605,14 @@ static int dw_spi_setup(struct spi_device *spi)
+ return 0;
+ }
+
++static void dw_spi_cleanup(struct spi_device *spi)
++{
++ struct chip_data *chip = spi_get_ctldata(spi);
++
++ kfree(chip);
++ spi_set_ctldata(spi, NULL);
++}
++
+ /* Restart the controller, disable all interrupts, clean rx fifo */
+ static void spi_hw_init(struct dw_spi *dws)
+ {
+@@ -661,6 +668,7 @@ int dw_spi_add_host(struct device *dev, struct dw_spi *dws)
+ master->bus_num = dws->bus_num;
+ master->num_chipselect = dws->num_cs;
+ master->setup = dw_spi_setup;
++ master->cleanup = dw_spi_cleanup;
+ master->transfer_one_message = dw_spi_transfer_one_message;
+ master->max_speed_hz = dws->max_freq;
+
+diff --git a/drivers/spi/spi-fsl-espi.c b/drivers/spi/spi-fsl-espi.c
+index 8ebd724e4c59..429e11190265 100644
+--- a/drivers/spi/spi-fsl-espi.c
++++ b/drivers/spi/spi-fsl-espi.c
+@@ -452,16 +452,16 @@ static int fsl_espi_setup(struct spi_device *spi)
+ int retval;
+ u32 hw_mode;
+ u32 loop_mode;
+- struct spi_mpc8xxx_cs *cs = spi->controller_state;
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
+
+ if (!spi->max_speed_hz)
+ return -EINVAL;
+
+ if (!cs) {
+- cs = devm_kzalloc(&spi->dev, sizeof(*cs), GFP_KERNEL);
++ cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+ if (!cs)
+ return -ENOMEM;
+- spi->controller_state = cs;
++ spi_set_ctldata(spi, cs);
+ }
+
+ mpc8xxx_spi = spi_master_get_devdata(spi->master);
+@@ -496,6 +496,14 @@ static int fsl_espi_setup(struct spi_device *spi)
+ return 0;
+ }
+
++static void fsl_espi_cleanup(struct spi_device *spi)
++{
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
++
++ kfree(cs);
++ spi_set_ctldata(spi, NULL);
++}
++
+ void fsl_espi_cpu_irq(struct mpc8xxx_spi *mspi, u32 events)
+ {
+ struct fsl_espi_reg *reg_base = mspi->reg_base;
+@@ -605,6 +613,7 @@ static struct spi_master * fsl_espi_probe(struct device *dev,
+
+ master->bits_per_word_mask = SPI_BPW_RANGE_MASK(4, 16);
+ master->setup = fsl_espi_setup;
++ master->cleanup = fsl_espi_cleanup;
+
+ mpc8xxx_spi = spi_master_get_devdata(master);
+ mpc8xxx_spi->spi_do_one_msg = fsl_espi_do_one_msg;
+diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
+index 98ccd231bf00..bea26b719361 100644
+--- a/drivers/spi/spi-fsl-spi.c
++++ b/drivers/spi/spi-fsl-spi.c
+@@ -425,16 +425,16 @@ static int fsl_spi_setup(struct spi_device *spi)
+ struct fsl_spi_reg *reg_base;
+ int retval;
+ u32 hw_mode;
+- struct spi_mpc8xxx_cs *cs = spi->controller_state;
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
+
+ if (!spi->max_speed_hz)
+ return -EINVAL;
+
+ if (!cs) {
+- cs = devm_kzalloc(&spi->dev, sizeof(*cs), GFP_KERNEL);
++ cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+ if (!cs)
+ return -ENOMEM;
+- spi->controller_state = cs;
++ spi_set_ctldata(spi, cs);
+ }
+ mpc8xxx_spi = spi_master_get_devdata(spi->master);
+
+@@ -496,9 +496,13 @@ static int fsl_spi_setup(struct spi_device *spi)
+ static void fsl_spi_cleanup(struct spi_device *spi)
+ {
+ struct mpc8xxx_spi *mpc8xxx_spi = spi_master_get_devdata(spi->master);
++ struct spi_mpc8xxx_cs *cs = spi_get_ctldata(spi);
+
+ if (mpc8xxx_spi->type == TYPE_GRLIB && gpio_is_valid(spi->cs_gpio))
+ gpio_free(spi->cs_gpio);
++
++ kfree(cs);
++ spi_set_ctldata(spi, NULL);
+ }
+
+ static void fsl_spi_cpu_irq(struct mpc8xxx_spi *mspi, u32 events)
+diff --git a/drivers/spi/spi-omap2-mcspi.c b/drivers/spi/spi-omap2-mcspi.c
+index 68441fa448de..352eed7463ac 100644
+--- a/drivers/spi/spi-omap2-mcspi.c
++++ b/drivers/spi/spi-omap2-mcspi.c
+@@ -329,7 +329,8 @@ static void omap2_mcspi_set_fifo(const struct spi_device *spi,
+ disable_fifo:
+ if (t->rx_buf != NULL)
+ chconf &= ~OMAP2_MCSPI_CHCONF_FFER;
+- else
++
++ if (t->tx_buf != NULL)
+ chconf &= ~OMAP2_MCSPI_CHCONF_FFET;
+
+ mcspi_write_chconf0(spi, chconf);
+diff --git a/drivers/spi/spi-sirf.c b/drivers/spi/spi-sirf.c
+index 95ac276eaafe..1a5161336730 100644
+--- a/drivers/spi/spi-sirf.c
++++ b/drivers/spi/spi-sirf.c
+@@ -438,7 +438,8 @@ static void spi_sirfsoc_pio_transfer(struct spi_device *spi,
+ sspi->tx_word(sspi);
+ writel(SIRFSOC_SPI_TXFIFO_EMPTY_INT_EN |
+ SIRFSOC_SPI_TX_UFLOW_INT_EN |
+- SIRFSOC_SPI_RX_OFLOW_INT_EN,
++ SIRFSOC_SPI_RX_OFLOW_INT_EN |
++ SIRFSOC_SPI_RX_IO_DMA_INT_EN,
+ sspi->base + SIRFSOC_SPI_INT_EN);
+ writel(SIRFSOC_SPI_RX_EN | SIRFSOC_SPI_TX_EN,
+ sspi->base + SIRFSOC_SPI_TX_RX_EN);
+diff --git a/drivers/staging/iio/meter/ade7758_trigger.c b/drivers/staging/iio/meter/ade7758_trigger.c
+index 7a94ddd42f59..8c4f2896cd0d 100644
+--- a/drivers/staging/iio/meter/ade7758_trigger.c
++++ b/drivers/staging/iio/meter/ade7758_trigger.c
+@@ -85,7 +85,7 @@ int ade7758_probe_trigger(struct iio_dev *indio_dev)
+ ret = iio_trigger_register(st->trig);
+
+ /* select default trigger */
+- indio_dev->trig = st->trig;
++ indio_dev->trig = iio_trigger_get(st->trig);
+ if (ret)
+ goto error_free_irq;
+
+diff --git a/drivers/staging/imx-drm/imx-ldb.c b/drivers/staging/imx-drm/imx-ldb.c
+index 7e3f019d7e72..4662e00b456a 100644
+--- a/drivers/staging/imx-drm/imx-ldb.c
++++ b/drivers/staging/imx-drm/imx-ldb.c
+@@ -574,6 +574,9 @@ static void imx_ldb_unbind(struct device *dev, struct device *master,
+ for (i = 0; i < 2; i++) {
+ struct imx_ldb_channel *channel = &imx_ldb->channel[i];
+
++ if (!channel->connector.funcs)
++ continue;
++
+ channel->connector.funcs->destroy(&channel->connector);
+ channel->encoder.funcs->destroy(&channel->encoder);
+ }
+diff --git a/drivers/staging/imx-drm/ipuv3-plane.c b/drivers/staging/imx-drm/ipuv3-plane.c
+index 6f393a11f44d..50de10a550e9 100644
+--- a/drivers/staging/imx-drm/ipuv3-plane.c
++++ b/drivers/staging/imx-drm/ipuv3-plane.c
+@@ -281,7 +281,8 @@ static void ipu_plane_dpms(struct ipu_plane *ipu_plane, int mode)
+
+ ipu_idmac_put(ipu_plane->ipu_ch);
+ ipu_dmfc_put(ipu_plane->dmfc);
+- ipu_dp_put(ipu_plane->dp);
++ if (ipu_plane->dp)
++ ipu_dp_put(ipu_plane->dp);
+ }
+ }
+
+diff --git a/drivers/staging/lustre/lustre/Kconfig b/drivers/staging/lustre/lustre/Kconfig
+index 209e4c7e6f8a..4f65ba1158bf 100644
+--- a/drivers/staging/lustre/lustre/Kconfig
++++ b/drivers/staging/lustre/lustre/Kconfig
+@@ -57,4 +57,5 @@ config LUSTRE_TRANSLATE_ERRNOS
+ config LUSTRE_LLITE_LLOOP
+ tristate "Lustre virtual block device"
+ depends on LUSTRE_FS && BLOCK
++ depends on !PPC_64K_PAGES && !ARM64_64K_PAGES
+ default m
+diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
+index 1f4c794f5fcc..260c3e1e312c 100644
+--- a/drivers/target/iscsi/iscsi_target.c
++++ b/drivers/target/iscsi/iscsi_target.c
+@@ -4540,6 +4540,7 @@ static void iscsit_logout_post_handler_diffcid(
+ {
+ struct iscsi_conn *l_conn;
+ struct iscsi_session *sess = conn->sess;
++ bool conn_found = false;
+
+ if (!sess)
+ return;
+@@ -4548,12 +4549,13 @@ static void iscsit_logout_post_handler_diffcid(
+ list_for_each_entry(l_conn, &sess->sess_conn_list, conn_list) {
+ if (l_conn->cid == cid) {
+ iscsit_inc_conn_usage_count(l_conn);
++ conn_found = true;
+ break;
+ }
+ }
+ spin_unlock_bh(&sess->conn_lock);
+
+- if (!l_conn)
++ if (!conn_found)
+ return;
+
+ if (l_conn->sock)
+diff --git a/drivers/target/iscsi/iscsi_target_parameters.c b/drivers/target/iscsi/iscsi_target_parameters.c
+index 02f9de26f38a..18c29260b4a2 100644
+--- a/drivers/target/iscsi/iscsi_target_parameters.c
++++ b/drivers/target/iscsi/iscsi_target_parameters.c
+@@ -601,7 +601,7 @@ int iscsi_copy_param_list(
+ param_list = kzalloc(sizeof(struct iscsi_param_list), GFP_KERNEL);
+ if (!param_list) {
+ pr_err("Unable to allocate memory for struct iscsi_param_list.\n");
+- goto err_out;
++ return -1;
+ }
+ INIT_LIST_HEAD(&param_list->param_list);
+ INIT_LIST_HEAD(&param_list->extra_response_list);
+diff --git a/drivers/target/target_core_configfs.c b/drivers/target/target_core_configfs.c
+index bf55c5a04cfa..756def38c77a 100644
+--- a/drivers/target/target_core_configfs.c
++++ b/drivers/target/target_core_configfs.c
+@@ -2363,7 +2363,7 @@ static ssize_t target_core_alua_tg_pt_gp_store_attr_alua_support_##_name(\
+ pr_err("Invalid value '%ld', must be '0' or '1'\n", tmp); \
+ return -EINVAL; \
+ } \
+- if (!tmp) \
++ if (tmp) \
+ t->_var |= _bit; \
+ else \
+ t->_var &= ~_bit; \
+diff --git a/drivers/tty/serial/atmel_serial.c b/drivers/tty/serial/atmel_serial.c
+index c4f750314100..ffefec83a02f 100644
+--- a/drivers/tty/serial/atmel_serial.c
++++ b/drivers/tty/serial/atmel_serial.c
+@@ -527,6 +527,45 @@ static void atmel_enable_ms(struct uart_port *port)
+ }
+
+ /*
++ * Disable modem status interrupts
++ */
++static void atmel_disable_ms(struct uart_port *port)
++{
++ struct atmel_uart_port *atmel_port = to_atmel_uart_port(port);
++ uint32_t idr = 0;
++
++ /*
++ * Interrupt should not be disabled twice
++ */
++ if (!atmel_port->ms_irq_enabled)
++ return;
++
++ atmel_port->ms_irq_enabled = false;
++
++ if (atmel_port->gpio_irq[UART_GPIO_CTS] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_CTS]);
++ else
++ idr |= ATMEL_US_CTSIC;
++
++ if (atmel_port->gpio_irq[UART_GPIO_DSR] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_DSR]);
++ else
++ idr |= ATMEL_US_DSRIC;
++
++ if (atmel_port->gpio_irq[UART_GPIO_RI] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_RI]);
++ else
++ idr |= ATMEL_US_RIIC;
++
++ if (atmel_port->gpio_irq[UART_GPIO_DCD] >= 0)
++ disable_irq(atmel_port->gpio_irq[UART_GPIO_DCD]);
++ else
++ idr |= ATMEL_US_DCDIC;
++
++ UART_PUT_IDR(port, idr);
++}
++
++/*
+ * Control the transmission of a break signal
+ */
+ static void atmel_break_ctl(struct uart_port *port, int break_state)
+@@ -1993,7 +2032,9 @@ static void atmel_set_termios(struct uart_port *port, struct ktermios *termios,
+
+ /* CTS flow-control and modem-status interrupts */
+ if (UART_ENABLE_MS(port, termios->c_cflag))
+- port->ops->enable_ms(port);
++ atmel_enable_ms(port);
++ else
++ atmel_disable_ms(port);
+
+ spin_unlock_irqrestore(&port->lock, flags);
+ }
+diff --git a/drivers/usb/chipidea/ci_hdrc_msm.c b/drivers/usb/chipidea/ci_hdrc_msm.c
+index d72b9d2de2c5..4935ac38fd00 100644
+--- a/drivers/usb/chipidea/ci_hdrc_msm.c
++++ b/drivers/usb/chipidea/ci_hdrc_msm.c
+@@ -20,13 +20,13 @@
+ static void ci_hdrc_msm_notify_event(struct ci_hdrc *ci, unsigned event)
+ {
+ struct device *dev = ci->gadget.dev.parent;
+- int val;
+
+ switch (event) {
+ case CI_HDRC_CONTROLLER_RESET_EVENT:
+ dev_dbg(dev, "CI_HDRC_CONTROLLER_RESET_EVENT received\n");
+ writel(0, USB_AHBBURST);
+ writel(0, USB_AHBMODE);
++ usb_phy_init(ci->transceiver);
+ break;
+ case CI_HDRC_CONTROLLER_STOPPED_EVENT:
+ dev_dbg(dev, "CI_HDRC_CONTROLLER_STOPPED_EVENT received\n");
+@@ -34,10 +34,7 @@ static void ci_hdrc_msm_notify_event(struct ci_hdrc *ci, unsigned event)
+ * Put the transceiver in non-driving mode. Otherwise host
+ * may not detect soft-disconnection.
+ */
+- val = usb_phy_io_read(ci->transceiver, ULPI_FUNC_CTRL);
+- val &= ~ULPI_FUNC_CTRL_OPMODE_MASK;
+- val |= ULPI_FUNC_CTRL_OPMODE_NONDRIVING;
+- usb_phy_io_write(ci->transceiver, val, ULPI_FUNC_CTRL);
++ usb_phy_notify_disconnect(ci->transceiver, USB_SPEED_UNKNOWN);
+ break;
+ default:
+ dev_dbg(dev, "unknown ci_hdrc event\n");
+diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
+index 27f217107ef1..50e854509f55 100644
+--- a/drivers/usb/core/hub.c
++++ b/drivers/usb/core/hub.c
+@@ -5008,9 +5008,10 @@ static void hub_events(void)
+
+ hub = list_entry(tmp, struct usb_hub, event_list);
+ kref_get(&hub->kref);
++ hdev = hub->hdev;
++ usb_get_dev(hdev);
+ spin_unlock_irq(&hub_event_lock);
+
+- hdev = hub->hdev;
+ hub_dev = hub->intfdev;
+ intf = to_usb_interface(hub_dev);
+ dev_dbg(hub_dev, "state %d ports %d chg %04x evt %04x\n",
+@@ -5123,6 +5124,7 @@ static void hub_events(void)
+ usb_autopm_put_interface(intf);
+ loop_disconnected:
+ usb_unlock_device(hdev);
++ usb_put_dev(hdev);
+ kref_put(&hub->kref, hub_release);
+
+ } /* end while (1) */
+diff --git a/drivers/usb/dwc2/gadget.c b/drivers/usb/dwc2/gadget.c
+index f3c56a2fed5b..a0d2f31b30cc 100644
+--- a/drivers/usb/dwc2/gadget.c
++++ b/drivers/usb/dwc2/gadget.c
+@@ -1650,6 +1650,7 @@ static void s3c_hsotg_txfifo_flush(struct s3c_hsotg *hsotg, unsigned int idx)
+ dev_err(hsotg->dev,
+ "%s: timeout flushing fifo (GRSTCTL=%08x)\n",
+ __func__, val);
++ break;
+ }
+
+ udelay(1);
+@@ -2748,13 +2749,14 @@ static void s3c_hsotg_phy_enable(struct s3c_hsotg *hsotg)
+
+ dev_dbg(hsotg->dev, "pdev 0x%p\n", pdev);
+
+- if (hsotg->phy) {
+- phy_init(hsotg->phy);
+- phy_power_on(hsotg->phy);
+- } else if (hsotg->uphy)
++ if (hsotg->uphy)
+ usb_phy_init(hsotg->uphy);
+- else if (hsotg->plat->phy_init)
++ else if (hsotg->plat && hsotg->plat->phy_init)
+ hsotg->plat->phy_init(pdev, hsotg->plat->phy_type);
++ else {
++ phy_init(hsotg->phy);
++ phy_power_on(hsotg->phy);
++ }
+ }
+
+ /**
+@@ -2768,13 +2770,14 @@ static void s3c_hsotg_phy_disable(struct s3c_hsotg *hsotg)
+ {
+ struct platform_device *pdev = to_platform_device(hsotg->dev);
+
+- if (hsotg->phy) {
+- phy_power_off(hsotg->phy);
+- phy_exit(hsotg->phy);
+- } else if (hsotg->uphy)
++ if (hsotg->uphy)
+ usb_phy_shutdown(hsotg->uphy);
+- else if (hsotg->plat->phy_exit)
++ else if (hsotg->plat && hsotg->plat->phy_exit)
+ hsotg->plat->phy_exit(pdev, hsotg->plat->phy_type);
++ else {
++ phy_power_off(hsotg->phy);
++ phy_exit(hsotg->phy);
++ }
+ }
+
+ /**
+@@ -2893,13 +2896,11 @@ static int s3c_hsotg_udc_stop(struct usb_gadget *gadget,
+ return -ENODEV;
+
+ /* all endpoints should be shutdown */
+- for (ep = 0; ep < hsotg->num_of_eps; ep++)
++ for (ep = 1; ep < hsotg->num_of_eps; ep++)
+ s3c_hsotg_ep_disable(&hsotg->eps[ep].ep);
+
+ spin_lock_irqsave(&hsotg->lock, flags);
+
+- s3c_hsotg_phy_disable(hsotg);
+-
+ if (!driver)
+ hsotg->driver = NULL;
+
+@@ -2942,7 +2943,6 @@ static int s3c_hsotg_pullup(struct usb_gadget *gadget, int is_on)
+ s3c_hsotg_phy_enable(hsotg);
+ s3c_hsotg_core_init(hsotg);
+ } else {
+- s3c_hsotg_disconnect(hsotg);
+ s3c_hsotg_phy_disable(hsotg);
+ }
+
+@@ -3444,13 +3444,6 @@ static int s3c_hsotg_probe(struct platform_device *pdev)
+
+ hsotg->irq = ret;
+
+- ret = devm_request_irq(&pdev->dev, hsotg->irq, s3c_hsotg_irq, 0,
+- dev_name(dev), hsotg);
+- if (ret < 0) {
+- dev_err(dev, "cannot claim IRQ\n");
+- goto err_clk;
+- }
+-
+ dev_info(dev, "regs %p, irq %d\n", hsotg->regs, hsotg->irq);
+
+ hsotg->gadget.max_speed = USB_SPEED_HIGH;
+@@ -3491,9 +3484,6 @@ static int s3c_hsotg_probe(struct platform_device *pdev)
+ if (hsotg->phy && (phy_get_bus_width(phy) == 8))
+ hsotg->phyif = GUSBCFG_PHYIF8;
+
+- if (hsotg->phy)
+- phy_init(hsotg->phy);
+-
+ /* usb phy enable */
+ s3c_hsotg_phy_enable(hsotg);
+
+@@ -3501,6 +3491,17 @@ static int s3c_hsotg_probe(struct platform_device *pdev)
+ s3c_hsotg_init(hsotg);
+ s3c_hsotg_hw_cfg(hsotg);
+
++ ret = devm_request_irq(&pdev->dev, hsotg->irq, s3c_hsotg_irq, 0,
++ dev_name(dev), hsotg);
++ if (ret < 0) {
++ s3c_hsotg_phy_disable(hsotg);
++ clk_disable_unprepare(hsotg->clk);
++ regulator_bulk_disable(ARRAY_SIZE(hsotg->supplies),
++ hsotg->supplies);
++ dev_err(dev, "cannot claim IRQ\n");
++ goto err_clk;
++ }
++
+ /* hsotg->num_of_eps holds number of EPs other than ep0 */
+
+ if (hsotg->num_of_eps == 0) {
+@@ -3586,9 +3587,6 @@ static int s3c_hsotg_remove(struct platform_device *pdev)
+ usb_gadget_unregister_driver(hsotg->driver);
+ }
+
+- s3c_hsotg_phy_disable(hsotg);
+- if (hsotg->phy)
+- phy_exit(hsotg->phy);
+ clk_disable_unprepare(hsotg->clk);
+
+ return 0;
+diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
+index eb69eb9f06c8..52b30c5b000e 100644
+--- a/drivers/usb/dwc3/core.c
++++ b/drivers/usb/dwc3/core.c
+@@ -786,20 +786,21 @@ static int dwc3_remove(struct platform_device *pdev)
+ {
+ struct dwc3 *dwc = platform_get_drvdata(pdev);
+
++ dwc3_debugfs_exit(dwc);
++ dwc3_core_exit_mode(dwc);
++ dwc3_event_buffers_cleanup(dwc);
++ dwc3_free_event_buffers(dwc);
++
+ usb_phy_set_suspend(dwc->usb2_phy, 1);
+ usb_phy_set_suspend(dwc->usb3_phy, 1);
+ phy_power_off(dwc->usb2_generic_phy);
+ phy_power_off(dwc->usb3_generic_phy);
+
++ dwc3_core_exit(dwc);
++
+ pm_runtime_put_sync(&pdev->dev);
+ pm_runtime_disable(&pdev->dev);
+
+- dwc3_debugfs_exit(dwc);
+- dwc3_core_exit_mode(dwc);
+- dwc3_event_buffers_cleanup(dwc);
+- dwc3_free_event_buffers(dwc);
+- dwc3_core_exit(dwc);
+-
+ return 0;
+ }
+
+diff --git a/drivers/usb/dwc3/dwc3-omap.c b/drivers/usb/dwc3/dwc3-omap.c
+index 07a736acd0f2..3536ad7f1346 100644
+--- a/drivers/usb/dwc3/dwc3-omap.c
++++ b/drivers/usb/dwc3/dwc3-omap.c
+@@ -576,9 +576,9 @@ static int dwc3_omap_remove(struct platform_device *pdev)
+ if (omap->extcon_id_dev.edev)
+ extcon_unregister_interest(&omap->extcon_id_dev);
+ dwc3_omap_disable_irqs(omap);
++ device_for_each_child(&pdev->dev, NULL, dwc3_omap_remove_core);
+ pm_runtime_put_sync(&pdev->dev);
+ pm_runtime_disable(&pdev->dev);
+- device_for_each_child(&pdev->dev, NULL, dwc3_omap_remove_core);
+
+ return 0;
+ }
+diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
+index dab7927d1009..f5b352a19eb0 100644
+--- a/drivers/usb/dwc3/gadget.c
++++ b/drivers/usb/dwc3/gadget.c
+@@ -527,7 +527,7 @@ static int dwc3_gadget_set_ep_config(struct dwc3 *dwc, struct dwc3_ep *dep,
+ dep->stream_capable = true;
+ }
+
+- if (usb_endpoint_xfer_isoc(desc))
++ if (!usb_endpoint_xfer_control(desc))
+ params.param1 |= DWC3_DEPCFG_XFER_IN_PROGRESS_EN;
+
+ /*
+@@ -2042,12 +2042,6 @@ static void dwc3_endpoint_interrupt(struct dwc3 *dwc,
+ dwc3_endpoint_transfer_complete(dwc, dep, event, 1);
+ break;
+ case DWC3_DEPEVT_XFERINPROGRESS:
+- if (!usb_endpoint_xfer_isoc(dep->endpoint.desc)) {
+- dev_dbg(dwc->dev, "%s is not an Isochronous endpoint\n",
+- dep->name);
+- return;
+- }
+-
+ dwc3_endpoint_transfer_complete(dwc, dep, event, 0);
+ break;
+ case DWC3_DEPEVT_XFERNOTREADY:
+diff --git a/drivers/usb/gadget/f_rndis.c b/drivers/usb/gadget/f_rndis.c
+index 9c41e9515b8e..ddb09dc6d1f2 100644
+--- a/drivers/usb/gadget/f_rndis.c
++++ b/drivers/usb/gadget/f_rndis.c
+@@ -727,6 +727,10 @@ rndis_bind(struct usb_configuration *c, struct usb_function *f)
+ rndis_control_intf.bInterfaceNumber = status;
+ rndis_union_desc.bMasterInterface0 = status;
+
++ if (cdev->use_os_string)
++ f->os_desc_table[0].if_id =
++ rndis_iad_descriptor.bFirstInterface;
++
+ status = usb_interface_id(c, f);
+ if (status < 0)
+ goto fail;
+diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
+index 81cda09b47e3..488a30836c36 100644
+--- a/drivers/usb/host/ehci-hcd.c
++++ b/drivers/usb/host/ehci-hcd.c
+@@ -965,8 +965,6 @@ rescan:
+ }
+
+ qh->exception = 1;
+- if (ehci->rh_state < EHCI_RH_RUNNING)
+- qh->qh_state = QH_STATE_IDLE;
+ switch (qh->qh_state) {
+ case QH_STATE_LINKED:
+ WARN_ON(!list_empty(&qh->qtd_list));
+diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
+index aa79e8749040..69aece31143a 100644
+--- a/drivers/usb/host/xhci-hub.c
++++ b/drivers/usb/host/xhci-hub.c
+@@ -468,7 +468,8 @@ static void xhci_hub_report_usb2_link_state(u32 *status, u32 status_reg)
+ }
+
+ /* Updates Link Status for super Speed port */
+-static void xhci_hub_report_usb3_link_state(u32 *status, u32 status_reg)
++static void xhci_hub_report_usb3_link_state(struct xhci_hcd *xhci,
++ u32 *status, u32 status_reg)
+ {
+ u32 pls = status_reg & PORT_PLS_MASK;
+
+@@ -507,7 +508,8 @@ static void xhci_hub_report_usb3_link_state(u32 *status, u32 status_reg)
+ * in which sometimes the port enters compliance mode
+ * caused by a delay on the host-device negotiation.
+ */
+- if (pls == USB_SS_PORT_LS_COMP_MOD)
++ if ((xhci->quirks & XHCI_COMP_MODE_QUIRK) &&
++ (pls == USB_SS_PORT_LS_COMP_MOD))
+ pls |= USB_PORT_STAT_CONNECTION;
+ }
+
+@@ -666,7 +668,7 @@ static u32 xhci_get_port_status(struct usb_hcd *hcd,
+ }
+ /* Update Port Link State */
+ if (hcd->speed == HCD_USB3) {
+- xhci_hub_report_usb3_link_state(&status, raw_port_status);
++ xhci_hub_report_usb3_link_state(xhci, &status, raw_port_status);
+ /*
+ * Verify if all USB3 Ports Have entered U0 already.
+ * Delete Compliance Mode Timer if so.
+diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
+index 8056d90690ee..8936211b161d 100644
+--- a/drivers/usb/host/xhci-mem.c
++++ b/drivers/usb/host/xhci-mem.c
+@@ -1812,6 +1812,7 @@ void xhci_mem_cleanup(struct xhci_hcd *xhci)
+
+ if (xhci->lpm_command)
+ xhci_free_command(xhci, xhci->lpm_command);
++ xhci->lpm_command = NULL;
+ if (xhci->cmd_ring)
+ xhci_ring_free(xhci, xhci->cmd_ring);
+ xhci->cmd_ring = NULL;
+@@ -1819,7 +1820,7 @@ void xhci_mem_cleanup(struct xhci_hcd *xhci)
+ xhci_cleanup_command_queue(xhci);
+
+ num_ports = HCS_MAX_PORTS(xhci->hcs_params1);
+- for (i = 0; i < num_ports; i++) {
++ for (i = 0; i < num_ports && xhci->rh_bw; i++) {
+ struct xhci_interval_bw_table *bwt = &xhci->rh_bw[i].bw_table;
+ for (j = 0; j < XHCI_MAX_INTERVAL; j++) {
+ struct list_head *ep = &bwt->interval_bw[j].endpoints;
+diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
+index e32cc6cf86dc..2d1284adc987 100644
+--- a/drivers/usb/host/xhci.c
++++ b/drivers/usb/host/xhci.c
+@@ -3982,13 +3982,21 @@ static int __maybe_unused xhci_change_max_exit_latency(struct xhci_hcd *xhci,
+ int ret;
+
+ spin_lock_irqsave(&xhci->lock, flags);
+- if (max_exit_latency == xhci->devs[udev->slot_id]->current_mel) {
++
++ virt_dev = xhci->devs[udev->slot_id];
++
++ /*
++ * virt_dev might not exists yet if xHC resumed from hibernate (S4) and
++ * xHC was re-initialized. Exit latency will be set later after
++ * hub_port_finish_reset() is done and xhci->devs[] are re-allocated
++ */
++
++ if (!virt_dev || max_exit_latency == virt_dev->current_mel) {
+ spin_unlock_irqrestore(&xhci->lock, flags);
+ return 0;
+ }
+
+ /* Attempt to issue an Evaluate Context command to change the MEL. */
+- virt_dev = xhci->devs[udev->slot_id];
+ command = xhci->lpm_command;
+ ctrl_ctx = xhci_get_input_control_ctx(xhci, command->in_ctx);
+ if (!ctrl_ctx) {
+diff --git a/drivers/usb/misc/sisusbvga/sisusb.c b/drivers/usb/misc/sisusbvga/sisusb.c
+index 06b5d77cd9ad..633caf643122 100644
+--- a/drivers/usb/misc/sisusbvga/sisusb.c
++++ b/drivers/usb/misc/sisusbvga/sisusb.c
+@@ -3250,6 +3250,7 @@ static const struct usb_device_id sisusb_table[] = {
+ { USB_DEVICE(0x0711, 0x0918) },
+ { USB_DEVICE(0x0711, 0x0920) },
+ { USB_DEVICE(0x0711, 0x0950) },
++ { USB_DEVICE(0x0711, 0x5200) },
+ { USB_DEVICE(0x182d, 0x021c) },
+ { USB_DEVICE(0x182d, 0x0269) },
+ { }
+diff --git a/drivers/usb/phy/phy-tegra-usb.c b/drivers/usb/phy/phy-tegra-usb.c
+index bbe4f8e6e8d7..8834b70c868c 100644
+--- a/drivers/usb/phy/phy-tegra-usb.c
++++ b/drivers/usb/phy/phy-tegra-usb.c
+@@ -881,8 +881,8 @@ static int utmi_phy_probe(struct tegra_usb_phy *tegra_phy,
+ return -ENOMEM;
+ }
+
+- tegra_phy->config = devm_kzalloc(&pdev->dev,
+- sizeof(*tegra_phy->config), GFP_KERNEL);
++ tegra_phy->config = devm_kzalloc(&pdev->dev, sizeof(*config),
++ GFP_KERNEL);
+ if (!tegra_phy->config) {
+ dev_err(&pdev->dev,
+ "unable to allocate memory for USB UTMIP config\n");
+diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
+index 8b0f517abb6b..3614620e09e1 100644
+--- a/drivers/usb/serial/ftdi_sio.c
++++ b/drivers/usb/serial/ftdi_sio.c
+@@ -741,6 +741,7 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, FTDI_NDI_AURORA_SCU_PID),
+ .driver_info = (kernel_ulong_t)&ftdi_NDI_device_quirk },
+ { USB_DEVICE(TELLDUS_VID, TELLDUS_TELLSTICK_PID) },
++ { USB_DEVICE(NOVITUS_VID, NOVITUS_BONO_E_PID) },
+ { USB_DEVICE(RTSYSTEMS_VID, RTSYSTEMS_USB_S03_PID) },
+ { USB_DEVICE(RTSYSTEMS_VID, RTSYSTEMS_USB_59_PID) },
+ { USB_DEVICE(RTSYSTEMS_VID, RTSYSTEMS_USB_57A_PID) },
+@@ -952,6 +953,8 @@ static const struct usb_device_id id_table_combined[] = {
+ { USB_DEVICE(FTDI_VID, FTDI_EKEY_CONV_USB_PID) },
+ /* Infineon Devices */
+ { USB_DEVICE_INTERFACE_NUMBER(INFINEON_VID, INFINEON_TRIBOARD_PID, 1) },
++ /* GE Healthcare devices */
++ { USB_DEVICE(GE_HEALTHCARE_VID, GE_HEALTHCARE_NEMO_TRACKER_PID) },
+ { } /* Terminating entry */
+ };
+
+diff --git a/drivers/usb/serial/ftdi_sio_ids.h b/drivers/usb/serial/ftdi_sio_ids.h
+index 70b0b1d88ae9..5937b2d242f2 100644
+--- a/drivers/usb/serial/ftdi_sio_ids.h
++++ b/drivers/usb/serial/ftdi_sio_ids.h
+@@ -837,6 +837,12 @@
+ #define TELLDUS_TELLSTICK_PID 0x0C30 /* RF control dongle 433 MHz using FT232RL */
+
+ /*
++ * NOVITUS printers
++ */
++#define NOVITUS_VID 0x1a28
++#define NOVITUS_BONO_E_PID 0x6010
++
++/*
+ * RT Systems programming cables for various ham radios
+ */
+ #define RTSYSTEMS_VID 0x2100 /* Vendor ID */
+@@ -1385,3 +1391,9 @@
+ * ekey biometric systems GmbH (http://ekey.net/)
+ */
+ #define FTDI_EKEY_CONV_USB_PID 0xCB08 /* Converter USB */
++
++/*
++ * GE Healthcare devices
++ */
++#define GE_HEALTHCARE_VID 0x1901
++#define GE_HEALTHCARE_NEMO_TRACKER_PID 0x0015
+diff --git a/drivers/usb/serial/option.c b/drivers/usb/serial/option.c
+index a9688940543d..54a8120897a6 100644
+--- a/drivers/usb/serial/option.c
++++ b/drivers/usb/serial/option.c
+@@ -275,8 +275,12 @@ static void option_instat_callback(struct urb *urb);
+ #define ZTE_PRODUCT_MF622 0x0001
+ #define ZTE_PRODUCT_MF628 0x0015
+ #define ZTE_PRODUCT_MF626 0x0031
+-#define ZTE_PRODUCT_MC2718 0xffe8
+ #define ZTE_PRODUCT_AC2726 0xfff1
++#define ZTE_PRODUCT_CDMA_TECH 0xfffe
++#define ZTE_PRODUCT_AC8710T 0xffff
++#define ZTE_PRODUCT_MC2718 0xffe8
++#define ZTE_PRODUCT_AD3812 0xffeb
++#define ZTE_PRODUCT_MC2716 0xffed
+
+ #define BENQ_VENDOR_ID 0x04a5
+ #define BENQ_PRODUCT_H10 0x4068
+@@ -494,6 +498,10 @@ static void option_instat_callback(struct urb *urb);
+ #define INOVIA_VENDOR_ID 0x20a6
+ #define INOVIA_SEW858 0x1105
+
++/* VIA Telecom */
++#define VIATELECOM_VENDOR_ID 0x15eb
++#define VIATELECOM_PRODUCT_CDS7 0x0001
++
+ /* some devices interfaces need special handling due to a number of reasons */
+ enum option_blacklist_reason {
+ OPTION_BLACKLIST_NONE = 0,
+@@ -527,10 +535,18 @@ static const struct option_blacklist_info zte_k3765_z_blacklist = {
+ .reserved = BIT(4),
+ };
+
++static const struct option_blacklist_info zte_ad3812_z_blacklist = {
++ .sendsetup = BIT(0) | BIT(1) | BIT(2),
++};
++
+ static const struct option_blacklist_info zte_mc2718_z_blacklist = {
+ .sendsetup = BIT(1) | BIT(2) | BIT(3) | BIT(4),
+ };
+
++static const struct option_blacklist_info zte_mc2716_z_blacklist = {
++ .sendsetup = BIT(1) | BIT(2) | BIT(3),
++};
++
+ static const struct option_blacklist_info huawei_cdc12_blacklist = {
+ .reserved = BIT(1) | BIT(2),
+ };
+@@ -1070,6 +1086,7 @@ static const struct usb_device_id option_ids[] = {
+ { USB_DEVICE_INTERFACE_CLASS(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_1012, 0xff) },
+ { USB_DEVICE(KYOCERA_VENDOR_ID, KYOCERA_PRODUCT_KPC650) },
+ { USB_DEVICE(KYOCERA_VENDOR_ID, KYOCERA_PRODUCT_KPC680) },
++ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x6000)}, /* ZTE AC8700 */
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x6613)}, /* Onda H600/ZTE MF330 */
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x0023)}, /* ONYX 3G device */
+ { USB_DEVICE(QUALCOMM_VENDOR_ID, 0x9000)}, /* SIMCom SIM5218 */
+@@ -1544,13 +1561,18 @@ static const struct usb_device_id option_ids[] = {
+ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff93, 0xff, 0xff, 0xff) },
+ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff94, 0xff, 0xff, 0xff) },
+
+- /* NOTE: most ZTE CDMA devices should be driven by zte_ev, not option */
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_CDMA_TECH, 0xff, 0xff, 0xff) },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AC2726, 0xff, 0xff, 0xff) },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AC8710T, 0xff, 0xff, 0xff) },
+ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_MC2718, 0xff, 0xff, 0xff),
+ .driver_info = (kernel_ulong_t)&zte_mc2718_z_blacklist },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AD3812, 0xff, 0xff, 0xff),
++ .driver_info = (kernel_ulong_t)&zte_ad3812_z_blacklist },
++ { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_MC2716, 0xff, 0xff, 0xff),
++ .driver_info = (kernel_ulong_t)&zte_mc2716_z_blacklist },
+ { USB_VENDOR_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff, 0x02, 0x01) },
+ { USB_VENDOR_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff, 0x02, 0x05) },
+ { USB_VENDOR_AND_INTERFACE_INFO(ZTE_VENDOR_ID, 0xff, 0x86, 0x10) },
+- { USB_DEVICE_AND_INTERFACE_INFO(ZTE_VENDOR_ID, ZTE_PRODUCT_AC2726, 0xff, 0xff, 0xff) },
+
+ { USB_DEVICE(BENQ_VENDOR_ID, BENQ_PRODUCT_H10) },
+ { USB_DEVICE(DLINK_VENDOR_ID, DLINK_PRODUCT_DWM_652) },
+@@ -1724,6 +1746,7 @@ static const struct usb_device_id option_ids[] = {
+ { USB_DEVICE_AND_INTERFACE_INFO(0x07d1, 0x3e01, 0xff, 0xff, 0xff) }, /* D-Link DWM-152/C1 */
+ { USB_DEVICE_AND_INTERFACE_INFO(0x07d1, 0x3e02, 0xff, 0xff, 0xff) }, /* D-Link DWM-156/C1 */
+ { USB_DEVICE(INOVIA_VENDOR_ID, INOVIA_SEW858) },
++ { USB_DEVICE(VIATELECOM_VENDOR_ID, VIATELECOM_PRODUCT_CDS7) },
+ { } /* Terminating entry */
+ };
+ MODULE_DEVICE_TABLE(usb, option_ids);
+@@ -1916,6 +1939,8 @@ static void option_instat_callback(struct urb *urb)
+ dev_dbg(dev, "%s: type %x req %x\n", __func__,
+ req_pkt->bRequestType, req_pkt->bRequest);
+ }
++ } else if (status == -ENOENT || status == -ESHUTDOWN) {
++ dev_dbg(dev, "%s: urb stopped: %d\n", __func__, status);
+ } else
+ dev_err(dev, "%s: error %d\n", __func__, status);
+
+diff --git a/drivers/usb/serial/pl2303.c b/drivers/usb/serial/pl2303.c
+index b3d5a35c0d4b..e9bad928039f 100644
+--- a/drivers/usb/serial/pl2303.c
++++ b/drivers/usb/serial/pl2303.c
+@@ -45,6 +45,7 @@ static const struct usb_device_id id_table[] = {
+ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_GPRS) },
+ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_HCR331) },
+ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_MOTOROLA) },
++ { USB_DEVICE(PL2303_VENDOR_ID, PL2303_PRODUCT_ID_ZTEK) },
+ { USB_DEVICE(IODATA_VENDOR_ID, IODATA_PRODUCT_ID) },
+ { USB_DEVICE(IODATA_VENDOR_ID, IODATA_PRODUCT_ID_RSAQ5) },
+ { USB_DEVICE(ATEN_VENDOR_ID, ATEN_PRODUCT_ID) },
+diff --git a/drivers/usb/serial/pl2303.h b/drivers/usb/serial/pl2303.h
+index 42bc082896ac..71fd9da1d6e7 100644
+--- a/drivers/usb/serial/pl2303.h
++++ b/drivers/usb/serial/pl2303.h
+@@ -22,6 +22,7 @@
+ #define PL2303_PRODUCT_ID_GPRS 0x0609
+ #define PL2303_PRODUCT_ID_HCR331 0x331a
+ #define PL2303_PRODUCT_ID_MOTOROLA 0x0307
++#define PL2303_PRODUCT_ID_ZTEK 0xe1f1
+
+ #define ATEN_VENDOR_ID 0x0557
+ #define ATEN_VENDOR_ID2 0x0547
+diff --git a/drivers/usb/serial/sierra.c b/drivers/usb/serial/sierra.c
+index 6f7f01eb556a..46179a0828eb 100644
+--- a/drivers/usb/serial/sierra.c
++++ b/drivers/usb/serial/sierra.c
+@@ -282,14 +282,19 @@ static const struct usb_device_id id_table[] = {
+ /* Sierra Wireless HSPA Non-Composite Device */
+ { USB_DEVICE_AND_INTERFACE_INFO(0x1199, 0x6892, 0xFF, 0xFF, 0xFF)},
+ { USB_DEVICE(0x1199, 0x6893) }, /* Sierra Wireless Device */
+- { USB_DEVICE(0x1199, 0x68A3), /* Sierra Wireless Direct IP modems */
++ /* Sierra Wireless Direct IP modems */
++ { USB_DEVICE_AND_INTERFACE_INFO(0x1199, 0x68A3, 0xFF, 0xFF, 0xFF),
++ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
++ },
++ { USB_DEVICE_AND_INTERFACE_INFO(0x1199, 0x68AA, 0xFF, 0xFF, 0xFF),
+ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
+ },
+ /* AT&T Direct IP LTE modems */
+ { USB_DEVICE_AND_INTERFACE_INFO(0x0F3D, 0x68AA, 0xFF, 0xFF, 0xFF),
+ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
+ },
+- { USB_DEVICE(0x0f3d, 0x68A3), /* Airprime/Sierra Wireless Direct IP modems */
++ /* Airprime/Sierra Wireless Direct IP modems */
++ { USB_DEVICE_AND_INTERFACE_INFO(0x0F3D, 0x68A3, 0xFF, 0xFF, 0xFF),
+ .driver_info = (kernel_ulong_t)&direct_ip_interface_blacklist
+ },
+
+diff --git a/drivers/usb/serial/usb-serial.c b/drivers/usb/serial/usb-serial.c
+index 02de3110fe94..475723c006f9 100644
+--- a/drivers/usb/serial/usb-serial.c
++++ b/drivers/usb/serial/usb-serial.c
+@@ -764,29 +764,39 @@ static int usb_serial_probe(struct usb_interface *interface,
+ if (usb_endpoint_is_bulk_in(endpoint)) {
+ /* we found a bulk in endpoint */
+ dev_dbg(ddev, "found bulk in on endpoint %d\n", i);
+- bulk_in_endpoint[num_bulk_in] = endpoint;
+- ++num_bulk_in;
++ if (num_bulk_in < MAX_NUM_PORTS) {
++ bulk_in_endpoint[num_bulk_in] = endpoint;
++ ++num_bulk_in;
++ }
+ }
+
+ if (usb_endpoint_is_bulk_out(endpoint)) {
+ /* we found a bulk out endpoint */
+ dev_dbg(ddev, "found bulk out on endpoint %d\n", i);
+- bulk_out_endpoint[num_bulk_out] = endpoint;
+- ++num_bulk_out;
++ if (num_bulk_out < MAX_NUM_PORTS) {
++ bulk_out_endpoint[num_bulk_out] = endpoint;
++ ++num_bulk_out;
++ }
+ }
+
+ if (usb_endpoint_is_int_in(endpoint)) {
+ /* we found a interrupt in endpoint */
+ dev_dbg(ddev, "found interrupt in on endpoint %d\n", i);
+- interrupt_in_endpoint[num_interrupt_in] = endpoint;
+- ++num_interrupt_in;
++ if (num_interrupt_in < MAX_NUM_PORTS) {
++ interrupt_in_endpoint[num_interrupt_in] =
++ endpoint;
++ ++num_interrupt_in;
++ }
+ }
+
+ if (usb_endpoint_is_int_out(endpoint)) {
+ /* we found an interrupt out endpoint */
+ dev_dbg(ddev, "found interrupt out on endpoint %d\n", i);
+- interrupt_out_endpoint[num_interrupt_out] = endpoint;
+- ++num_interrupt_out;
++ if (num_interrupt_out < MAX_NUM_PORTS) {
++ interrupt_out_endpoint[num_interrupt_out] =
++ endpoint;
++ ++num_interrupt_out;
++ }
+ }
+ }
+
+@@ -809,8 +819,10 @@ static int usb_serial_probe(struct usb_interface *interface,
+ if (usb_endpoint_is_int_in(endpoint)) {
+ /* we found a interrupt in endpoint */
+ dev_dbg(ddev, "found interrupt in for Prolific device on separate interface\n");
+- interrupt_in_endpoint[num_interrupt_in] = endpoint;
+- ++num_interrupt_in;
++ if (num_interrupt_in < MAX_NUM_PORTS) {
++ interrupt_in_endpoint[num_interrupt_in] = endpoint;
++ ++num_interrupt_in;
++ }
+ }
+ }
+ }
+@@ -850,6 +862,11 @@ static int usb_serial_probe(struct usb_interface *interface,
+ num_ports = type->num_ports;
+ }
+
++ if (num_ports > MAX_NUM_PORTS) {
++ dev_warn(ddev, "too many ports requested: %d\n", num_ports);
++ num_ports = MAX_NUM_PORTS;
++ }
++
+ serial->num_ports = num_ports;
+ serial->num_bulk_in = num_bulk_in;
+ serial->num_bulk_out = num_bulk_out;
+diff --git a/drivers/usb/serial/zte_ev.c b/drivers/usb/serial/zte_ev.c
+index e40ab739c4a6..c9bb107d5e5c 100644
+--- a/drivers/usb/serial/zte_ev.c
++++ b/drivers/usb/serial/zte_ev.c
+@@ -272,28 +272,16 @@ static void zte_ev_usb_serial_close(struct usb_serial_port *port)
+ }
+
+ static const struct usb_device_id id_table[] = {
+- /* AC8710, AC8710T */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xffff, 0xff, 0xff, 0xff) },
+- /* AC8700 */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xfffe, 0xff, 0xff, 0xff) },
+- /* MG880 */
+- { USB_DEVICE(0x19d2, 0xfffd) },
+- { USB_DEVICE(0x19d2, 0xfffc) },
+- { USB_DEVICE(0x19d2, 0xfffb) },
+- /* AC8710_V3 */
++ { USB_DEVICE(0x19d2, 0xffec) },
++ { USB_DEVICE(0x19d2, 0xffee) },
+ { USB_DEVICE(0x19d2, 0xfff6) },
+ { USB_DEVICE(0x19d2, 0xfff7) },
+ { USB_DEVICE(0x19d2, 0xfff8) },
+ { USB_DEVICE(0x19d2, 0xfff9) },
+- { USB_DEVICE(0x19d2, 0xffee) },
+- /* AC2716, MC2716 */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xffed, 0xff, 0xff, 0xff) },
+- /* AD3812 */
+- { USB_DEVICE_AND_INTERFACE_INFO(0x19d2, 0xffeb, 0xff, 0xff, 0xff) },
+- { USB_DEVICE(0x19d2, 0xffec) },
+- { USB_DEVICE(0x05C6, 0x3197) },
+- { USB_DEVICE(0x05C6, 0x6000) },
+- { USB_DEVICE(0x05C6, 0x9008) },
++ { USB_DEVICE(0x19d2, 0xfffb) },
++ { USB_DEVICE(0x19d2, 0xfffc) },
++ /* MG880 */
++ { USB_DEVICE(0x19d2, 0xfffd) },
+ { },
+ };
+ MODULE_DEVICE_TABLE(usb, id_table);
+diff --git a/drivers/usb/storage/unusual_devs.h b/drivers/usb/storage/unusual_devs.h
+index 80a5b366255f..14137ee543a1 100644
+--- a/drivers/usb/storage/unusual_devs.h
++++ b/drivers/usb/storage/unusual_devs.h
+@@ -101,6 +101,12 @@ UNUSUAL_DEV( 0x03f0, 0x4002, 0x0001, 0x0001,
+ "PhotoSmart R707",
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL, US_FL_FIX_CAPACITY),
+
++UNUSUAL_DEV( 0x03f3, 0x0001, 0x0000, 0x9999,
++ "Adaptec",
++ "USBConnect 2000",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Reported by Sebastian Kapfer <sebastian_kapfer@gmx.net>
+ * and Olaf Hering <olh@suse.de> (different bcd's, same vendor/product)
+ * for USB floppies that need the SINGLE_LUN enforcement.
+@@ -741,6 +747,12 @@ UNUSUAL_DEV( 0x059b, 0x0001, 0x0100, 0x0100,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_SINGLE_LUN ),
+
++UNUSUAL_DEV( 0x059b, 0x0040, 0x0100, 0x0100,
++ "Iomega",
++ "Jaz USB Adapter",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_SINGLE_LUN ),
++
+ /* Reported by <Hendryk.Pfeiffer@gmx.de> */
+ UNUSUAL_DEV( 0x059f, 0x0643, 0x0000, 0x0000,
+ "LaCie",
+@@ -1113,6 +1125,18 @@ UNUSUAL_DEV( 0x0851, 0x1543, 0x0200, 0x0200,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_NOT_LOCKABLE),
+
++UNUSUAL_DEV( 0x085a, 0x0026, 0x0100, 0x0133,
++ "Xircom",
++ "PortGear USB-SCSI (Mac USB Dock)",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
++UNUSUAL_DEV( 0x085a, 0x0028, 0x0100, 0x0133,
++ "Xircom",
++ "PortGear USB to SCSI Converter",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Submitted by Jan De Luyck <lkml@kcore.org> */
+ UNUSUAL_DEV( 0x08bd, 0x1100, 0x0000, 0x0000,
+ "CITIZEN",
+@@ -1952,6 +1976,14 @@ UNUSUAL_DEV( 0x152d, 0x2329, 0x0100, 0x0100,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_IGNORE_RESIDUE | US_FL_SANE_SENSE ),
+
++/* Entrega Technologies U1-SC25 (later Xircom PortGear PGSCSI)
++ * and Mac USB Dock USB-SCSI */
++UNUSUAL_DEV( 0x1645, 0x0007, 0x0100, 0x0133,
++ "Entrega Technologies",
++ "USB to SCSI Converter",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Reported by Robert Schedel <r.schedel@yahoo.de>
+ * Note: this is a 'super top' device like the above 14cd/6600 device */
+ UNUSUAL_DEV( 0x1652, 0x6600, 0x0201, 0x0201,
+@@ -1974,6 +2006,12 @@ UNUSUAL_DEV( 0x177f, 0x0400, 0x0000, 0x0000,
+ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+ US_FL_BULK_IGNORE_TAG | US_FL_MAX_SECTORS_64 ),
+
++UNUSUAL_DEV( 0x1822, 0x0001, 0x0000, 0x9999,
++ "Ariston Technologies",
++ "iConnect USB to SCSI adapter",
++ USB_SC_DEVICE, USB_PR_DEVICE, usb_stor_euscsi_init,
++ US_FL_SCM_MULT_TARG ),
++
+ /* Reported by Hans de Goede <hdegoede@redhat.com>
+ * These Appotech controllers are found in Picture Frames, they provide a
+ * (buggy) emulation of a cdrom drive which contains the windows software
+diff --git a/drivers/uwb/lc-dev.c b/drivers/uwb/lc-dev.c
+index 80079b8fed15..d0303f0dbe15 100644
+--- a/drivers/uwb/lc-dev.c
++++ b/drivers/uwb/lc-dev.c
+@@ -431,16 +431,19 @@ void uwbd_dev_onair(struct uwb_rc *rc, struct uwb_beca_e *bce)
+ uwb_dev->mac_addr = *bce->mac_addr;
+ uwb_dev->dev_addr = bce->dev_addr;
+ dev_set_name(&uwb_dev->dev, "%s", macbuf);
++
++ /* plug the beacon cache */
++ bce->uwb_dev = uwb_dev;
++ uwb_dev->bce = bce;
++ uwb_bce_get(bce); /* released in uwb_dev_sys_release() */
++
+ result = uwb_dev_add(uwb_dev, &rc->uwb_dev.dev, rc);
+ if (result < 0) {
+ dev_err(dev, "new device %s: cannot instantiate device\n",
+ macbuf);
+ goto error_dev_add;
+ }
+- /* plug the beacon cache */
+- bce->uwb_dev = uwb_dev;
+- uwb_dev->bce = bce;
+- uwb_bce_get(bce); /* released in uwb_dev_sys_release() */
++
+ dev_info(dev, "uwb device (mac %s dev %s) connected to %s %s\n",
+ macbuf, devbuf, rc->uwb_dev.dev.parent->bus->name,
+ dev_name(rc->uwb_dev.dev.parent));
+@@ -448,6 +451,8 @@ void uwbd_dev_onair(struct uwb_rc *rc, struct uwb_beca_e *bce)
+ return;
+
+ error_dev_add:
++ bce->uwb_dev = NULL;
++ uwb_bce_put(bce);
+ kfree(uwb_dev);
+ return;
+ }
+diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
+index 5f1e1f3cd186..f8bb36f9d9ce 100644
+--- a/drivers/xen/manage.c
++++ b/drivers/xen/manage.c
+@@ -103,16 +103,11 @@ static void do_suspend(void)
+
+ shutting_down = SHUTDOWN_SUSPEND;
+
+-#ifdef CONFIG_PREEMPT
+- /* If the kernel is preemptible, we need to freeze all the processes
+- to prevent them from being in the middle of a pagetable update
+- during suspend. */
+ err = freeze_processes();
+ if (err) {
+ pr_err("%s: freeze failed %d\n", __func__, err);
+ goto out;
+ }
+-#endif
+
+ err = dpm_suspend_start(PMSG_FREEZE);
+ if (err) {
+@@ -157,10 +152,8 @@ out_resume:
+ dpm_resume_end(si.cancelled ? PMSG_THAW : PMSG_RESTORE);
+
+ out_thaw:
+-#ifdef CONFIG_PREEMPT
+ thaw_processes();
+ out:
+-#endif
+ shutting_down = SHUTDOWN_INVALID;
+ }
+ #endif /* CONFIG_HIBERNATE_CALLBACKS */
+diff --git a/fs/aio.c b/fs/aio.c
+index 1c9c5f0a9e2b..d72588a4c935 100644
+--- a/fs/aio.c
++++ b/fs/aio.c
+@@ -141,6 +141,7 @@ struct kioctx {
+
+ struct {
+ unsigned tail;
++ unsigned completed_events;
+ spinlock_t completion_lock;
+ } ____cacheline_aligned_in_smp;
+
+@@ -796,6 +797,9 @@ void exit_aio(struct mm_struct *mm)
+ unsigned i = 0;
+
+ while (1) {
++ struct completion requests_done =
++ COMPLETION_INITIALIZER_ONSTACK(requests_done);
++
+ rcu_read_lock();
+ table = rcu_dereference(mm->ioctx_table);
+
+@@ -823,7 +827,10 @@ void exit_aio(struct mm_struct *mm)
+ */
+ ctx->mmap_size = 0;
+
+- kill_ioctx(mm, ctx, NULL);
++ kill_ioctx(mm, ctx, &requests_done);
++
++ /* Wait until all IO for the context are done. */
++ wait_for_completion(&requests_done);
+ }
+ }
+
+@@ -880,6 +887,68 @@ out:
+ return ret;
+ }
+
++/* refill_reqs_available
++ * Updates the reqs_available reference counts used for tracking the
++ * number of free slots in the completion ring. This can be called
++ * from aio_complete() (to optimistically update reqs_available) or
++ * from aio_get_req() (the we're out of events case). It must be
++ * called holding ctx->completion_lock.
++ */
++static void refill_reqs_available(struct kioctx *ctx, unsigned head,
++ unsigned tail)
++{
++ unsigned events_in_ring, completed;
++
++ /* Clamp head since userland can write to it. */
++ head %= ctx->nr_events;
++ if (head <= tail)
++ events_in_ring = tail - head;
++ else
++ events_in_ring = ctx->nr_events - (head - tail);
++
++ completed = ctx->completed_events;
++ if (events_in_ring < completed)
++ completed -= events_in_ring;
++ else
++ completed = 0;
++
++ if (!completed)
++ return;
++
++ ctx->completed_events -= completed;
++ put_reqs_available(ctx, completed);
++}
++
++/* user_refill_reqs_available
++ * Called to refill reqs_available when aio_get_req() encounters an
++ * out of space in the completion ring.
++ */
++static void user_refill_reqs_available(struct kioctx *ctx)
++{
++ spin_lock_irq(&ctx->completion_lock);
++ if (ctx->completed_events) {
++ struct aio_ring *ring;
++ unsigned head;
++
++ /* Access of ring->head may race with aio_read_events_ring()
++ * here, but that's okay since whether we read the old version
++ * or the new version, and either will be valid. The important
++ * part is that head cannot pass tail since we prevent
++ * aio_complete() from updating tail by holding
++ * ctx->completion_lock. Even if head is invalid, the check
++ * against ctx->completed_events below will make sure we do the
++ * safe/right thing.
++ */
++ ring = kmap_atomic(ctx->ring_pages[0]);
++ head = ring->head;
++ kunmap_atomic(ring);
++
++ refill_reqs_available(ctx, head, ctx->tail);
++ }
++
++ spin_unlock_irq(&ctx->completion_lock);
++}
++
+ /* aio_get_req
+ * Allocate a slot for an aio request.
+ * Returns NULL if no requests are free.
+@@ -888,8 +957,11 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
+ {
+ struct kiocb *req;
+
+- if (!get_reqs_available(ctx))
+- return NULL;
++ if (!get_reqs_available(ctx)) {
++ user_refill_reqs_available(ctx);
++ if (!get_reqs_available(ctx))
++ return NULL;
++ }
+
+ req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
+ if (unlikely(!req))
+@@ -948,8 +1020,8 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
+ struct kioctx *ctx = iocb->ki_ctx;
+ struct aio_ring *ring;
+ struct io_event *ev_page, *event;
++ unsigned tail, pos, head;
+ unsigned long flags;
+- unsigned tail, pos;
+
+ /*
+ * Special case handling for sync iocbs:
+@@ -1010,10 +1082,14 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
+ ctx->tail = tail;
+
+ ring = kmap_atomic(ctx->ring_pages[0]);
++ head = ring->head;
+ ring->tail = tail;
+ kunmap_atomic(ring);
+ flush_dcache_page(ctx->ring_pages[0]);
+
++ ctx->completed_events++;
++ if (ctx->completed_events > 1)
++ refill_reqs_available(ctx, head, tail);
+ spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+ pr_debug("added to ring %p at [%u]\n", iocb, tail);
+@@ -1028,7 +1104,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
+
+ /* everything turned out well, dispose of the aiocb. */
+ kiocb_free(iocb);
+- put_reqs_available(ctx, 1);
+
+ /*
+ * We have to order our ring_info tail store above and test
+@@ -1065,6 +1140,12 @@ static long aio_read_events_ring(struct kioctx *ctx,
+ tail = ring->tail;
+ kunmap_atomic(ring);
+
++ /*
++ * Ensure that once we've read the current tail pointer, that
++ * we also see the events that were stored up to the tail.
++ */
++ smp_rmb();
++
+ pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);
+
+ if (head == tail)
+diff --git a/fs/buffer.c b/fs/buffer.c
+index eba6e4f621ce..36fdceb82635 100644
+--- a/fs/buffer.c
++++ b/fs/buffer.c
+@@ -1029,7 +1029,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
+ bh = page_buffers(page);
+ if (bh->b_size == size) {
+ end_block = init_page_buffers(page, bdev,
+- index << sizebits, size);
++ (sector_t)index << sizebits,
++ size);
+ goto done;
+ }
+ if (!try_to_free_buffers(page))
+@@ -1050,7 +1051,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
+ */
+ spin_lock(&inode->i_mapping->private_lock);
+ link_dev_buffers(page, bh);
+- end_block = init_page_buffers(page, bdev, index << sizebits, size);
++ end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
++ size);
+ spin_unlock(&inode->i_mapping->private_lock);
+ done:
+ ret = (block < end_block) ? 1 : -ENXIO;
+diff --git a/fs/cachefiles/bind.c b/fs/cachefiles/bind.c
+index d749731dc0ee..fbb08e97438d 100644
+--- a/fs/cachefiles/bind.c
++++ b/fs/cachefiles/bind.c
+@@ -50,18 +50,18 @@ int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args)
+ cache->brun_percent < 100);
+
+ if (*args) {
+- pr_err("'bind' command doesn't take an argument");
++ pr_err("'bind' command doesn't take an argument\n");
+ return -EINVAL;
+ }
+
+ if (!cache->rootdirname) {
+- pr_err("No cache directory specified");
++ pr_err("No cache directory specified\n");
+ return -EINVAL;
+ }
+
+ /* don't permit already bound caches to be re-bound */
+ if (test_bit(CACHEFILES_READY, &cache->flags)) {
+- pr_err("Cache already bound");
++ pr_err("Cache already bound\n");
+ return -EBUSY;
+ }
+
+@@ -248,7 +248,7 @@ error_open_root:
+ kmem_cache_free(cachefiles_object_jar, fsdef);
+ error_root_object:
+ cachefiles_end_secure(cache, saved_cred);
+- pr_err("Failed to register: %d", ret);
++ pr_err("Failed to register: %d\n", ret);
+ return ret;
+ }
+
+diff --git a/fs/cachefiles/daemon.c b/fs/cachefiles/daemon.c
+index b078d3081d6c..ce1b115dcc28 100644
+--- a/fs/cachefiles/daemon.c
++++ b/fs/cachefiles/daemon.c
+@@ -315,7 +315,7 @@ static unsigned int cachefiles_daemon_poll(struct file *file,
+ static int cachefiles_daemon_range_error(struct cachefiles_cache *cache,
+ char *args)
+ {
+- pr_err("Free space limits must be in range 0%%<=stop<cull<run<100%%");
++ pr_err("Free space limits must be in range 0%%<=stop<cull<run<100%%\n");
+
+ return -EINVAL;
+ }
+@@ -475,12 +475,12 @@ static int cachefiles_daemon_dir(struct cachefiles_cache *cache, char *args)
+ _enter(",%s", args);
+
+ if (!*args) {
+- pr_err("Empty directory specified");
++ pr_err("Empty directory specified\n");
+ return -EINVAL;
+ }
+
+ if (cache->rootdirname) {
+- pr_err("Second cache directory specified");
++ pr_err("Second cache directory specified\n");
+ return -EEXIST;
+ }
+
+@@ -503,12 +503,12 @@ static int cachefiles_daemon_secctx(struct cachefiles_cache *cache, char *args)
+ _enter(",%s", args);
+
+ if (!*args) {
+- pr_err("Empty security context specified");
++ pr_err("Empty security context specified\n");
+ return -EINVAL;
+ }
+
+ if (cache->secctx) {
+- pr_err("Second security context specified");
++ pr_err("Second security context specified\n");
+ return -EINVAL;
+ }
+
+@@ -531,7 +531,7 @@ static int cachefiles_daemon_tag(struct cachefiles_cache *cache, char *args)
+ _enter(",%s", args);
+
+ if (!*args) {
+- pr_err("Empty tag specified");
++ pr_err("Empty tag specified\n");
+ return -EINVAL;
+ }
+
+@@ -562,12 +562,12 @@ static int cachefiles_daemon_cull(struct cachefiles_cache *cache, char *args)
+ goto inval;
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+- pr_err("cull applied to unready cache");
++ pr_err("cull applied to unready cache\n");
+ return -EIO;
+ }
+
+ if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+- pr_err("cull applied to dead cache");
++ pr_err("cull applied to dead cache\n");
+ return -EIO;
+ }
+
+@@ -587,11 +587,11 @@ static int cachefiles_daemon_cull(struct cachefiles_cache *cache, char *args)
+
+ notdir:
+ path_put(&path);
+- pr_err("cull command requires dirfd to be a directory");
++ pr_err("cull command requires dirfd to be a directory\n");
+ return -ENOTDIR;
+
+ inval:
+- pr_err("cull command requires dirfd and filename");
++ pr_err("cull command requires dirfd and filename\n");
+ return -EINVAL;
+ }
+
+@@ -614,7 +614,7 @@ static int cachefiles_daemon_debug(struct cachefiles_cache *cache, char *args)
+ return 0;
+
+ inval:
+- pr_err("debug command requires mask");
++ pr_err("debug command requires mask\n");
+ return -EINVAL;
+ }
+
+@@ -634,12 +634,12 @@ static int cachefiles_daemon_inuse(struct cachefiles_cache *cache, char *args)
+ goto inval;
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+- pr_err("inuse applied to unready cache");
++ pr_err("inuse applied to unready cache\n");
+ return -EIO;
+ }
+
+ if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+- pr_err("inuse applied to dead cache");
++ pr_err("inuse applied to dead cache\n");
+ return -EIO;
+ }
+
+@@ -659,11 +659,11 @@ static int cachefiles_daemon_inuse(struct cachefiles_cache *cache, char *args)
+
+ notdir:
+ path_put(&path);
+- pr_err("inuse command requires dirfd to be a directory");
++ pr_err("inuse command requires dirfd to be a directory\n");
+ return -ENOTDIR;
+
+ inval:
+- pr_err("inuse command requires dirfd and filename");
++ pr_err("inuse command requires dirfd and filename\n");
+ return -EINVAL;
+ }
+
+diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
+index 3d50998abf57..8c52472d2efa 100644
+--- a/fs/cachefiles/internal.h
++++ b/fs/cachefiles/internal.h
+@@ -255,7 +255,7 @@ extern int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+
+ #define cachefiles_io_error(___cache, FMT, ...) \
+ do { \
+- pr_err("I/O Error: " FMT, ##__VA_ARGS__); \
++ pr_err("I/O Error: " FMT"\n", ##__VA_ARGS__); \
+ fscache_io_error(&(___cache)->cache); \
+ set_bit(CACHEFILES_DEAD, &(___cache)->flags); \
+ } while (0)
+diff --git a/fs/cachefiles/main.c b/fs/cachefiles/main.c
+index 180edfb45f66..711f13d8c2de 100644
+--- a/fs/cachefiles/main.c
++++ b/fs/cachefiles/main.c
+@@ -84,7 +84,7 @@ error_proc:
+ error_object_jar:
+ misc_deregister(&cachefiles_dev);
+ error_dev:
+- pr_err("failed to register: %d", ret);
++ pr_err("failed to register: %d\n", ret);
+ return ret;
+ }
+
+diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
+index 5bf2b41e66d3..55c0acb516d4 100644
+--- a/fs/cachefiles/namei.c
++++ b/fs/cachefiles/namei.c
+@@ -543,7 +543,7 @@ lookup_again:
+ next, next->d_inode, next->d_inode->i_ino);
+
+ } else if (!S_ISDIR(next->d_inode->i_mode)) {
+- pr_err("inode %lu is not a directory",
++ pr_err("inode %lu is not a directory\n",
+ next->d_inode->i_ino);
+ ret = -ENOBUFS;
+ goto error;
+@@ -574,7 +574,7 @@ lookup_again:
+ } else if (!S_ISDIR(next->d_inode->i_mode) &&
+ !S_ISREG(next->d_inode->i_mode)
+ ) {
+- pr_err("inode %lu is not a file or directory",
++ pr_err("inode %lu is not a file or directory\n",
+ next->d_inode->i_ino);
+ ret = -ENOBUFS;
+ goto error;
+@@ -768,7 +768,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+ ASSERT(subdir->d_inode);
+
+ if (!S_ISDIR(subdir->d_inode->i_mode)) {
+- pr_err("%s is not a directory", dirname);
++ pr_err("%s is not a directory\n", dirname);
+ ret = -EIO;
+ goto check_error;
+ }
+@@ -795,13 +795,13 @@ check_error:
+ mkdir_error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(subdir);
+- pr_err("mkdir %s failed with error %d", dirname, ret);
++ pr_err("mkdir %s failed with error %d\n", dirname, ret);
+ return ERR_PTR(ret);
+
+ lookup_error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ ret = PTR_ERR(subdir);
+- pr_err("Lookup %s failed with error %d", dirname, ret);
++ pr_err("Lookup %s failed with error %d\n", dirname, ret);
+ return ERR_PTR(ret);
+
+ nomem_d_alloc:
+@@ -891,7 +891,7 @@ lookup_error:
+ if (ret == -EIO) {
+ cachefiles_io_error(cache, "Lookup failed");
+ } else if (ret != -ENOMEM) {
+- pr_err("Internal error: %d", ret);
++ pr_err("Internal error: %d\n", ret);
+ ret = -EIO;
+ }
+
+@@ -950,7 +950,7 @@ error:
+ }
+
+ if (ret != -ENOMEM) {
+- pr_err("Internal error: %d", ret);
++ pr_err("Internal error: %d\n", ret);
+ ret = -EIO;
+ }
+
+diff --git a/fs/cachefiles/xattr.c b/fs/cachefiles/xattr.c
+index 1ad51ffbb275..acbc1f094fb1 100644
+--- a/fs/cachefiles/xattr.c
++++ b/fs/cachefiles/xattr.c
+@@ -51,7 +51,7 @@ int cachefiles_check_object_type(struct cachefiles_object *object)
+ }
+
+ if (ret != -EEXIST) {
+- pr_err("Can't set xattr on %*.*s [%lu] (err %d)",
++ pr_err("Can't set xattr on %*.*s [%lu] (err %d)\n",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ -ret);
+@@ -64,7 +64,7 @@ int cachefiles_check_object_type(struct cachefiles_object *object)
+ if (ret == -ERANGE)
+ goto bad_type_length;
+
+- pr_err("Can't read xattr on %*.*s [%lu] (err %d)",
++ pr_err("Can't read xattr on %*.*s [%lu] (err %d)\n",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ -ret);
+@@ -85,14 +85,14 @@ error:
+ return ret;
+
+ bad_type_length:
+- pr_err("Cache object %lu type xattr length incorrect",
++ pr_err("Cache object %lu type xattr length incorrect\n",
+ dentry->d_inode->i_ino);
+ ret = -EIO;
+ goto error;
+
+ bad_type:
+ xtype[2] = 0;
+- pr_err("Cache object %*.*s [%lu] type %s not %s",
++ pr_err("Cache object %*.*s [%lu] type %s not %s\n",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ xtype, type);
+@@ -293,7 +293,7 @@ error:
+ return ret;
+
+ bad_type_length:
+- pr_err("Cache object %lu xattr length incorrect",
++ pr_err("Cache object %lu xattr length incorrect\n",
+ dentry->d_inode->i_ino);
+ ret = -EIO;
+ goto error;
+diff --git a/fs/cifs/link.c b/fs/cifs/link.c
+index 68559fd557fb..a5c2812ead68 100644
+--- a/fs/cifs/link.c
++++ b/fs/cifs/link.c
+@@ -213,8 +213,12 @@ create_mf_symlink(const unsigned int xid, struct cifs_tcon *tcon,
+ if (rc)
+ goto out;
+
+- rc = tcon->ses->server->ops->create_mf_symlink(xid, tcon, cifs_sb,
+- fromName, buf, &bytes_written);
++ if (tcon->ses->server->ops->create_mf_symlink)
++ rc = tcon->ses->server->ops->create_mf_symlink(xid, tcon,
++ cifs_sb, fromName, buf, &bytes_written);
++ else
++ rc = -EOPNOTSUPP;
++
+ if (rc)
+ goto out;
+
+diff --git a/fs/eventpoll.c b/fs/eventpoll.c
+index b10b48c2a7af..7bcfff900f05 100644
+--- a/fs/eventpoll.c
++++ b/fs/eventpoll.c
+@@ -1852,7 +1852,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
+ goto error_tgt_fput;
+
+ /* Check if EPOLLWAKEUP is allowed */
+- ep_take_care_of_epollwakeup(&epds);
++ if (ep_op_has_event(op))
++ ep_take_care_of_epollwakeup(&epds);
+
+ /*
+ * We have to check that the file structure underneath the file descriptor
+diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
+index 1bbe7c315138..b6874405f0dc 100644
+--- a/fs/ext4/ext4.h
++++ b/fs/ext4/ext4.h
+@@ -1826,7 +1826,7 @@ ext4_group_first_block_no(struct super_block *sb, ext4_group_t group_no)
+ /*
+ * Special error return code only used by dx_probe() and its callers.
+ */
+-#define ERR_BAD_DX_DIR -75000
++#define ERR_BAD_DX_DIR (-(MAX_ERRNO - 1))
+
+ /*
+ * Timeout and state flag for lazy initialization inode thread.
+diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
+index 9e6eced1605b..5e127be91bb6 100644
+--- a/fs/ext4/namei.c
++++ b/fs/ext4/namei.c
+@@ -1227,7 +1227,7 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
+ buffer */
+ int num = 0;
+ ext4_lblk_t nblocks;
+- int i, err;
++ int i, err = 0;
+ int namelen;
+
+ *res_dir = NULL;
+@@ -1264,7 +1264,11 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
+ * return. Otherwise, fall back to doing a search the
+ * old fashioned way.
+ */
+- if (bh || (err != ERR_BAD_DX_DIR))
++ if (err == -ENOENT)
++ return NULL;
++ if (err && err != ERR_BAD_DX_DIR)
++ return ERR_PTR(err);
++ if (bh)
+ return bh;
+ dxtrace(printk(KERN_DEBUG "ext4_find_entry: dx failed, "
+ "falling back\n"));
+@@ -1295,6 +1299,11 @@ restart:
+ }
+ num++;
+ bh = ext4_getblk(NULL, dir, b++, 0, &err);
++ if (unlikely(err)) {
++ if (ra_max == 0)
++ return ERR_PTR(err);
++ break;
++ }
+ bh_use[ra_max] = bh;
+ if (bh)
+ ll_rw_block(READ | REQ_META | REQ_PRIO,
+@@ -1417,6 +1426,8 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi
+ return ERR_PTR(-ENAMETOOLONG);
+
+ bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return (struct dentry *) bh;
+ inode = NULL;
+ if (bh) {
+ __u32 ino = le32_to_cpu(de->inode);
+@@ -1450,6 +1461,8 @@ struct dentry *ext4_get_parent(struct dentry *child)
+ struct buffer_head *bh;
+
+ bh = ext4_find_entry(child->d_inode, &dotdot, &de, NULL);
++ if (IS_ERR(bh))
++ return (struct dentry *) bh;
+ if (!bh)
+ return ERR_PTR(-ENOENT);
+ ino = le32_to_cpu(de->inode);
+@@ -2727,6 +2740,8 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
+
+ retval = -ENOENT;
+ bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return PTR_ERR(bh);
+ if (!bh)
+ goto end_rmdir;
+
+@@ -2794,6 +2809,8 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
+
+ retval = -ENOENT;
+ bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return PTR_ERR(bh);
+ if (!bh)
+ goto end_unlink;
+
+@@ -3121,6 +3138,8 @@ static int ext4_find_delete_entry(handle_t *handle, struct inode *dir,
+ struct ext4_dir_entry_2 *de;
+
+ bh = ext4_find_entry(dir, d_name, &de, NULL);
++ if (IS_ERR(bh))
++ return PTR_ERR(bh);
+ if (bh) {
+ retval = ext4_delete_entry(handle, dir, de, bh);
+ brelse(bh);
+@@ -3205,6 +3224,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+ dquot_initialize(new.inode);
+
+ old.bh = ext4_find_entry(old.dir, &old.dentry->d_name, &old.de, NULL);
++ if (IS_ERR(old.bh))
++ return PTR_ERR(old.bh);
+ /*
+ * Check for inode number is _not_ due to possible IO errors.
+ * We might rmdir the source, keep it as pwd of some process
+@@ -3217,6 +3238,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
+
+ new.bh = ext4_find_entry(new.dir, &new.dentry->d_name,
+ &new.de, &new.inlined);
++ if (IS_ERR(new.bh)) {
++ retval = PTR_ERR(new.bh);
++ new.bh = NULL;
++ goto end_rename;
++ }
+ if (new.bh) {
+ if (!new.inode) {
+ brelse(new.bh);
+@@ -3345,6 +3371,8 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
+
+ old.bh = ext4_find_entry(old.dir, &old.dentry->d_name,
+ &old.de, &old.inlined);
++ if (IS_ERR(old.bh))
++ return PTR_ERR(old.bh);
+ /*
+ * Check for inode number is _not_ due to possible IO errors.
+ * We might rmdir the source, keep it as pwd of some process
+@@ -3357,6 +3385,11 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
+
+ new.bh = ext4_find_entry(new.dir, &new.dentry->d_name,
+ &new.de, &new.inlined);
++ if (IS_ERR(new.bh)) {
++ retval = PTR_ERR(new.bh);
++ new.bh = NULL;
++ goto end_rename;
++ }
+
+ /* RENAME_EXCHANGE case: old *and* new must both exist */
+ if (!new.bh || le32_to_cpu(new.de->inode) != new.inode->i_ino)
+diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
+index bb0e80f03e2e..1e43b905ff98 100644
+--- a/fs/ext4/resize.c
++++ b/fs/ext4/resize.c
+@@ -575,6 +575,7 @@ handle_bb:
+ bh = bclean(handle, sb, block);
+ if (IS_ERR(bh)) {
+ err = PTR_ERR(bh);
++ bh = NULL;
+ goto out;
+ }
+ overhead = ext4_group_overhead_blocks(sb, group);
+@@ -603,6 +604,7 @@ handle_ib:
+ bh = bclean(handle, sb, block);
+ if (IS_ERR(bh)) {
+ err = PTR_ERR(bh);
++ bh = NULL;
+ goto out;
+ }
+
+diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
+index e62e59477884..9c1a680ee468 100644
+--- a/fs/gfs2/inode.c
++++ b/fs/gfs2/inode.c
+@@ -626,8 +626,10 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
+ if (!IS_ERR(inode)) {
+ d = d_splice_alias(inode, dentry);
+ error = PTR_ERR(d);
+- if (IS_ERR(d))
++ if (IS_ERR(d)) {
++ inode = ERR_CAST(d);
+ goto fail_gunlock;
++ }
+ error = 0;
+ if (file) {
+ if (S_ISREG(inode->i_mode)) {
+@@ -854,7 +856,6 @@ static struct dentry *__gfs2_lookup(struct inode *dir, struct dentry *dentry,
+
+ d = d_splice_alias(inode, dentry);
+ if (IS_ERR(d)) {
+- iput(inode);
+ gfs2_glock_dq_uninit(&gh);
+ return d;
+ }
+diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
+index 8f27c93f8d2e..ec9e082f9ecd 100644
+--- a/fs/lockd/svc.c
++++ b/fs/lockd/svc.c
+@@ -253,13 +253,11 @@ static int lockd_up_net(struct svc_serv *serv, struct net *net)
+
+ error = make_socks(serv, net);
+ if (error < 0)
+- goto err_socks;
++ goto err_bind;
+ set_grace_period(net);
+ dprintk("lockd_up_net: per-net data created; net=%p\n", net);
+ return 0;
+
+-err_socks:
+- svc_rpcb_cleanup(serv, net);
+ err_bind:
+ ln->nlmsvc_users--;
+ return error;
+diff --git a/fs/locks.c b/fs/locks.c
+index 717fbc404e6b..be530f9b13ce 100644
+--- a/fs/locks.c
++++ b/fs/locks.c
+@@ -1595,7 +1595,7 @@ static int generic_add_lease(struct file *filp, long arg, struct file_lock **flp
+ smp_mb();
+ error = check_conflicting_open(dentry, arg);
+ if (error)
+- locks_unlink_lock(flp);
++ locks_unlink_lock(before);
+ out:
+ if (is_deleg)
+ mutex_unlock(&inode->i_mutex);
+diff --git a/fs/namei.c b/fs/namei.c
+index 17ca8b85c308..d4ca42085e1d 100644
+--- a/fs/namei.c
++++ b/fs/namei.c
+@@ -644,24 +644,22 @@ static int complete_walk(struct nameidata *nd)
+
+ static __always_inline void set_root(struct nameidata *nd)
+ {
+- if (!nd->root.mnt)
+- get_fs_root(current->fs, &nd->root);
++ get_fs_root(current->fs, &nd->root);
+ }
+
+ static int link_path_walk(const char *, struct nameidata *);
+
+-static __always_inline void set_root_rcu(struct nameidata *nd)
++static __always_inline unsigned set_root_rcu(struct nameidata *nd)
+ {
+- if (!nd->root.mnt) {
+- struct fs_struct *fs = current->fs;
+- unsigned seq;
++ struct fs_struct *fs = current->fs;
++ unsigned seq, res;
+
+- do {
+- seq = read_seqcount_begin(&fs->seq);
+- nd->root = fs->root;
+- nd->seq = __read_seqcount_begin(&nd->root.dentry->d_seq);
+- } while (read_seqcount_retry(&fs->seq, seq));
+- }
++ do {
++ seq = read_seqcount_begin(&fs->seq);
++ nd->root = fs->root;
++ res = __read_seqcount_begin(&nd->root.dentry->d_seq);
++ } while (read_seqcount_retry(&fs->seq, seq));
++ return res;
+ }
+
+ static void path_put_conditional(struct path *path, struct nameidata *nd)
+@@ -861,7 +859,8 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
+ return PTR_ERR(s);
+ }
+ if (*s == '/') {
+- set_root(nd);
++ if (!nd->root.mnt)
++ set_root(nd);
+ path_put(&nd->path);
+ nd->path = nd->root;
+ path_get(&nd->root);
+@@ -1136,7 +1135,8 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
+
+ static int follow_dotdot_rcu(struct nameidata *nd)
+ {
+- set_root_rcu(nd);
++ if (!nd->root.mnt)
++ set_root_rcu(nd);
+
+ while (1) {
+ if (nd->path.dentry == nd->root.dentry &&
+@@ -1249,7 +1249,8 @@ static void follow_mount(struct path *path)
+
+ static void follow_dotdot(struct nameidata *nd)
+ {
+- set_root(nd);
++ if (!nd->root.mnt)
++ set_root(nd);
+
+ while(1) {
+ struct dentry *old = nd->path.dentry;
+@@ -1847,7 +1848,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
+ if (*name=='/') {
+ if (flags & LOOKUP_RCU) {
+ rcu_read_lock();
+- set_root_rcu(nd);
++ nd->seq = set_root_rcu(nd);
+ } else {
+ set_root(nd);
+ path_get(&nd->root);
+diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
+index 9b431f44fad9..c3ccfe440390 100644
+--- a/fs/nfs/blocklayout/blocklayout.c
++++ b/fs/nfs/blocklayout/blocklayout.c
+@@ -210,8 +210,7 @@ static void bl_end_io_read(struct bio *bio, int err)
+ SetPageUptodate(bvec->bv_page);
+
+ if (err) {
+- struct nfs_pgio_data *rdata = par->data;
+- struct nfs_pgio_header *header = rdata->header;
++ struct nfs_pgio_header *header = par->data;
+
+ if (!header->pnfs_error)
+ header->pnfs_error = -EIO;
+@@ -224,43 +223,44 @@ static void bl_end_io_read(struct bio *bio, int err)
+ static void bl_read_cleanup(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *rdata;
++ struct nfs_pgio_header *hdr;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- rdata = container_of(task, struct nfs_pgio_data, task);
+- pnfs_ld_read_done(rdata);
++ hdr = container_of(task, struct nfs_pgio_header, task);
++ pnfs_ld_read_done(hdr);
+ }
+
+ static void
+ bl_end_par_io_read(void *data, int unused)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- rdata->task.tk_status = rdata->header->pnfs_error;
+- INIT_WORK(&rdata->task.u.tk_work, bl_read_cleanup);
+- schedule_work(&rdata->task.u.tk_work);
++ hdr->task.tk_status = hdr->pnfs_error;
++ INIT_WORK(&hdr->task.u.tk_work, bl_read_cleanup);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+
+ static enum pnfs_try_status
+-bl_read_pagelist(struct nfs_pgio_data *rdata)
++bl_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *header = rdata->header;
++ struct nfs_pgio_header *header = hdr;
+ int i, hole;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect, extent_length = 0;
+ struct parallel_io *par;
+- loff_t f_offset = rdata->args.offset;
+- size_t bytes_left = rdata->args.count;
++ loff_t f_offset = hdr->args.offset;
++ size_t bytes_left = hdr->args.count;
+ unsigned int pg_offset, pg_len;
+- struct page **pages = rdata->args.pages;
+- int pg_index = rdata->args.pgbase >> PAGE_CACHE_SHIFT;
++ struct page **pages = hdr->args.pages;
++ int pg_index = hdr->args.pgbase >> PAGE_CACHE_SHIFT;
+ const bool is_dio = (header->dreq != NULL);
+
+ dprintk("%s enter nr_pages %u offset %lld count %u\n", __func__,
+- rdata->pages.npages, f_offset, (unsigned int)rdata->args.count);
++ hdr->page_array.npages, f_offset,
++ (unsigned int)hdr->args.count);
+
+- par = alloc_parallel(rdata);
++ par = alloc_parallel(hdr);
+ if (!par)
+ goto use_mds;
+ par->pnfs_callback = bl_end_par_io_read;
+@@ -268,7 +268,7 @@ bl_read_pagelist(struct nfs_pgio_data *rdata)
+
+ isect = (sector_t) (f_offset >> SECTOR_SHIFT);
+ /* Code assumes extents are page-aligned */
+- for (i = pg_index; i < rdata->pages.npages; i++) {
++ for (i = pg_index; i < hdr->page_array.npages; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ bl_put_extent(be);
+@@ -317,7 +317,8 @@ bl_read_pagelist(struct nfs_pgio_data *rdata)
+ struct pnfs_block_extent *be_read;
+
+ be_read = (hole && cow_read) ? cow_read : be;
+- bio = do_add_page_to_bio(bio, rdata->pages.npages - i,
++ bio = do_add_page_to_bio(bio,
++ hdr->page_array.npages - i,
+ READ,
+ isect, pages[i], be_read,
+ bl_end_io_read, par,
+@@ -332,10 +333,10 @@ bl_read_pagelist(struct nfs_pgio_data *rdata)
+ extent_length -= PAGE_CACHE_SECTORS;
+ }
+ if ((isect << SECTOR_SHIFT) >= header->inode->i_size) {
+- rdata->res.eof = 1;
+- rdata->res.count = header->inode->i_size - rdata->args.offset;
++ hdr->res.eof = 1;
++ hdr->res.count = header->inode->i_size - hdr->args.offset;
+ } else {
+- rdata->res.count = (isect << SECTOR_SHIFT) - rdata->args.offset;
++ hdr->res.count = (isect << SECTOR_SHIFT) - hdr->args.offset;
+ }
+ out:
+ bl_put_extent(be);
+@@ -390,8 +391,7 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
+ }
+
+ if (unlikely(err)) {
+- struct nfs_pgio_data *data = par->data;
+- struct nfs_pgio_header *header = data->header;
++ struct nfs_pgio_header *header = par->data;
+
+ if (!header->pnfs_error)
+ header->pnfs_error = -EIO;
+@@ -405,8 +405,7 @@ static void bl_end_io_write(struct bio *bio, int err)
+ {
+ struct parallel_io *par = bio->bi_private;
+ const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+- struct nfs_pgio_data *data = par->data;
+- struct nfs_pgio_header *header = data->header;
++ struct nfs_pgio_header *header = par->data;
+
+ if (!uptodate) {
+ if (!header->pnfs_error)
+@@ -423,32 +422,32 @@ static void bl_end_io_write(struct bio *bio, int err)
+ static void bl_write_cleanup(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *wdata;
++ struct nfs_pgio_header *hdr;
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- wdata = container_of(task, struct nfs_pgio_data, task);
+- if (likely(!wdata->header->pnfs_error)) {
++ hdr = container_of(task, struct nfs_pgio_header, task);
++ if (likely(!hdr->pnfs_error)) {
+ /* Marks for LAYOUTCOMMIT */
+- mark_extents_written(BLK_LSEG2EXT(wdata->header->lseg),
+- wdata->args.offset, wdata->args.count);
++ mark_extents_written(BLK_LSEG2EXT(hdr->lseg),
++ hdr->args.offset, hdr->args.count);
+ }
+- pnfs_ld_write_done(wdata);
++ pnfs_ld_write_done(hdr);
+ }
+
+ /* Called when last of bios associated with a bl_write_pagelist call finishes */
+ static void bl_end_par_io_write(void *data, int num_se)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (unlikely(wdata->header->pnfs_error)) {
+- bl_free_short_extents(&BLK_LSEG2EXT(wdata->header->lseg)->bl_inval,
++ if (unlikely(hdr->pnfs_error)) {
++ bl_free_short_extents(&BLK_LSEG2EXT(hdr->lseg)->bl_inval,
+ num_se);
+ }
+
+- wdata->task.tk_status = wdata->header->pnfs_error;
+- wdata->verf.committed = NFS_FILE_SYNC;
+- INIT_WORK(&wdata->task.u.tk_work, bl_write_cleanup);
+- schedule_work(&wdata->task.u.tk_work);
++ hdr->task.tk_status = hdr->pnfs_error;
++ hdr->writeverf.committed = NFS_FILE_SYNC;
++ INIT_WORK(&hdr->task.u.tk_work, bl_write_cleanup);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+
+ /* FIXME STUB - mark intersection of layout and page as bad, so is not
+@@ -673,18 +672,17 @@ check_page:
+ }
+
+ static enum pnfs_try_status
+-bl_write_pagelist(struct nfs_pgio_data *wdata, int sync)
++bl_write_pagelist(struct nfs_pgio_header *header, int sync)
+ {
+- struct nfs_pgio_header *header = wdata->header;
+ int i, ret, npg_zero, pg_index, last = 0;
+ struct bio *bio = NULL;
+ struct pnfs_block_extent *be = NULL, *cow_read = NULL;
+ sector_t isect, last_isect = 0, extent_length = 0;
+ struct parallel_io *par = NULL;
+- loff_t offset = wdata->args.offset;
+- size_t count = wdata->args.count;
++ loff_t offset = header->args.offset;
++ size_t count = header->args.count;
+ unsigned int pg_offset, pg_len, saved_len;
+- struct page **pages = wdata->args.pages;
++ struct page **pages = header->args.pages;
+ struct page *page;
+ pgoff_t index;
+ u64 temp;
+@@ -699,11 +697,11 @@ bl_write_pagelist(struct nfs_pgio_data *wdata, int sync)
+ dprintk("pnfsblock nonblock aligned DIO writes. Resend MDS\n");
+ goto out_mds;
+ }
+- /* At this point, wdata->pages is a (sequential) list of nfs_pages.
++ /* At this point, header->page_array is a (sequential) list of nfs_pages.
+ * We want to write each, and if there is an error set pnfs_error
+ * to have it redone using nfs.
+ */
+- par = alloc_parallel(wdata);
++ par = alloc_parallel(header);
+ if (!par)
+ goto out_mds;
+ par->pnfs_callback = bl_end_par_io_write;
+@@ -790,8 +788,8 @@ next_page:
+ bio = bl_submit_bio(WRITE, bio);
+
+ /* Middle pages */
+- pg_index = wdata->args.pgbase >> PAGE_CACHE_SHIFT;
+- for (i = pg_index; i < wdata->pages.npages; i++) {
++ pg_index = header->args.pgbase >> PAGE_CACHE_SHIFT;
++ for (i = pg_index; i < header->page_array.npages; i++) {
+ if (!extent_length) {
+ /* We've used up the previous extent */
+ bl_put_extent(be);
+@@ -862,7 +860,8 @@ next_page:
+ }
+
+
+- bio = do_add_page_to_bio(bio, wdata->pages.npages - i, WRITE,
++ bio = do_add_page_to_bio(bio, header->page_array.npages - i,
++ WRITE,
+ isect, pages[i], be,
+ bl_end_io_write, par,
+ pg_offset, pg_len);
+@@ -890,7 +889,7 @@ next_page:
+ }
+
+ write_done:
+- wdata->res.count = wdata->args.count;
++ header->res.count = header->args.count;
+ out:
+ bl_put_extent(be);
+ bl_put_extent(cow_read);
+diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
+index f11b9eed0de1..1b34eeb0d8de 100644
+--- a/fs/nfs/direct.c
++++ b/fs/nfs/direct.c
+@@ -148,8 +148,8 @@ static void nfs_direct_set_hdr_verf(struct nfs_direct_req *dreq,
+ {
+ struct nfs_writeverf *verfp;
+
+- verfp = nfs_direct_select_verf(dreq, hdr->data->ds_clp,
+- hdr->data->ds_idx);
++ verfp = nfs_direct_select_verf(dreq, hdr->ds_clp,
++ hdr->ds_idx);
+ WARN_ON_ONCE(verfp->committed >= 0);
+ memcpy(verfp, &hdr->verf, sizeof(struct nfs_writeverf));
+ WARN_ON_ONCE(verfp->committed < 0);
+@@ -169,8 +169,8 @@ static int nfs_direct_set_or_cmp_hdr_verf(struct nfs_direct_req *dreq,
+ {
+ struct nfs_writeverf *verfp;
+
+- verfp = nfs_direct_select_verf(dreq, hdr->data->ds_clp,
+- hdr->data->ds_idx);
++ verfp = nfs_direct_select_verf(dreq, hdr->ds_clp,
++ hdr->ds_idx);
+ if (verfp->committed < 0) {
+ nfs_direct_set_hdr_verf(dreq, hdr);
+ return 0;
+diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
+index d2eba1c13b7e..a596a1938b52 100644
+--- a/fs/nfs/filelayout/filelayout.c
++++ b/fs/nfs/filelayout/filelayout.c
+@@ -84,19 +84,18 @@ filelayout_get_dserver_offset(struct pnfs_layout_segment *lseg, loff_t offset)
+ BUG();
+ }
+
+-static void filelayout_reset_write(struct nfs_pgio_data *data)
++static void filelayout_reset_write(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- struct rpc_task *task = &data->task;
++ struct rpc_task *task = &hdr->task;
+
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ dprintk("%s Reset task %5u for i/o through MDS "
+ "(req %s/%llu, %u bytes @ offset %llu)\n", __func__,
+- data->task.tk_pid,
++ hdr->task.tk_pid,
+ hdr->inode->i_sb->s_id,
+ (unsigned long long)NFS_FILEID(hdr->inode),
+- data->args.count,
+- (unsigned long long)data->args.offset);
++ hdr->args.count,
++ (unsigned long long)hdr->args.offset);
+
+ task->tk_status = pnfs_write_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+@@ -105,19 +104,18 @@ static void filelayout_reset_write(struct nfs_pgio_data *data)
+ }
+ }
+
+-static void filelayout_reset_read(struct nfs_pgio_data *data)
++static void filelayout_reset_read(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- struct rpc_task *task = &data->task;
++ struct rpc_task *task = &hdr->task;
+
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ dprintk("%s Reset task %5u for i/o through MDS "
+ "(req %s/%llu, %u bytes @ offset %llu)\n", __func__,
+- data->task.tk_pid,
++ hdr->task.tk_pid,
+ hdr->inode->i_sb->s_id,
+ (unsigned long long)NFS_FILEID(hdr->inode),
+- data->args.count,
+- (unsigned long long)data->args.offset);
++ hdr->args.count,
++ (unsigned long long)hdr->args.offset);
+
+ task->tk_status = pnfs_read_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+@@ -243,18 +241,17 @@ wait_on_recovery:
+ /* NFS_PROTO call done callback routines */
+
+ static int filelayout_read_done_cb(struct rpc_task *task,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ int err;
+
+- trace_nfs4_pnfs_read(data, task->tk_status);
+- err = filelayout_async_handle_error(task, data->args.context->state,
+- data->ds_clp, hdr->lseg);
++ trace_nfs4_pnfs_read(hdr, task->tk_status);
++ err = filelayout_async_handle_error(task, hdr->args.context->state,
++ hdr->ds_clp, hdr->lseg);
+
+ switch (err) {
+ case -NFS4ERR_RESET_TO_MDS:
+- filelayout_reset_read(data);
++ filelayout_reset_read(hdr);
+ return task->tk_status;
+ case -EAGAIN:
+ rpc_restart_call_prepare(task);
+@@ -270,15 +267,14 @@ static int filelayout_read_done_cb(struct rpc_task *task,
+ * rfc5661 is not clear about which credential should be used.
+ */
+ static void
+-filelayout_set_layoutcommit(struct nfs_pgio_data *wdata)
++filelayout_set_layoutcommit(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+
+ if (FILELAYOUT_LSEG(hdr->lseg)->commit_through_mds ||
+- wdata->res.verf->committed == NFS_FILE_SYNC)
++ hdr->res.verf->committed == NFS_FILE_SYNC)
+ return;
+
+- pnfs_set_layoutcommit(wdata);
++ pnfs_set_layoutcommit(hdr);
+ dprintk("%s inode %lu pls_end_pos %lu\n", __func__, hdr->inode->i_ino,
+ (unsigned long) NFS_I(hdr->inode)->layout->plh_lwb);
+ }
+@@ -305,83 +301,82 @@ filelayout_reset_to_mds(struct pnfs_layout_segment *lseg)
+ */
+ static void filelayout_read_prepare(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (unlikely(test_bit(NFS_CONTEXT_BAD, &rdata->args.context->flags))) {
++ if (unlikely(test_bit(NFS_CONTEXT_BAD, &hdr->args.context->flags))) {
+ rpc_exit(task, -EIO);
+ return;
+ }
+- if (filelayout_reset_to_mds(rdata->header->lseg)) {
++ if (filelayout_reset_to_mds(hdr->lseg)) {
+ dprintk("%s task %u reset io to MDS\n", __func__, task->tk_pid);
+- filelayout_reset_read(rdata);
++ filelayout_reset_read(hdr);
+ rpc_exit(task, 0);
+ return;
+ }
+- rdata->pgio_done_cb = filelayout_read_done_cb;
++ hdr->pgio_done_cb = filelayout_read_done_cb;
+
+- if (nfs41_setup_sequence(rdata->ds_clp->cl_session,
+- &rdata->args.seq_args,
+- &rdata->res.seq_res,
++ if (nfs41_setup_sequence(hdr->ds_clp->cl_session,
++ &hdr->args.seq_args,
++ &hdr->res.seq_res,
+ task))
+ return;
+- if (nfs4_set_rw_stateid(&rdata->args.stateid, rdata->args.context,
+- rdata->args.lock_context, FMODE_READ) == -EIO)
++ if (nfs4_set_rw_stateid(&hdr->args.stateid, hdr->args.context,
++ hdr->args.lock_context, FMODE_READ) == -EIO)
+ rpc_exit(task, -EIO); /* lost lock, terminate I/O */
+ }
+
+ static void filelayout_read_call_done(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+ dprintk("--> %s task->tk_status %d\n", __func__, task->tk_status);
+
+- if (test_bit(NFS_IOHDR_REDO, &rdata->header->flags) &&
++ if (test_bit(NFS_IOHDR_REDO, &hdr->flags) &&
+ task->tk_status == 0) {
+- nfs41_sequence_done(task, &rdata->res.seq_res);
++ nfs41_sequence_done(task, &hdr->res.seq_res);
+ return;
+ }
+
+ /* Note this may cause RPC to be resent */
+- rdata->header->mds_ops->rpc_call_done(task, data);
++ hdr->mds_ops->rpc_call_done(task, data);
+ }
+
+ static void filelayout_read_count_stats(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- rpc_count_iostats(task, NFS_SERVER(rdata->header->inode)->client->cl_metrics);
++ rpc_count_iostats(task, NFS_SERVER(hdr->inode)->client->cl_metrics);
+ }
+
+ static void filelayout_read_release(void *data)
+ {
+- struct nfs_pgio_data *rdata = data;
+- struct pnfs_layout_hdr *lo = rdata->header->lseg->pls_layout;
++ struct nfs_pgio_header *hdr = data;
++ struct pnfs_layout_hdr *lo = hdr->lseg->pls_layout;
+
+ filelayout_fenceme(lo->plh_inode, lo);
+- nfs_put_client(rdata->ds_clp);
+- rdata->header->mds_ops->rpc_release(data);
++ nfs_put_client(hdr->ds_clp);
++ hdr->mds_ops->rpc_release(data);
+ }
+
+ static int filelayout_write_done_cb(struct rpc_task *task,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ int err;
+
+- trace_nfs4_pnfs_write(data, task->tk_status);
+- err = filelayout_async_handle_error(task, data->args.context->state,
+- data->ds_clp, hdr->lseg);
++ trace_nfs4_pnfs_write(hdr, task->tk_status);
++ err = filelayout_async_handle_error(task, hdr->args.context->state,
++ hdr->ds_clp, hdr->lseg);
+
+ switch (err) {
+ case -NFS4ERR_RESET_TO_MDS:
+- filelayout_reset_write(data);
++ filelayout_reset_write(hdr);
+ return task->tk_status;
+ case -EAGAIN:
+ rpc_restart_call_prepare(task);
+ return -EAGAIN;
+ }
+
+- filelayout_set_layoutcommit(data);
++ filelayout_set_layoutcommit(hdr);
+ return 0;
+ }
+
+@@ -419,57 +414,57 @@ static int filelayout_commit_done_cb(struct rpc_task *task,
+
+ static void filelayout_write_prepare(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (unlikely(test_bit(NFS_CONTEXT_BAD, &wdata->args.context->flags))) {
++ if (unlikely(test_bit(NFS_CONTEXT_BAD, &hdr->args.context->flags))) {
+ rpc_exit(task, -EIO);
+ return;
+ }
+- if (filelayout_reset_to_mds(wdata->header->lseg)) {
++ if (filelayout_reset_to_mds(hdr->lseg)) {
+ dprintk("%s task %u reset io to MDS\n", __func__, task->tk_pid);
+- filelayout_reset_write(wdata);
++ filelayout_reset_write(hdr);
+ rpc_exit(task, 0);
+ return;
+ }
+- if (nfs41_setup_sequence(wdata->ds_clp->cl_session,
+- &wdata->args.seq_args,
+- &wdata->res.seq_res,
++ if (nfs41_setup_sequence(hdr->ds_clp->cl_session,
++ &hdr->args.seq_args,
++ &hdr->res.seq_res,
+ task))
+ return;
+- if (nfs4_set_rw_stateid(&wdata->args.stateid, wdata->args.context,
+- wdata->args.lock_context, FMODE_WRITE) == -EIO)
++ if (nfs4_set_rw_stateid(&hdr->args.stateid, hdr->args.context,
++ hdr->args.lock_context, FMODE_WRITE) == -EIO)
+ rpc_exit(task, -EIO); /* lost lock, terminate I/O */
+ }
+
+ static void filelayout_write_call_done(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- if (test_bit(NFS_IOHDR_REDO, &wdata->header->flags) &&
++ if (test_bit(NFS_IOHDR_REDO, &hdr->flags) &&
+ task->tk_status == 0) {
+- nfs41_sequence_done(task, &wdata->res.seq_res);
++ nfs41_sequence_done(task, &hdr->res.seq_res);
+ return;
+ }
+
+ /* Note this may cause RPC to be resent */
+- wdata->header->mds_ops->rpc_call_done(task, data);
++ hdr->mds_ops->rpc_call_done(task, data);
+ }
+
+ static void filelayout_write_count_stats(struct rpc_task *task, void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
++ struct nfs_pgio_header *hdr = data;
+
+- rpc_count_iostats(task, NFS_SERVER(wdata->header->inode)->client->cl_metrics);
++ rpc_count_iostats(task, NFS_SERVER(hdr->inode)->client->cl_metrics);
+ }
+
+ static void filelayout_write_release(void *data)
+ {
+- struct nfs_pgio_data *wdata = data;
+- struct pnfs_layout_hdr *lo = wdata->header->lseg->pls_layout;
++ struct nfs_pgio_header *hdr = data;
++ struct pnfs_layout_hdr *lo = hdr->lseg->pls_layout;
+
+ filelayout_fenceme(lo->plh_inode, lo);
+- nfs_put_client(wdata->ds_clp);
+- wdata->header->mds_ops->rpc_release(data);
++ nfs_put_client(hdr->ds_clp);
++ hdr->mds_ops->rpc_release(data);
+ }
+
+ static void filelayout_commit_prepare(struct rpc_task *task, void *data)
+@@ -529,19 +524,18 @@ static const struct rpc_call_ops filelayout_commit_call_ops = {
+ };
+
+ static enum pnfs_try_status
+-filelayout_read_pagelist(struct nfs_pgio_data *data)
++filelayout_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ struct pnfs_layout_segment *lseg = hdr->lseg;
+ struct nfs4_pnfs_ds *ds;
+ struct rpc_clnt *ds_clnt;
+- loff_t offset = data->args.offset;
++ loff_t offset = hdr->args.offset;
+ u32 j, idx;
+ struct nfs_fh *fh;
+
+ dprintk("--> %s ino %lu pgbase %u req %Zu@%llu\n",
+ __func__, hdr->inode->i_ino,
+- data->args.pgbase, (size_t)data->args.count, offset);
++ hdr->args.pgbase, (size_t)hdr->args.count, offset);
+
+ /* Retrieve the correct rpc_client for the byte range */
+ j = nfs4_fl_calc_j_index(lseg, offset);
+@@ -559,30 +553,29 @@ filelayout_read_pagelist(struct nfs_pgio_data *data)
+
+ /* No multipath support. Use first DS */
+ atomic_inc(&ds->ds_clp->cl_count);
+- data->ds_clp = ds->ds_clp;
+- data->ds_idx = idx;
++ hdr->ds_clp = ds->ds_clp;
++ hdr->ds_idx = idx;
+ fh = nfs4_fl_select_ds_fh(lseg, j);
+ if (fh)
+- data->args.fh = fh;
++ hdr->args.fh = fh;
+
+- data->args.offset = filelayout_get_dserver_offset(lseg, offset);
+- data->mds_offset = offset;
++ hdr->args.offset = filelayout_get_dserver_offset(lseg, offset);
++ hdr->mds_offset = offset;
+
+ /* Perform an asynchronous read to ds */
+- nfs_initiate_pgio(ds_clnt, data,
++ nfs_initiate_pgio(ds_clnt, hdr,
+ &filelayout_read_call_ops, 0, RPC_TASK_SOFTCONN);
+ return PNFS_ATTEMPTED;
+ }
+
+ /* Perform async writes. */
+ static enum pnfs_try_status
+-filelayout_write_pagelist(struct nfs_pgio_data *data, int sync)
++filelayout_write_pagelist(struct nfs_pgio_header *hdr, int sync)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+ struct pnfs_layout_segment *lseg = hdr->lseg;
+ struct nfs4_pnfs_ds *ds;
+ struct rpc_clnt *ds_clnt;
+- loff_t offset = data->args.offset;
++ loff_t offset = hdr->args.offset;
+ u32 j, idx;
+ struct nfs_fh *fh;
+
+@@ -598,21 +591,20 @@ filelayout_write_pagelist(struct nfs_pgio_data *data, int sync)
+ return PNFS_NOT_ATTEMPTED;
+
+ dprintk("%s ino %lu sync %d req %Zu@%llu DS: %s cl_count %d\n",
+- __func__, hdr->inode->i_ino, sync, (size_t) data->args.count,
++ __func__, hdr->inode->i_ino, sync, (size_t) hdr->args.count,
+ offset, ds->ds_remotestr, atomic_read(&ds->ds_clp->cl_count));
+
+- data->pgio_done_cb = filelayout_write_done_cb;
++ hdr->pgio_done_cb = filelayout_write_done_cb;
+ atomic_inc(&ds->ds_clp->cl_count);
+- data->ds_clp = ds->ds_clp;
+- data->ds_idx = idx;
++ hdr->ds_clp = ds->ds_clp;
++ hdr->ds_idx = idx;
+ fh = nfs4_fl_select_ds_fh(lseg, j);
+ if (fh)
+- data->args.fh = fh;
+-
+- data->args.offset = filelayout_get_dserver_offset(lseg, offset);
++ hdr->args.fh = fh;
++ hdr->args.offset = filelayout_get_dserver_offset(lseg, offset);
+
+ /* Perform an asynchronous write */
+- nfs_initiate_pgio(ds_clnt, data,
++ nfs_initiate_pgio(ds_clnt, hdr,
+ &filelayout_write_call_ops, sync,
+ RPC_TASK_SOFTCONN);
+ return PNFS_ATTEMPTED;
+@@ -1023,6 +1015,7 @@ static u32 select_bucket_index(struct nfs4_filelayout_segment *fl, u32 j)
+
+ /* The generic layer is about to remove the req from the commit list.
+ * If this will make the bucket empty, it will need to put the lseg reference.
++ * Note this must be called while holding the inode (/cinfo) lock
+ */
+ static void
+ filelayout_clear_request_commit(struct nfs_page *req,
+@@ -1030,7 +1023,6 @@ filelayout_clear_request_commit(struct nfs_page *req,
+ {
+ struct pnfs_layout_segment *freeme = NULL;
+
+- spin_lock(cinfo->lock);
+ if (!test_and_clear_bit(PG_COMMIT_TO_DS, &req->wb_flags))
+ goto out;
+ cinfo->ds->nwritten--;
+@@ -1045,8 +1037,7 @@ filelayout_clear_request_commit(struct nfs_page *req,
+ }
+ out:
+ nfs_request_remove_commit_list(req, cinfo);
+- spin_unlock(cinfo->lock);
+- pnfs_put_lseg(freeme);
++ pnfs_put_lseg_async(freeme);
+ }
+
+ static struct list_head *
+diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
+index f415cbf9f6c3..4d0eecbc98bc 100644
+--- a/fs/nfs/internal.h
++++ b/fs/nfs/internal.h
+@@ -238,11 +238,11 @@ void nfs_set_pgio_error(struct nfs_pgio_header *hdr, int error, loff_t pos);
+ int nfs_iocounter_wait(struct nfs_io_counter *c);
+
+ extern const struct nfs_pageio_ops nfs_pgio_rw_ops;
+-struct nfs_rw_header *nfs_rw_header_alloc(const struct nfs_rw_ops *);
+-void nfs_rw_header_free(struct nfs_pgio_header *);
+-void nfs_pgio_data_release(struct nfs_pgio_data *);
++struct nfs_pgio_header *nfs_pgio_header_alloc(const struct nfs_rw_ops *);
++void nfs_pgio_header_free(struct nfs_pgio_header *);
++void nfs_pgio_data_destroy(struct nfs_pgio_header *);
+ int nfs_generic_pgio(struct nfs_pageio_descriptor *, struct nfs_pgio_header *);
+-int nfs_initiate_pgio(struct rpc_clnt *, struct nfs_pgio_data *,
++int nfs_initiate_pgio(struct rpc_clnt *, struct nfs_pgio_header *,
+ const struct rpc_call_ops *, int, int);
+ void nfs_free_request(struct nfs_page *req);
+
+@@ -482,7 +482,7 @@ static inline void nfs_inode_dio_wait(struct inode *inode)
+ extern ssize_t nfs_dreq_bytes_left(struct nfs_direct_req *dreq);
+
+ /* nfs4proc.c */
+-extern void __nfs4_read_done_cb(struct nfs_pgio_data *);
++extern void __nfs4_read_done_cb(struct nfs_pgio_header *);
+ extern struct nfs_client *nfs4_init_client(struct nfs_client *clp,
+ const struct rpc_timeout *timeparms,
+ const char *ip_addr);
+diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
+index f0afa291fd58..809670eba52a 100644
+--- a/fs/nfs/nfs3proc.c
++++ b/fs/nfs/nfs3proc.c
+@@ -795,41 +795,44 @@ nfs3_proc_pathconf(struct nfs_server *server, struct nfs_fh *fhandle,
+ return status;
+ }
+
+-static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs3_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ if (nfs3_async_handle_jukebox(task, inode))
+ return -EAGAIN;
+
+ nfs_invalidate_atime(inode);
+- nfs_refresh_inode(inode, &data->fattr);
++ nfs_refresh_inode(inode, &hdr->fattr);
+ return 0;
+ }
+
+-static void nfs3_proc_read_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs3_proc_read_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ msg->rpc_proc = &nfs3_procedures[NFS3PROC_READ];
+ }
+
+-static int nfs3_proc_pgio_rpc_prepare(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs3_proc_pgio_rpc_prepare(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+ rpc_call_start(task);
+ return 0;
+ }
+
+-static int nfs3_write_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs3_write_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ if (nfs3_async_handle_jukebox(task, inode))
+ return -EAGAIN;
+ if (task->tk_status >= 0)
+- nfs_post_op_update_inode_force_wcc(inode, data->res.fattr);
++ nfs_post_op_update_inode_force_wcc(inode, hdr->res.fattr);
+ return 0;
+ }
+
+-static void nfs3_proc_write_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs3_proc_write_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ msg->rpc_proc = &nfs3_procedures[NFS3PROC_WRITE];
+ }
+diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
+index ba2affa51941..b8ea4a26998c 100644
+--- a/fs/nfs/nfs4_fs.h
++++ b/fs/nfs/nfs4_fs.h
+@@ -337,11 +337,11 @@ nfs4_state_protect(struct nfs_client *clp, unsigned long sp4_mode,
+ */
+ static inline void
+ nfs4_state_protect_write(struct nfs_client *clp, struct rpc_clnt **clntp,
+- struct rpc_message *msg, struct nfs_pgio_data *wdata)
++ struct rpc_message *msg, struct nfs_pgio_header *hdr)
+ {
+ if (_nfs4_state_protect(clp, NFS_SP4_MACH_CRED_WRITE, clntp, msg) &&
+ !test_bit(NFS_SP4_MACH_CRED_COMMIT, &clp->cl_sp4_flags))
+- wdata->args.stable = NFS_FILE_SYNC;
++ hdr->args.stable = NFS_FILE_SYNC;
+ }
+ #else /* CONFIG_NFS_v4_1 */
+ static inline struct nfs4_session *nfs4_get_session(const struct nfs_server *server)
+@@ -369,7 +369,7 @@ nfs4_state_protect(struct nfs_client *clp, unsigned long sp4_flags,
+
+ static inline void
+ nfs4_state_protect_write(struct nfs_client *clp, struct rpc_clnt **clntp,
+- struct rpc_message *msg, struct nfs_pgio_data *wdata)
++ struct rpc_message *msg, struct nfs_pgio_header *hdr)
+ {
+ }
+ #endif /* CONFIG_NFS_V4_1 */
+diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
+index aa9ef4876046..6e045d5ee950 100644
+--- a/fs/nfs/nfs4client.c
++++ b/fs/nfs/nfs4client.c
+@@ -482,6 +482,16 @@ int nfs40_walk_client_list(struct nfs_client *new,
+
+ spin_lock(&nn->nfs_client_lock);
+ list_for_each_entry(pos, &nn->nfs_client_list, cl_share_link) {
++
++ if (pos->rpc_ops != new->rpc_ops)
++ continue;
++
++ if (pos->cl_proto != new->cl_proto)
++ continue;
++
++ if (pos->cl_minorversion != new->cl_minorversion)
++ continue;
++
+ /* If "pos" isn't marked ready, we can't trust the
+ * remaining fields in "pos" */
+ if (pos->cl_cons_state > NFS_CS_READY) {
+@@ -501,15 +511,6 @@ int nfs40_walk_client_list(struct nfs_client *new,
+ if (pos->cl_cons_state != NFS_CS_READY)
+ continue;
+
+- if (pos->rpc_ops != new->rpc_ops)
+- continue;
+-
+- if (pos->cl_proto != new->cl_proto)
+- continue;
+-
+- if (pos->cl_minorversion != new->cl_minorversion)
+- continue;
+-
+ if (pos->cl_clientid != new->cl_clientid)
+ continue;
+
+@@ -622,6 +623,16 @@ int nfs41_walk_client_list(struct nfs_client *new,
+
+ spin_lock(&nn->nfs_client_lock);
+ list_for_each_entry(pos, &nn->nfs_client_list, cl_share_link) {
++
++ if (pos->rpc_ops != new->rpc_ops)
++ continue;
++
++ if (pos->cl_proto != new->cl_proto)
++ continue;
++
++ if (pos->cl_minorversion != new->cl_minorversion)
++ continue;
++
+ /* If "pos" isn't marked ready, we can't trust the
+ * remaining fields in "pos", especially the client
+ * ID and serverowner fields. Wait for CREATE_SESSION
+@@ -647,15 +658,6 @@ int nfs41_walk_client_list(struct nfs_client *new,
+ if (pos->cl_cons_state != NFS_CS_READY)
+ continue;
+
+- if (pos->rpc_ops != new->rpc_ops)
+- continue;
+-
+- if (pos->cl_proto != new->cl_proto)
+- continue;
+-
+- if (pos->cl_minorversion != new->cl_minorversion)
+- continue;
+-
+ if (!nfs4_match_clientids(pos, new))
+ continue;
+
+diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
+index dac979866f83..3275e94538e7 100644
+--- a/fs/nfs/nfs4proc.c
++++ b/fs/nfs/nfs4proc.c
+@@ -2599,23 +2599,23 @@ static void nfs4_close_prepare(struct rpc_task *task, void *data)
+ is_rdwr = test_bit(NFS_O_RDWR_STATE, &state->flags);
+ is_rdonly = test_bit(NFS_O_RDONLY_STATE, &state->flags);
+ is_wronly = test_bit(NFS_O_WRONLY_STATE, &state->flags);
+- /* Calculate the current open share mode */
+- calldata->arg.fmode = 0;
+- if (is_rdonly || is_rdwr)
+- calldata->arg.fmode |= FMODE_READ;
+- if (is_wronly || is_rdwr)
+- calldata->arg.fmode |= FMODE_WRITE;
+ /* Calculate the change in open mode */
++ calldata->arg.fmode = 0;
+ if (state->n_rdwr == 0) {
+- if (state->n_rdonly == 0) {
+- call_close |= is_rdonly || is_rdwr;
+- calldata->arg.fmode &= ~FMODE_READ;
+- }
+- if (state->n_wronly == 0) {
+- call_close |= is_wronly || is_rdwr;
+- calldata->arg.fmode &= ~FMODE_WRITE;
+- }
+- }
++ if (state->n_rdonly == 0)
++ call_close |= is_rdonly;
++ else if (is_rdonly)
++ calldata->arg.fmode |= FMODE_READ;
++ if (state->n_wronly == 0)
++ call_close |= is_wronly;
++ else if (is_wronly)
++ calldata->arg.fmode |= FMODE_WRITE;
++ } else if (is_rdwr)
++ calldata->arg.fmode |= FMODE_READ|FMODE_WRITE;
++
++ if (calldata->arg.fmode == 0)
++ call_close |= is_rdwr;
++
+ if (!nfs4_valid_open_stateid(state))
+ call_close = 0;
+ spin_unlock(&state->owner->so_lock);
+@@ -4041,24 +4041,25 @@ static bool nfs4_error_stateid_expired(int err)
+ return false;
+ }
+
+-void __nfs4_read_done_cb(struct nfs_pgio_data *data)
++void __nfs4_read_done_cb(struct nfs_pgio_header *hdr)
+ {
+- nfs_invalidate_atime(data->header->inode);
++ nfs_invalidate_atime(hdr->inode);
+ }
+
+-static int nfs4_read_done_cb(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_read_done_cb(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct nfs_server *server = NFS_SERVER(data->header->inode);
++ struct nfs_server *server = NFS_SERVER(hdr->inode);
+
+- trace_nfs4_read(data, task->tk_status);
+- if (nfs4_async_handle_error(task, server, data->args.context->state) == -EAGAIN) {
++ trace_nfs4_read(hdr, task->tk_status);
++ if (nfs4_async_handle_error(task, server,
++ hdr->args.context->state) == -EAGAIN) {
+ rpc_restart_call_prepare(task);
+ return -EAGAIN;
+ }
+
+- __nfs4_read_done_cb(data);
++ __nfs4_read_done_cb(hdr);
+ if (task->tk_status > 0)
+- renew_lease(server, data->timestamp);
++ renew_lease(server, hdr->timestamp);
+ return 0;
+ }
+
+@@ -4076,54 +4077,59 @@ static bool nfs4_read_stateid_changed(struct rpc_task *task,
+ return true;
+ }
+
+-static int nfs4_read_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+
+ dprintk("--> %s\n", __func__);
+
+- if (!nfs4_sequence_done(task, &data->res.seq_res))
++ if (!nfs4_sequence_done(task, &hdr->res.seq_res))
+ return -EAGAIN;
+- if (nfs4_read_stateid_changed(task, &data->args))
++ if (nfs4_read_stateid_changed(task, &hdr->args))
+ return -EAGAIN;
+- return data->pgio_done_cb ? data->pgio_done_cb(task, data) :
+- nfs4_read_done_cb(task, data);
++ return hdr->pgio_done_cb ? hdr->pgio_done_cb(task, hdr) :
++ nfs4_read_done_cb(task, hdr);
+ }
+
+-static void nfs4_proc_read_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs4_proc_read_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+- data->timestamp = jiffies;
+- data->pgio_done_cb = nfs4_read_done_cb;
++ hdr->timestamp = jiffies;
++ hdr->pgio_done_cb = nfs4_read_done_cb;
+ msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_READ];
+- nfs4_init_sequence(&data->args.seq_args, &data->res.seq_res, 0);
++ nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 0);
+ }
+
+-static int nfs4_proc_pgio_rpc_prepare(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_proc_pgio_rpc_prepare(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- if (nfs4_setup_sequence(NFS_SERVER(data->header->inode),
+- &data->args.seq_args,
+- &data->res.seq_res,
++ if (nfs4_setup_sequence(NFS_SERVER(hdr->inode),
++ &hdr->args.seq_args,
++ &hdr->res.seq_res,
+ task))
+ return 0;
+- if (nfs4_set_rw_stateid(&data->args.stateid, data->args.context,
+- data->args.lock_context, data->header->rw_ops->rw_mode) == -EIO)
++ if (nfs4_set_rw_stateid(&hdr->args.stateid, hdr->args.context,
++ hdr->args.lock_context,
++ hdr->rw_ops->rw_mode) == -EIO)
+ return -EIO;
+- if (unlikely(test_bit(NFS_CONTEXT_BAD, &data->args.context->flags)))
++ if (unlikely(test_bit(NFS_CONTEXT_BAD, &hdr->args.context->flags)))
+ return -EIO;
+ return 0;
+ }
+
+-static int nfs4_write_done_cb(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_write_done_cb(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+- trace_nfs4_write(data, task->tk_status);
+- if (nfs4_async_handle_error(task, NFS_SERVER(inode), data->args.context->state) == -EAGAIN) {
++ trace_nfs4_write(hdr, task->tk_status);
++ if (nfs4_async_handle_error(task, NFS_SERVER(inode),
++ hdr->args.context->state) == -EAGAIN) {
+ rpc_restart_call_prepare(task);
+ return -EAGAIN;
+ }
+ if (task->tk_status >= 0) {
+- renew_lease(NFS_SERVER(inode), data->timestamp);
+- nfs_post_op_update_inode_force_wcc(inode, &data->fattr);
++ renew_lease(NFS_SERVER(inode), hdr->timestamp);
++ nfs_post_op_update_inode_force_wcc(inode, &hdr->fattr);
+ }
+ return 0;
+ }
+@@ -4142,23 +4148,21 @@ static bool nfs4_write_stateid_changed(struct rpc_task *task,
+ return true;
+ }
+
+-static int nfs4_write_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs4_write_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- if (!nfs4_sequence_done(task, &data->res.seq_res))
++ if (!nfs4_sequence_done(task, &hdr->res.seq_res))
+ return -EAGAIN;
+- if (nfs4_write_stateid_changed(task, &data->args))
++ if (nfs4_write_stateid_changed(task, &hdr->args))
+ return -EAGAIN;
+- return data->pgio_done_cb ? data->pgio_done_cb(task, data) :
+- nfs4_write_done_cb(task, data);
++ return hdr->pgio_done_cb ? hdr->pgio_done_cb(task, hdr) :
++ nfs4_write_done_cb(task, hdr);
+ }
+
+ static
+-bool nfs4_write_need_cache_consistency_data(const struct nfs_pgio_data *data)
++bool nfs4_write_need_cache_consistency_data(struct nfs_pgio_header *hdr)
+ {
+- const struct nfs_pgio_header *hdr = data->header;
+-
+ /* Don't request attributes for pNFS or O_DIRECT writes */
+- if (data->ds_clp != NULL || hdr->dreq != NULL)
++ if (hdr->ds_clp != NULL || hdr->dreq != NULL)
+ return false;
+ /* Otherwise, request attributes if and only if we don't hold
+ * a delegation
+@@ -4166,23 +4170,24 @@ bool nfs4_write_need_cache_consistency_data(const struct nfs_pgio_data *data)
+ return nfs4_have_delegation(hdr->inode, FMODE_READ) == 0;
+ }
+
+-static void nfs4_proc_write_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs4_proc_write_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+- struct nfs_server *server = NFS_SERVER(data->header->inode);
++ struct nfs_server *server = NFS_SERVER(hdr->inode);
+
+- if (!nfs4_write_need_cache_consistency_data(data)) {
+- data->args.bitmask = NULL;
+- data->res.fattr = NULL;
++ if (!nfs4_write_need_cache_consistency_data(hdr)) {
++ hdr->args.bitmask = NULL;
++ hdr->res.fattr = NULL;
+ } else
+- data->args.bitmask = server->cache_consistency_bitmask;
++ hdr->args.bitmask = server->cache_consistency_bitmask;
+
+- if (!data->pgio_done_cb)
+- data->pgio_done_cb = nfs4_write_done_cb;
+- data->res.server = server;
+- data->timestamp = jiffies;
++ if (!hdr->pgio_done_cb)
++ hdr->pgio_done_cb = nfs4_write_done_cb;
++ hdr->res.server = server;
++ hdr->timestamp = jiffies;
+
+ msg->rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_WRITE];
+- nfs4_init_sequence(&data->args.seq_args, &data->res.seq_res, 1);
++ nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 1);
+ }
+
+ static void nfs4_proc_commit_rpc_prepare(struct rpc_task *task, struct nfs_commit_data *data)
+diff --git a/fs/nfs/nfs4trace.h b/fs/nfs/nfs4trace.h
+index 0a744f3a86f6..1c32adbe728d 100644
+--- a/fs/nfs/nfs4trace.h
++++ b/fs/nfs/nfs4trace.h
+@@ -932,11 +932,11 @@ DEFINE_NFS4_IDMAP_EVENT(nfs4_map_gid_to_group);
+
+ DECLARE_EVENT_CLASS(nfs4_read_event,
+ TP_PROTO(
+- const struct nfs_pgio_data *data,
++ const struct nfs_pgio_header *hdr,
+ int error
+ ),
+
+- TP_ARGS(data, error),
++ TP_ARGS(hdr, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+@@ -948,12 +948,12 @@ DECLARE_EVENT_CLASS(nfs4_read_event,
+ ),
+
+ TP_fast_assign(
+- const struct inode *inode = data->header->inode;
++ const struct inode *inode = hdr->inode;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+- __entry->offset = data->args.offset;
+- __entry->count = data->args.count;
++ __entry->offset = hdr->args.offset;
++ __entry->count = hdr->args.count;
+ __entry->error = error;
+ ),
+
+@@ -972,10 +972,10 @@ DECLARE_EVENT_CLASS(nfs4_read_event,
+ #define DEFINE_NFS4_READ_EVENT(name) \
+ DEFINE_EVENT(nfs4_read_event, name, \
+ TP_PROTO( \
+- const struct nfs_pgio_data *data, \
++ const struct nfs_pgio_header *hdr, \
+ int error \
+ ), \
+- TP_ARGS(data, error))
++ TP_ARGS(hdr, error))
+ DEFINE_NFS4_READ_EVENT(nfs4_read);
+ #ifdef CONFIG_NFS_V4_1
+ DEFINE_NFS4_READ_EVENT(nfs4_pnfs_read);
+@@ -983,11 +983,11 @@ DEFINE_NFS4_READ_EVENT(nfs4_pnfs_read);
+
+ DECLARE_EVENT_CLASS(nfs4_write_event,
+ TP_PROTO(
+- const struct nfs_pgio_data *data,
++ const struct nfs_pgio_header *hdr,
+ int error
+ ),
+
+- TP_ARGS(data, error),
++ TP_ARGS(hdr, error),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+@@ -999,12 +999,12 @@ DECLARE_EVENT_CLASS(nfs4_write_event,
+ ),
+
+ TP_fast_assign(
+- const struct inode *inode = data->header->inode;
++ const struct inode *inode = hdr->inode;
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->fileid = NFS_FILEID(inode);
+ __entry->fhandle = nfs_fhandle_hash(NFS_FH(inode));
+- __entry->offset = data->args.offset;
+- __entry->count = data->args.count;
++ __entry->offset = hdr->args.offset;
++ __entry->count = hdr->args.count;
+ __entry->error = error;
+ ),
+
+@@ -1024,10 +1024,10 @@ DECLARE_EVENT_CLASS(nfs4_write_event,
+ #define DEFINE_NFS4_WRITE_EVENT(name) \
+ DEFINE_EVENT(nfs4_write_event, name, \
+ TP_PROTO( \
+- const struct nfs_pgio_data *data, \
++ const struct nfs_pgio_header *hdr, \
+ int error \
+ ), \
+- TP_ARGS(data, error))
++ TP_ARGS(hdr, error))
+ DEFINE_NFS4_WRITE_EVENT(nfs4_write);
+ #ifdef CONFIG_NFS_V4_1
+ DEFINE_NFS4_WRITE_EVENT(nfs4_pnfs_write);
+diff --git a/fs/nfs/objlayout/objio_osd.c b/fs/nfs/objlayout/objio_osd.c
+index 611320753db2..ae05278b3761 100644
+--- a/fs/nfs/objlayout/objio_osd.c
++++ b/fs/nfs/objlayout/objio_osd.c
+@@ -439,22 +439,21 @@ static void _read_done(struct ore_io_state *ios, void *private)
+ objlayout_read_done(&objios->oir, status, objios->sync);
+ }
+
+-int objio_read_pagelist(struct nfs_pgio_data *rdata)
++int objio_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = rdata->header;
+ struct objio_state *objios;
+ int ret;
+
+ ret = objio_alloc_io_state(NFS_I(hdr->inode)->layout, true,
+- hdr->lseg, rdata->args.pages, rdata->args.pgbase,
+- rdata->args.offset, rdata->args.count, rdata,
++ hdr->lseg, hdr->args.pages, hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count, hdr,
+ GFP_KERNEL, &objios);
+ if (unlikely(ret))
+ return ret;
+
+ objios->ios->done = _read_done;
+ dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
+- rdata->args.offset, rdata->args.count);
++ hdr->args.offset, hdr->args.count);
+ ret = ore_read(objios->ios);
+ if (unlikely(ret))
+ objio_free_result(&objios->oir);
+@@ -487,11 +486,11 @@ static void _write_done(struct ore_io_state *ios, void *private)
+ static struct page *__r4w_get_page(void *priv, u64 offset, bool *uptodate)
+ {
+ struct objio_state *objios = priv;
+- struct nfs_pgio_data *wdata = objios->oir.rpcdata;
+- struct address_space *mapping = wdata->header->inode->i_mapping;
++ struct nfs_pgio_header *hdr = objios->oir.rpcdata;
++ struct address_space *mapping = hdr->inode->i_mapping;
+ pgoff_t index = offset / PAGE_SIZE;
+ struct page *page;
+- loff_t i_size = i_size_read(wdata->header->inode);
++ loff_t i_size = i_size_read(hdr->inode);
+
+ if (offset >= i_size) {
+ *uptodate = true;
+@@ -531,15 +530,14 @@ static const struct _ore_r4w_op _r4w_op = {
+ .put_page = &__r4w_put_page,
+ };
+
+-int objio_write_pagelist(struct nfs_pgio_data *wdata, int how)
++int objio_write_pagelist(struct nfs_pgio_header *hdr, int how)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ struct objio_state *objios;
+ int ret;
+
+ ret = objio_alloc_io_state(NFS_I(hdr->inode)->layout, false,
+- hdr->lseg, wdata->args.pages, wdata->args.pgbase,
+- wdata->args.offset, wdata->args.count, wdata, GFP_NOFS,
++ hdr->lseg, hdr->args.pages, hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count, hdr, GFP_NOFS,
+ &objios);
+ if (unlikely(ret))
+ return ret;
+@@ -551,7 +549,7 @@ int objio_write_pagelist(struct nfs_pgio_data *wdata, int how)
+ objios->ios->done = _write_done;
+
+ dprintk("%s: offset=0x%llx length=0x%x\n", __func__,
+- wdata->args.offset, wdata->args.count);
++ hdr->args.offset, hdr->args.count);
+ ret = ore_write(objios->ios);
+ if (unlikely(ret)) {
+ objio_free_result(&objios->oir);
+diff --git a/fs/nfs/objlayout/objlayout.c b/fs/nfs/objlayout/objlayout.c
+index 765d3f54e986..86312787cee6 100644
+--- a/fs/nfs/objlayout/objlayout.c
++++ b/fs/nfs/objlayout/objlayout.c
+@@ -229,36 +229,36 @@ objlayout_io_set_result(struct objlayout_io_res *oir, unsigned index,
+ static void _rpc_read_complete(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *rdata;
++ struct nfs_pgio_header *hdr;
+
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- rdata = container_of(task, struct nfs_pgio_data, task);
++ hdr = container_of(task, struct nfs_pgio_header, task);
+
+- pnfs_ld_read_done(rdata);
++ pnfs_ld_read_done(hdr);
+ }
+
+ void
+ objlayout_read_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ {
+- struct nfs_pgio_data *rdata = oir->rpcdata;
++ struct nfs_pgio_header *hdr = oir->rpcdata;
+
+- oir->status = rdata->task.tk_status = status;
++ oir->status = hdr->task.tk_status = status;
+ if (status >= 0)
+- rdata->res.count = status;
++ hdr->res.count = status;
+ else
+- rdata->header->pnfs_error = status;
++ hdr->pnfs_error = status;
+ objlayout_iodone(oir);
+ /* must not use oir after this point */
+
+ dprintk("%s: Return status=%zd eof=%d sync=%d\n", __func__,
+- status, rdata->res.eof, sync);
++ status, hdr->res.eof, sync);
+
+ if (sync)
+- pnfs_ld_read_done(rdata);
++ pnfs_ld_read_done(hdr);
+ else {
+- INIT_WORK(&rdata->task.u.tk_work, _rpc_read_complete);
+- schedule_work(&rdata->task.u.tk_work);
++ INIT_WORK(&hdr->task.u.tk_work, _rpc_read_complete);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+ }
+
+@@ -266,12 +266,11 @@ objlayout_read_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ * Perform sync or async reads.
+ */
+ enum pnfs_try_status
+-objlayout_read_pagelist(struct nfs_pgio_data *rdata)
++objlayout_read_pagelist(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = rdata->header;
+ struct inode *inode = hdr->inode;
+- loff_t offset = rdata->args.offset;
+- size_t count = rdata->args.count;
++ loff_t offset = hdr->args.offset;
++ size_t count = hdr->args.count;
+ int err;
+ loff_t eof;
+
+@@ -279,23 +278,23 @@ objlayout_read_pagelist(struct nfs_pgio_data *rdata)
+ if (unlikely(offset + count > eof)) {
+ if (offset >= eof) {
+ err = 0;
+- rdata->res.count = 0;
+- rdata->res.eof = 1;
++ hdr->res.count = 0;
++ hdr->res.eof = 1;
+ /*FIXME: do we need to call pnfs_ld_read_done() */
+ goto out;
+ }
+ count = eof - offset;
+ }
+
+- rdata->res.eof = (offset + count) >= eof;
+- _fix_verify_io_params(hdr->lseg, &rdata->args.pages,
+- &rdata->args.pgbase,
+- rdata->args.offset, rdata->args.count);
++ hdr->res.eof = (offset + count) >= eof;
++ _fix_verify_io_params(hdr->lseg, &hdr->args.pages,
++ &hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count);
+
+ dprintk("%s: inode(%lx) offset 0x%llx count 0x%Zx eof=%d\n",
+- __func__, inode->i_ino, offset, count, rdata->res.eof);
++ __func__, inode->i_ino, offset, count, hdr->res.eof);
+
+- err = objio_read_pagelist(rdata);
++ err = objio_read_pagelist(hdr);
+ out:
+ if (unlikely(err)) {
+ hdr->pnfs_error = err;
+@@ -312,38 +311,38 @@ objlayout_read_pagelist(struct nfs_pgio_data *rdata)
+ static void _rpc_write_complete(struct work_struct *work)
+ {
+ struct rpc_task *task;
+- struct nfs_pgio_data *wdata;
++ struct nfs_pgio_header *hdr;
+
+ dprintk("%s enter\n", __func__);
+ task = container_of(work, struct rpc_task, u.tk_work);
+- wdata = container_of(task, struct nfs_pgio_data, task);
++ hdr = container_of(task, struct nfs_pgio_header, task);
+
+- pnfs_ld_write_done(wdata);
++ pnfs_ld_write_done(hdr);
+ }
+
+ void
+ objlayout_write_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ {
+- struct nfs_pgio_data *wdata = oir->rpcdata;
++ struct nfs_pgio_header *hdr = oir->rpcdata;
+
+- oir->status = wdata->task.tk_status = status;
++ oir->status = hdr->task.tk_status = status;
+ if (status >= 0) {
+- wdata->res.count = status;
+- wdata->verf.committed = oir->committed;
++ hdr->res.count = status;
++ hdr->writeverf.committed = oir->committed;
+ } else {
+- wdata->header->pnfs_error = status;
++ hdr->pnfs_error = status;
+ }
+ objlayout_iodone(oir);
+ /* must not use oir after this point */
+
+ dprintk("%s: Return status %zd committed %d sync=%d\n", __func__,
+- status, wdata->verf.committed, sync);
++ status, hdr->writeverf.committed, sync);
+
+ if (sync)
+- pnfs_ld_write_done(wdata);
++ pnfs_ld_write_done(hdr);
+ else {
+- INIT_WORK(&wdata->task.u.tk_work, _rpc_write_complete);
+- schedule_work(&wdata->task.u.tk_work);
++ INIT_WORK(&hdr->task.u.tk_work, _rpc_write_complete);
++ schedule_work(&hdr->task.u.tk_work);
+ }
+ }
+
+@@ -351,17 +350,15 @@ objlayout_write_done(struct objlayout_io_res *oir, ssize_t status, bool sync)
+ * Perform sync or async writes.
+ */
+ enum pnfs_try_status
+-objlayout_write_pagelist(struct nfs_pgio_data *wdata,
+- int how)
++objlayout_write_pagelist(struct nfs_pgio_header *hdr, int how)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ int err;
+
+- _fix_verify_io_params(hdr->lseg, &wdata->args.pages,
+- &wdata->args.pgbase,
+- wdata->args.offset, wdata->args.count);
++ _fix_verify_io_params(hdr->lseg, &hdr->args.pages,
++ &hdr->args.pgbase,
++ hdr->args.offset, hdr->args.count);
+
+- err = objio_write_pagelist(wdata, how);
++ err = objio_write_pagelist(hdr, how);
+ if (unlikely(err)) {
+ hdr->pnfs_error = err;
+ dprintk("%s: Returned Error %d\n", __func__, err);
+diff --git a/fs/nfs/objlayout/objlayout.h b/fs/nfs/objlayout/objlayout.h
+index 01e041029a6c..fd13f1d2f136 100644
+--- a/fs/nfs/objlayout/objlayout.h
++++ b/fs/nfs/objlayout/objlayout.h
+@@ -119,8 +119,8 @@ extern void objio_free_lseg(struct pnfs_layout_segment *lseg);
+ */
+ extern void objio_free_result(struct objlayout_io_res *oir);
+
+-extern int objio_read_pagelist(struct nfs_pgio_data *rdata);
+-extern int objio_write_pagelist(struct nfs_pgio_data *wdata, int how);
++extern int objio_read_pagelist(struct nfs_pgio_header *rdata);
++extern int objio_write_pagelist(struct nfs_pgio_header *wdata, int how);
+
+ /*
+ * callback API
+@@ -168,10 +168,10 @@ extern struct pnfs_layout_segment *objlayout_alloc_lseg(
+ extern void objlayout_free_lseg(struct pnfs_layout_segment *);
+
+ extern enum pnfs_try_status objlayout_read_pagelist(
+- struct nfs_pgio_data *);
++ struct nfs_pgio_header *);
+
+ extern enum pnfs_try_status objlayout_write_pagelist(
+- struct nfs_pgio_data *,
++ struct nfs_pgio_header *,
+ int how);
+
+ extern void objlayout_encode_layoutcommit(
+diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
+index 17fab89f6358..34136ff5abf0 100644
+--- a/fs/nfs/pagelist.c
++++ b/fs/nfs/pagelist.c
+@@ -145,19 +145,51 @@ static int nfs_wait_bit_uninterruptible(void *word)
+ /*
+ * nfs_page_group_lock - lock the head of the page group
+ * @req - request in group that is to be locked
++ * @nonblock - if true don't block waiting for lock
+ *
+ * this lock must be held if modifying the page group list
++ *
++ * return 0 on success, < 0 on error: -EAGAIN if nonblocking or the
++ * result from wait_on_bit_lock
++ *
++ * NOTE: calling with nonblock=false should always have set the
++ * lock bit (see fs/buffer.c and other uses of wait_on_bit_lock
++ * with TASK_UNINTERRUPTIBLE), so there is no need to check the result.
++ */
++int
++nfs_page_group_lock(struct nfs_page *req, bool nonblock)
++{
++ struct nfs_page *head = req->wb_head;
++
++ WARN_ON_ONCE(head != head->wb_head);
++
++ if (!test_and_set_bit(PG_HEADLOCK, &head->wb_flags))
++ return 0;
++
++ if (!nonblock)
++ return wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
++ nfs_wait_bit_uninterruptible,
++ TASK_UNINTERRUPTIBLE);
++
++ return -EAGAIN;
++}
++
++/*
++ * nfs_page_group_lock_wait - wait for the lock to clear, but don't grab it
++ * @req - a request in the group
++ *
++ * This is a blocking call to wait for the group lock to be cleared.
+ */
+ void
+-nfs_page_group_lock(struct nfs_page *req)
++nfs_page_group_lock_wait(struct nfs_page *req)
+ {
+ struct nfs_page *head = req->wb_head;
+
+ WARN_ON_ONCE(head != head->wb_head);
+
+- wait_on_bit_lock(&head->wb_flags, PG_HEADLOCK,
+- nfs_wait_bit_uninterruptible,
+- TASK_UNINTERRUPTIBLE);
++ wait_on_bit(&head->wb_flags, PG_HEADLOCK,
++ nfs_wait_bit_uninterruptible,
++ TASK_UNINTERRUPTIBLE);
+ }
+
+ /*
+@@ -218,7 +250,7 @@ bool nfs_page_group_sync_on_bit(struct nfs_page *req, unsigned int bit)
+ {
+ bool ret;
+
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+ ret = nfs_page_group_sync_on_bit_locked(req, bit);
+ nfs_page_group_unlock(req);
+
+@@ -462,123 +494,72 @@ size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
+ }
+ EXPORT_SYMBOL_GPL(nfs_generic_pg_test);
+
+-static inline struct nfs_rw_header *NFS_RW_HEADER(struct nfs_pgio_header *hdr)
+-{
+- return container_of(hdr, struct nfs_rw_header, header);
+-}
+-
+-/**
+- * nfs_rw_header_alloc - Allocate a header for a read or write
+- * @ops: Read or write function vector
+- */
+-struct nfs_rw_header *nfs_rw_header_alloc(const struct nfs_rw_ops *ops)
++struct nfs_pgio_header *nfs_pgio_header_alloc(const struct nfs_rw_ops *ops)
+ {
+- struct nfs_rw_header *header = ops->rw_alloc_header();
+-
+- if (header) {
+- struct nfs_pgio_header *hdr = &header->header;
++ struct nfs_pgio_header *hdr = ops->rw_alloc_header();
+
++ if (hdr) {
+ INIT_LIST_HEAD(&hdr->pages);
+ spin_lock_init(&hdr->lock);
+- atomic_set(&hdr->refcnt, 0);
+ hdr->rw_ops = ops;
+ }
+- return header;
++ return hdr;
+ }
+-EXPORT_SYMBOL_GPL(nfs_rw_header_alloc);
++EXPORT_SYMBOL_GPL(nfs_pgio_header_alloc);
+
+ /*
+- * nfs_rw_header_free - Free a read or write header
++ * nfs_pgio_header_free - Free a read or write header
+ * @hdr: The header to free
+ */
+-void nfs_rw_header_free(struct nfs_pgio_header *hdr)
++void nfs_pgio_header_free(struct nfs_pgio_header *hdr)
+ {
+- hdr->rw_ops->rw_free_header(NFS_RW_HEADER(hdr));
+-}
+-EXPORT_SYMBOL_GPL(nfs_rw_header_free);
+-
+-/**
+- * nfs_pgio_data_alloc - Allocate pageio data
+- * @hdr: The header making a request
+- * @pagecount: Number of pages to create
+- */
+-static struct nfs_pgio_data *nfs_pgio_data_alloc(struct nfs_pgio_header *hdr,
+- unsigned int pagecount)
+-{
+- struct nfs_pgio_data *data, *prealloc;
+-
+- prealloc = &NFS_RW_HEADER(hdr)->rpc_data;
+- if (prealloc->header == NULL)
+- data = prealloc;
+- else
+- data = kzalloc(sizeof(*data), GFP_KERNEL);
+- if (!data)
+- goto out;
+-
+- if (nfs_pgarray_set(&data->pages, pagecount)) {
+- data->header = hdr;
+- atomic_inc(&hdr->refcnt);
+- } else {
+- if (data != prealloc)
+- kfree(data);
+- data = NULL;
+- }
+-out:
+- return data;
++ hdr->rw_ops->rw_free_header(hdr);
+ }
++EXPORT_SYMBOL_GPL(nfs_pgio_header_free);
+
+ /**
+- * nfs_pgio_data_release - Properly free pageio data
+- * @data: The data to release
++ * nfs_pgio_data_destroy - make @hdr suitable for reuse
++ *
++ * Frees memory and releases refs from nfs_generic_pgio, so that it may
++ * be called again.
++ *
++ * @hdr: A header that has had nfs_generic_pgio called
+ */
+-void nfs_pgio_data_release(struct nfs_pgio_data *data)
++void nfs_pgio_data_destroy(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- struct nfs_rw_header *pageio_header = NFS_RW_HEADER(hdr);
+-
+- put_nfs_open_context(data->args.context);
+- if (data->pages.pagevec != data->pages.page_array)
+- kfree(data->pages.pagevec);
+- if (data == &pageio_header->rpc_data) {
+- data->header = NULL;
+- data = NULL;
+- }
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+- /* Note: we only free the rpc_task after callbacks are done.
+- * See the comment in rpc_free_task() for why
+- */
+- kfree(data);
++ put_nfs_open_context(hdr->args.context);
++ if (hdr->page_array.pagevec != hdr->page_array.page_array)
++ kfree(hdr->page_array.pagevec);
+ }
+-EXPORT_SYMBOL_GPL(nfs_pgio_data_release);
++EXPORT_SYMBOL_GPL(nfs_pgio_data_destroy);
+
+ /**
+ * nfs_pgio_rpcsetup - Set up arguments for a pageio call
+- * @data: The pageio data
++ * @hdr: The pageio hdr
+ * @count: Number of bytes to read
+ * @offset: Initial offset
+ * @how: How to commit data (writes only)
+ * @cinfo: Commit information for the call (writes only)
+ */
+-static void nfs_pgio_rpcsetup(struct nfs_pgio_data *data,
++static void nfs_pgio_rpcsetup(struct nfs_pgio_header *hdr,
+ unsigned int count, unsigned int offset,
+ int how, struct nfs_commit_info *cinfo)
+ {
+- struct nfs_page *req = data->header->req;
++ struct nfs_page *req = hdr->req;
+
+ /* Set up the RPC argument and reply structs
+- * NB: take care not to mess about with data->commit et al. */
++ * NB: take care not to mess about with hdr->commit et al. */
+
+- data->args.fh = NFS_FH(data->header->inode);
+- data->args.offset = req_offset(req) + offset;
++ hdr->args.fh = NFS_FH(hdr->inode);
++ hdr->args.offset = req_offset(req) + offset;
+ /* pnfs_set_layoutcommit needs this */
+- data->mds_offset = data->args.offset;
+- data->args.pgbase = req->wb_pgbase + offset;
+- data->args.pages = data->pages.pagevec;
+- data->args.count = count;
+- data->args.context = get_nfs_open_context(req->wb_context);
+- data->args.lock_context = req->wb_lock_context;
+- data->args.stable = NFS_UNSTABLE;
++ hdr->mds_offset = hdr->args.offset;
++ hdr->args.pgbase = req->wb_pgbase + offset;
++ hdr->args.pages = hdr->page_array.pagevec;
++ hdr->args.count = count;
++ hdr->args.context = get_nfs_open_context(req->wb_context);
++ hdr->args.lock_context = req->wb_lock_context;
++ hdr->args.stable = NFS_UNSTABLE;
+ switch (how & (FLUSH_STABLE | FLUSH_COND_STABLE)) {
+ case 0:
+ break;
+@@ -586,59 +567,60 @@ static void nfs_pgio_rpcsetup(struct nfs_pgio_data *data,
+ if (nfs_reqs_to_commit(cinfo))
+ break;
+ default:
+- data->args.stable = NFS_FILE_SYNC;
++ hdr->args.stable = NFS_FILE_SYNC;
+ }
+
+- data->res.fattr = &data->fattr;
+- data->res.count = count;
+- data->res.eof = 0;
+- data->res.verf = &data->verf;
+- nfs_fattr_init(&data->fattr);
++ hdr->res.fattr = &hdr->fattr;
++ hdr->res.count = count;
++ hdr->res.eof = 0;
++ hdr->res.verf = &hdr->writeverf;
++ nfs_fattr_init(&hdr->fattr);
+ }
+
+ /**
+- * nfs_pgio_prepare - Prepare pageio data to go over the wire
++ * nfs_pgio_prepare - Prepare pageio hdr to go over the wire
+ * @task: The current task
+- * @calldata: pageio data to prepare
++ * @calldata: pageio header to prepare
+ */
+ static void nfs_pgio_prepare(struct rpc_task *task, void *calldata)
+ {
+- struct nfs_pgio_data *data = calldata;
++ struct nfs_pgio_header *hdr = calldata;
+ int err;
+- err = NFS_PROTO(data->header->inode)->pgio_rpc_prepare(task, data);
++ err = NFS_PROTO(hdr->inode)->pgio_rpc_prepare(task, hdr);
+ if (err)
+ rpc_exit(task, err);
+ }
+
+-int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_data *data,
++int nfs_initiate_pgio(struct rpc_clnt *clnt, struct nfs_pgio_header *hdr,
+ const struct rpc_call_ops *call_ops, int how, int flags)
+ {
++ struct inode *inode = hdr->inode;
+ struct rpc_task *task;
+ struct rpc_message msg = {
+- .rpc_argp = &data->args,
+- .rpc_resp = &data->res,
+- .rpc_cred = data->header->cred,
++ .rpc_argp = &hdr->args,
++ .rpc_resp = &hdr->res,
++ .rpc_cred = hdr->cred,
+ };
+ struct rpc_task_setup task_setup_data = {
+ .rpc_client = clnt,
+- .task = &data->task,
++ .task = &hdr->task,
+ .rpc_message = &msg,
+ .callback_ops = call_ops,
+- .callback_data = data,
++ .callback_data = hdr,
+ .workqueue = nfsiod_workqueue,
+ .flags = RPC_TASK_ASYNC | flags,
+ };
+ int ret = 0;
+
+- data->header->rw_ops->rw_initiate(data, &msg, &task_setup_data, how);
++ hdr->rw_ops->rw_initiate(hdr, &msg, &task_setup_data, how);
+
+ dprintk("NFS: %5u initiated pgio call "
+ "(req %s/%llu, %u bytes @ offset %llu)\n",
+- data->task.tk_pid,
+- data->header->inode->i_sb->s_id,
+- (unsigned long long)NFS_FILEID(data->header->inode),
+- data->args.count,
+- (unsigned long long)data->args.offset);
++ hdr->task.tk_pid,
++ inode->i_sb->s_id,
++ (unsigned long long)NFS_FILEID(inode),
++ hdr->args.count,
++ (unsigned long long)hdr->args.offset);
+
+ task = rpc_run_task(&task_setup_data);
+ if (IS_ERR(task)) {
+@@ -665,22 +647,23 @@ static int nfs_pgio_error(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr)
+ {
+ set_bit(NFS_IOHDR_REDO, &hdr->flags);
+- nfs_pgio_data_release(hdr->data);
+- hdr->data = NULL;
++ nfs_pgio_data_destroy(hdr);
++ hdr->completion_ops->completion(hdr);
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ return -ENOMEM;
+ }
+
+ /**
+ * nfs_pgio_release - Release pageio data
+- * @calldata: The pageio data to release
++ * @calldata: The pageio header to release
+ */
+ static void nfs_pgio_release(void *calldata)
+ {
+- struct nfs_pgio_data *data = calldata;
+- if (data->header->rw_ops->rw_release)
+- data->header->rw_ops->rw_release(data);
+- nfs_pgio_data_release(data);
++ struct nfs_pgio_header *hdr = calldata;
++ if (hdr->rw_ops->rw_release)
++ hdr->rw_ops->rw_release(hdr);
++ nfs_pgio_data_destroy(hdr);
++ hdr->completion_ops->completion(hdr);
+ }
+
+ /**
+@@ -721,22 +704,22 @@ EXPORT_SYMBOL_GPL(nfs_pageio_init);
+ /**
+ * nfs_pgio_result - Basic pageio error handling
+ * @task: The task that ran
+- * @calldata: Pageio data to check
++ * @calldata: Pageio header to check
+ */
+ static void nfs_pgio_result(struct rpc_task *task, void *calldata)
+ {
+- struct nfs_pgio_data *data = calldata;
+- struct inode *inode = data->header->inode;
++ struct nfs_pgio_header *hdr = calldata;
++ struct inode *inode = hdr->inode;
+
+ dprintk("NFS: %s: %5u, (status %d)\n", __func__,
+ task->tk_pid, task->tk_status);
+
+- if (data->header->rw_ops->rw_done(task, data, inode) != 0)
++ if (hdr->rw_ops->rw_done(task, hdr, inode) != 0)
+ return;
+ if (task->tk_status < 0)
+- nfs_set_pgio_error(data->header, task->tk_status, data->args.offset);
++ nfs_set_pgio_error(hdr, task->tk_status, hdr->args.offset);
+ else
+- data->header->rw_ops->rw_result(task, data);
++ hdr->rw_ops->rw_result(task, hdr);
+ }
+
+ /*
+@@ -751,32 +734,42 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr)
+ {
+ struct nfs_page *req;
+- struct page **pages;
+- struct nfs_pgio_data *data;
++ struct page **pages,
++ *last_page;
+ struct list_head *head = &desc->pg_list;
+ struct nfs_commit_info cinfo;
++ unsigned int pagecount, pageused;
+
+- data = nfs_pgio_data_alloc(hdr, nfs_page_array_len(desc->pg_base,
+- desc->pg_count));
+- if (!data)
++ pagecount = nfs_page_array_len(desc->pg_base, desc->pg_count);
++ if (!nfs_pgarray_set(&hdr->page_array, pagecount))
+ return nfs_pgio_error(desc, hdr);
+
+ nfs_init_cinfo(&cinfo, desc->pg_inode, desc->pg_dreq);
+- pages = data->pages.pagevec;
++ pages = hdr->page_array.pagevec;
++ last_page = NULL;
++ pageused = 0;
+ while (!list_empty(head)) {
+ req = nfs_list_entry(head->next);
+ nfs_list_remove_request(req);
+ nfs_list_add_request(req, &hdr->pages);
+- *pages++ = req->wb_page;
++
++ if (WARN_ON_ONCE(pageused >= pagecount))
++ return nfs_pgio_error(desc, hdr);
++
++ if (!last_page || last_page != req->wb_page) {
++ *pages++ = last_page = req->wb_page;
++ pageused++;
++ }
+ }
++ if (WARN_ON_ONCE(pageused != pagecount))
++ return nfs_pgio_error(desc, hdr);
+
+ if ((desc->pg_ioflags & FLUSH_COND_STABLE) &&
+ (desc->pg_moreio || nfs_reqs_to_commit(&cinfo)))
+ desc->pg_ioflags &= ~FLUSH_COND_STABLE;
+
+ /* Set up the argument struct */
+- nfs_pgio_rpcsetup(data, desc->pg_count, 0, desc->pg_ioflags, &cinfo);
+- hdr->data = data;
++ nfs_pgio_rpcsetup(hdr, desc->pg_count, 0, desc->pg_ioflags, &cinfo);
+ desc->pg_rpc_callops = &nfs_pgio_common_ops;
+ return 0;
+ }
+@@ -784,25 +777,20 @@ EXPORT_SYMBOL_GPL(nfs_generic_pgio);
+
+ static int nfs_generic_pg_pgios(struct nfs_pageio_descriptor *desc)
+ {
+- struct nfs_rw_header *rw_hdr;
+ struct nfs_pgio_header *hdr;
+ int ret;
+
+- rw_hdr = nfs_rw_header_alloc(desc->pg_rw_ops);
+- if (!rw_hdr) {
++ hdr = nfs_pgio_header_alloc(desc->pg_rw_ops);
++ if (!hdr) {
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ return -ENOMEM;
+ }
+- hdr = &rw_hdr->header;
+- nfs_pgheader_init(desc, hdr, nfs_rw_header_free);
+- atomic_inc(&hdr->refcnt);
++ nfs_pgheader_init(desc, hdr, nfs_pgio_header_free);
+ ret = nfs_generic_pgio(desc, hdr);
+ if (ret == 0)
+ ret = nfs_initiate_pgio(NFS_CLIENT(hdr->inode),
+- hdr->data, desc->pg_rpc_callops,
++ hdr, desc->pg_rpc_callops,
+ desc->pg_ioflags, 0);
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+ return ret;
+ }
+
+@@ -845,6 +833,14 @@ static bool nfs_can_coalesce_requests(struct nfs_page *prev,
+ return false;
+ if (req_offset(req) != req_offset(prev) + prev->wb_bytes)
+ return false;
++ if (req->wb_page == prev->wb_page) {
++ if (req->wb_pgbase != prev->wb_pgbase + prev->wb_bytes)
++ return false;
++ } else {
++ if (req->wb_pgbase != 0 ||
++ prev->wb_pgbase + prev->wb_bytes != PAGE_CACHE_SIZE)
++ return false;
++ }
+ }
+ size = pgio->pg_ops->pg_test(pgio, prev, req);
+ WARN_ON_ONCE(size > req->wb_bytes);
+@@ -916,7 +912,7 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
+ unsigned int bytes_left = 0;
+ unsigned int offset, pgbase;
+
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+
+ subreq = req;
+ bytes_left = subreq->wb_bytes;
+@@ -938,7 +934,7 @@ static int __nfs_pageio_add_request(struct nfs_pageio_descriptor *desc,
+ if (desc->pg_recoalesce)
+ return 0;
+ /* retry add_request for this subreq */
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+ continue;
+ }
+
+diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
+index 6fdcd233d6f7..5f3eb3df7c59 100644
+--- a/fs/nfs/pnfs.c
++++ b/fs/nfs/pnfs.c
+@@ -361,6 +361,23 @@ pnfs_put_lseg(struct pnfs_layout_segment *lseg)
+ }
+ EXPORT_SYMBOL_GPL(pnfs_put_lseg);
+
++static void pnfs_put_lseg_async_work(struct work_struct *work)
++{
++ struct pnfs_layout_segment *lseg;
++
++ lseg = container_of(work, struct pnfs_layout_segment, pls_work);
++
++ pnfs_put_lseg(lseg);
++}
++
++void
++pnfs_put_lseg_async(struct pnfs_layout_segment *lseg)
++{
++ INIT_WORK(&lseg->pls_work, pnfs_put_lseg_async_work);
++ schedule_work(&lseg->pls_work);
++}
++EXPORT_SYMBOL_GPL(pnfs_put_lseg_async);
++
+ static u64
+ end_offset(u64 start, u64 len)
+ {
+@@ -1502,9 +1519,8 @@ int pnfs_write_done_resend_to_mds(struct inode *inode,
+ }
+ EXPORT_SYMBOL_GPL(pnfs_write_done_resend_to_mds);
+
+-static void pnfs_ld_handle_write_error(struct nfs_pgio_data *data)
++static void pnfs_ld_handle_write_error(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+
+ dprintk("pnfs write error = %d\n", hdr->pnfs_error);
+ if (NFS_SERVER(hdr->inode)->pnfs_curr_ld->flags &
+@@ -1512,7 +1528,7 @@ static void pnfs_ld_handle_write_error(struct nfs_pgio_data *data)
+ pnfs_return_layout(hdr->inode);
+ }
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags))
+- data->task.tk_status = pnfs_write_done_resend_to_mds(hdr->inode,
++ hdr->task.tk_status = pnfs_write_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+ hdr->completion_ops,
+ hdr->dreq);
+@@ -1521,41 +1537,36 @@ static void pnfs_ld_handle_write_error(struct nfs_pgio_data *data)
+ /*
+ * Called by non rpc-based layout drivers
+ */
+-void pnfs_ld_write_done(struct nfs_pgio_data *data)
++void pnfs_ld_write_done(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+- trace_nfs4_pnfs_write(data, hdr->pnfs_error);
++ trace_nfs4_pnfs_write(hdr, hdr->pnfs_error);
+ if (!hdr->pnfs_error) {
+- pnfs_set_layoutcommit(data);
+- hdr->mds_ops->rpc_call_done(&data->task, data);
++ pnfs_set_layoutcommit(hdr);
++ hdr->mds_ops->rpc_call_done(&hdr->task, hdr);
+ } else
+- pnfs_ld_handle_write_error(data);
+- hdr->mds_ops->rpc_release(data);
++ pnfs_ld_handle_write_error(hdr);
++ hdr->mds_ops->rpc_release(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_ld_write_done);
+
+ static void
+ pnfs_write_through_mds(struct nfs_pageio_descriptor *desc,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ list_splice_tail_init(&hdr->pages, &desc->pg_list);
+ nfs_pageio_reset_write_mds(desc);
+ desc->pg_recoalesce = 1;
+ }
+- nfs_pgio_data_release(data);
++ nfs_pgio_data_destroy(hdr);
+ }
+
+ static enum pnfs_try_status
+-pnfs_try_to_write_data(struct nfs_pgio_data *wdata,
++pnfs_try_to_write_data(struct nfs_pgio_header *hdr,
+ const struct rpc_call_ops *call_ops,
+ struct pnfs_layout_segment *lseg,
+ int how)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ struct inode *inode = hdr->inode;
+ enum pnfs_try_status trypnfs;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+@@ -1563,8 +1574,8 @@ pnfs_try_to_write_data(struct nfs_pgio_data *wdata,
+ hdr->mds_ops = call_ops;
+
+ dprintk("%s: Writing ino:%lu %u@%llu (how %d)\n", __func__,
+- inode->i_ino, wdata->args.count, wdata->args.offset, how);
+- trypnfs = nfss->pnfs_curr_ld->write_pagelist(wdata, how);
++ inode->i_ino, hdr->args.count, hdr->args.offset, how);
++ trypnfs = nfss->pnfs_curr_ld->write_pagelist(hdr, how);
+ if (trypnfs != PNFS_NOT_ATTEMPTED)
+ nfs_inc_stats(inode, NFSIOS_PNFS_WRITE);
+ dprintk("%s End (trypnfs:%d)\n", __func__, trypnfs);
+@@ -1575,51 +1586,45 @@ static void
+ pnfs_do_write(struct nfs_pageio_descriptor *desc,
+ struct nfs_pgio_header *hdr, int how)
+ {
+- struct nfs_pgio_data *data = hdr->data;
+ const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
+ struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ enum pnfs_try_status trypnfs;
+
+ desc->pg_lseg = NULL;
+- trypnfs = pnfs_try_to_write_data(data, call_ops, lseg, how);
++ trypnfs = pnfs_try_to_write_data(hdr, call_ops, lseg, how);
+ if (trypnfs == PNFS_NOT_ATTEMPTED)
+- pnfs_write_through_mds(desc, data);
++ pnfs_write_through_mds(desc, hdr);
+ pnfs_put_lseg(lseg);
+ }
+
+ static void pnfs_writehdr_free(struct nfs_pgio_header *hdr)
+ {
+ pnfs_put_lseg(hdr->lseg);
+- nfs_rw_header_free(hdr);
++ nfs_pgio_header_free(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_writehdr_free);
+
+ int
+ pnfs_generic_pg_writepages(struct nfs_pageio_descriptor *desc)
+ {
+- struct nfs_rw_header *whdr;
+ struct nfs_pgio_header *hdr;
+ int ret;
+
+- whdr = nfs_rw_header_alloc(desc->pg_rw_ops);
+- if (!whdr) {
++ hdr = nfs_pgio_header_alloc(desc->pg_rw_ops);
++ if (!hdr) {
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ return -ENOMEM;
+ }
+- hdr = &whdr->header;
+ nfs_pgheader_init(desc, hdr, pnfs_writehdr_free);
+ hdr->lseg = pnfs_get_lseg(desc->pg_lseg);
+- atomic_inc(&hdr->refcnt);
+ ret = nfs_generic_pgio(desc, hdr);
+ if (ret != 0) {
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ } else
+ pnfs_do_write(desc, hdr, desc->pg_ioflags);
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+ return ret;
+ }
+ EXPORT_SYMBOL_GPL(pnfs_generic_pg_writepages);
+@@ -1652,17 +1657,15 @@ int pnfs_read_done_resend_to_mds(struct inode *inode,
+ }
+ EXPORT_SYMBOL_GPL(pnfs_read_done_resend_to_mds);
+
+-static void pnfs_ld_handle_read_error(struct nfs_pgio_data *data)
++static void pnfs_ld_handle_read_error(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+ dprintk("pnfs read error = %d\n", hdr->pnfs_error);
+ if (NFS_SERVER(hdr->inode)->pnfs_curr_ld->flags &
+ PNFS_LAYOUTRET_ON_ERROR) {
+ pnfs_return_layout(hdr->inode);
+ }
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags))
+- data->task.tk_status = pnfs_read_done_resend_to_mds(hdr->inode,
++ hdr->task.tk_status = pnfs_read_done_resend_to_mds(hdr->inode,
+ &hdr->pages,
+ hdr->completion_ops,
+ hdr->dreq);
+@@ -1671,43 +1674,38 @@ static void pnfs_ld_handle_read_error(struct nfs_pgio_data *data)
+ /*
+ * Called by non rpc-based layout drivers
+ */
+-void pnfs_ld_read_done(struct nfs_pgio_data *data)
++void pnfs_ld_read_done(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+- trace_nfs4_pnfs_read(data, hdr->pnfs_error);
++ trace_nfs4_pnfs_read(hdr, hdr->pnfs_error);
+ if (likely(!hdr->pnfs_error)) {
+- __nfs4_read_done_cb(data);
+- hdr->mds_ops->rpc_call_done(&data->task, data);
++ __nfs4_read_done_cb(hdr);
++ hdr->mds_ops->rpc_call_done(&hdr->task, hdr);
+ } else
+- pnfs_ld_handle_read_error(data);
+- hdr->mds_ops->rpc_release(data);
++ pnfs_ld_handle_read_error(hdr);
++ hdr->mds_ops->rpc_release(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_ld_read_done);
+
+ static void
+ pnfs_read_through_mds(struct nfs_pageio_descriptor *desc,
+- struct nfs_pgio_data *data)
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+ if (!test_and_set_bit(NFS_IOHDR_REDO, &hdr->flags)) {
+ list_splice_tail_init(&hdr->pages, &desc->pg_list);
+ nfs_pageio_reset_read_mds(desc);
+ desc->pg_recoalesce = 1;
+ }
+- nfs_pgio_data_release(data);
++ nfs_pgio_data_destroy(hdr);
+ }
+
+ /*
+ * Call the appropriate parallel I/O subsystem read function.
+ */
+ static enum pnfs_try_status
+-pnfs_try_to_read_data(struct nfs_pgio_data *rdata,
++pnfs_try_to_read_data(struct nfs_pgio_header *hdr,
+ const struct rpc_call_ops *call_ops,
+ struct pnfs_layout_segment *lseg)
+ {
+- struct nfs_pgio_header *hdr = rdata->header;
+ struct inode *inode = hdr->inode;
+ struct nfs_server *nfss = NFS_SERVER(inode);
+ enum pnfs_try_status trypnfs;
+@@ -1715,9 +1713,9 @@ pnfs_try_to_read_data(struct nfs_pgio_data *rdata,
+ hdr->mds_ops = call_ops;
+
+ dprintk("%s: Reading ino:%lu %u@%llu\n",
+- __func__, inode->i_ino, rdata->args.count, rdata->args.offset);
++ __func__, inode->i_ino, hdr->args.count, hdr->args.offset);
+
+- trypnfs = nfss->pnfs_curr_ld->read_pagelist(rdata);
++ trypnfs = nfss->pnfs_curr_ld->read_pagelist(hdr);
+ if (trypnfs != PNFS_NOT_ATTEMPTED)
+ nfs_inc_stats(inode, NFSIOS_PNFS_READ);
+ dprintk("%s End (trypnfs:%d)\n", __func__, trypnfs);
+@@ -1727,52 +1725,46 @@ pnfs_try_to_read_data(struct nfs_pgio_data *rdata,
+ static void
+ pnfs_do_read(struct nfs_pageio_descriptor *desc, struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_data *data = hdr->data;
+ const struct rpc_call_ops *call_ops = desc->pg_rpc_callops;
+ struct pnfs_layout_segment *lseg = desc->pg_lseg;
+ enum pnfs_try_status trypnfs;
+
+ desc->pg_lseg = NULL;
+- trypnfs = pnfs_try_to_read_data(data, call_ops, lseg);
++ trypnfs = pnfs_try_to_read_data(hdr, call_ops, lseg);
+ if (trypnfs == PNFS_NOT_ATTEMPTED)
+- pnfs_read_through_mds(desc, data);
++ pnfs_read_through_mds(desc, hdr);
+ pnfs_put_lseg(lseg);
+ }
+
+ static void pnfs_readhdr_free(struct nfs_pgio_header *hdr)
+ {
+ pnfs_put_lseg(hdr->lseg);
+- nfs_rw_header_free(hdr);
++ nfs_pgio_header_free(hdr);
+ }
+ EXPORT_SYMBOL_GPL(pnfs_readhdr_free);
+
+ int
+ pnfs_generic_pg_readpages(struct nfs_pageio_descriptor *desc)
+ {
+- struct nfs_rw_header *rhdr;
+ struct nfs_pgio_header *hdr;
+ int ret;
+
+- rhdr = nfs_rw_header_alloc(desc->pg_rw_ops);
+- if (!rhdr) {
++ hdr = nfs_pgio_header_alloc(desc->pg_rw_ops);
++ if (!hdr) {
+ desc->pg_completion_ops->error_cleanup(&desc->pg_list);
+ ret = -ENOMEM;
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ return ret;
+ }
+- hdr = &rhdr->header;
+ nfs_pgheader_init(desc, hdr, pnfs_readhdr_free);
+ hdr->lseg = pnfs_get_lseg(desc->pg_lseg);
+- atomic_inc(&hdr->refcnt);
+ ret = nfs_generic_pgio(desc, hdr);
+ if (ret != 0) {
+ pnfs_put_lseg(desc->pg_lseg);
+ desc->pg_lseg = NULL;
+ } else
+ pnfs_do_read(desc, hdr);
+- if (atomic_dec_and_test(&hdr->refcnt))
+- hdr->completion_ops->completion(hdr);
+ return ret;
+ }
+ EXPORT_SYMBOL_GPL(pnfs_generic_pg_readpages);
+@@ -1820,12 +1812,11 @@ void pnfs_set_lo_fail(struct pnfs_layout_segment *lseg)
+ EXPORT_SYMBOL_GPL(pnfs_set_lo_fail);
+
+ void
+-pnfs_set_layoutcommit(struct nfs_pgio_data *wdata)
++pnfs_set_layoutcommit(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = wdata->header;
+ struct inode *inode = hdr->inode;
+ struct nfs_inode *nfsi = NFS_I(inode);
+- loff_t end_pos = wdata->mds_offset + wdata->res.count;
++ loff_t end_pos = hdr->mds_offset + hdr->res.count;
+ bool mark_as_dirty = false;
+
+ spin_lock(&inode->i_lock);
+diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
+index 4fb309a2b4c4..ae22a9ccc1b9 100644
+--- a/fs/nfs/pnfs.h
++++ b/fs/nfs/pnfs.h
+@@ -32,6 +32,7 @@
+
+ #include <linux/nfs_fs.h>
+ #include <linux/nfs_page.h>
++#include <linux/workqueue.h>
+
+ enum {
+ NFS_LSEG_VALID = 0, /* cleared when lseg is recalled/returned */
+@@ -46,6 +47,7 @@ struct pnfs_layout_segment {
+ atomic_t pls_refcount;
+ unsigned long pls_flags;
+ struct pnfs_layout_hdr *pls_layout;
++ struct work_struct pls_work;
+ };
+
+ enum pnfs_try_status {
+@@ -113,8 +115,8 @@ struct pnfs_layoutdriver_type {
+ * Return PNFS_ATTEMPTED to indicate the layout code has attempted
+ * I/O, else return PNFS_NOT_ATTEMPTED to fall back to normal NFS
+ */
+- enum pnfs_try_status (*read_pagelist) (struct nfs_pgio_data *nfs_data);
+- enum pnfs_try_status (*write_pagelist) (struct nfs_pgio_data *nfs_data, int how);
++ enum pnfs_try_status (*read_pagelist)(struct nfs_pgio_header *);
++ enum pnfs_try_status (*write_pagelist)(struct nfs_pgio_header *, int);
+
+ void (*free_deviceid_node) (struct nfs4_deviceid_node *);
+
+@@ -179,6 +181,7 @@ extern int nfs4_proc_layoutreturn(struct nfs4_layoutreturn *lrp);
+ /* pnfs.c */
+ void pnfs_get_layout_hdr(struct pnfs_layout_hdr *lo);
+ void pnfs_put_lseg(struct pnfs_layout_segment *lseg);
++void pnfs_put_lseg_async(struct pnfs_layout_segment *lseg);
+
+ void set_pnfs_layoutdriver(struct nfs_server *, const struct nfs_fh *, u32);
+ void unset_pnfs_layoutdriver(struct nfs_server *);
+@@ -213,13 +216,13 @@ bool pnfs_roc(struct inode *ino);
+ void pnfs_roc_release(struct inode *ino);
+ void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
+ bool pnfs_roc_drain(struct inode *ino, u32 *barrier, struct rpc_task *task);
+-void pnfs_set_layoutcommit(struct nfs_pgio_data *wdata);
++void pnfs_set_layoutcommit(struct nfs_pgio_header *);
+ void pnfs_cleanup_layoutcommit(struct nfs4_layoutcommit_data *data);
+ int pnfs_layoutcommit_inode(struct inode *inode, bool sync);
+ int _pnfs_return_layout(struct inode *);
+ int pnfs_commit_and_return_layout(struct inode *);
+-void pnfs_ld_write_done(struct nfs_pgio_data *);
+-void pnfs_ld_read_done(struct nfs_pgio_data *);
++void pnfs_ld_write_done(struct nfs_pgio_header *);
++void pnfs_ld_read_done(struct nfs_pgio_header *);
+ struct pnfs_layout_segment *pnfs_update_layout(struct inode *ino,
+ struct nfs_open_context *ctx,
+ loff_t pos,
+@@ -410,6 +413,10 @@ static inline void pnfs_put_lseg(struct pnfs_layout_segment *lseg)
+ {
+ }
+
++static inline void pnfs_put_lseg_async(struct pnfs_layout_segment *lseg)
++{
++}
++
+ static inline int pnfs_return_layout(struct inode *ino)
+ {
+ return 0;
+diff --git a/fs/nfs/proc.c b/fs/nfs/proc.c
+index c171ce1a8a30..b09cc23d6f43 100644
+--- a/fs/nfs/proc.c
++++ b/fs/nfs/proc.c
+@@ -578,46 +578,49 @@ nfs_proc_pathconf(struct nfs_server *server, struct nfs_fh *fhandle,
+ return 0;
+ }
+
+-static int nfs_read_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs_read_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ nfs_invalidate_atime(inode);
+ if (task->tk_status >= 0) {
+- nfs_refresh_inode(inode, data->res.fattr);
++ nfs_refresh_inode(inode, hdr->res.fattr);
+ /* Emulate the eof flag, which isn't normally needed in NFSv2
+ * as it is guaranteed to always return the file attributes
+ */
+- if (data->args.offset + data->res.count >= data->res.fattr->size)
+- data->res.eof = 1;
++ if (hdr->args.offset + hdr->res.count >= hdr->res.fattr->size)
++ hdr->res.eof = 1;
+ }
+ return 0;
+ }
+
+-static void nfs_proc_read_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs_proc_read_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ msg->rpc_proc = &nfs_procedures[NFSPROC_READ];
+ }
+
+-static int nfs_proc_pgio_rpc_prepare(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs_proc_pgio_rpc_prepare(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+ rpc_call_start(task);
+ return 0;
+ }
+
+-static int nfs_write_done(struct rpc_task *task, struct nfs_pgio_data *data)
++static int nfs_write_done(struct rpc_task *task, struct nfs_pgio_header *hdr)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+
+ if (task->tk_status >= 0)
+- nfs_post_op_update_inode_force_wcc(inode, data->res.fattr);
++ nfs_post_op_update_inode_force_wcc(inode, hdr->res.fattr);
+ return 0;
+ }
+
+-static void nfs_proc_write_setup(struct nfs_pgio_data *data, struct rpc_message *msg)
++static void nfs_proc_write_setup(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg)
+ {
+ /* Note: NFSv2 ignores @stable and always uses NFS_FILE_SYNC */
+- data->args.stable = NFS_FILE_SYNC;
++ hdr->args.stable = NFS_FILE_SYNC;
+ msg->rpc_proc = &nfs_procedures[NFSPROC_WRITE];
+ }
+
+diff --git a/fs/nfs/read.c b/fs/nfs/read.c
+index e818a475ca64..b1532b73fea3 100644
+--- a/fs/nfs/read.c
++++ b/fs/nfs/read.c
+@@ -33,12 +33,12 @@ static const struct nfs_rw_ops nfs_rw_read_ops;
+
+ static struct kmem_cache *nfs_rdata_cachep;
+
+-static struct nfs_rw_header *nfs_readhdr_alloc(void)
++static struct nfs_pgio_header *nfs_readhdr_alloc(void)
+ {
+ return kmem_cache_zalloc(nfs_rdata_cachep, GFP_KERNEL);
+ }
+
+-static void nfs_readhdr_free(struct nfs_rw_header *rhdr)
++static void nfs_readhdr_free(struct nfs_pgio_header *rhdr)
+ {
+ kmem_cache_free(nfs_rdata_cachep, rhdr);
+ }
+@@ -172,14 +172,15 @@ out:
+ hdr->release(hdr);
+ }
+
+-static void nfs_initiate_read(struct nfs_pgio_data *data, struct rpc_message *msg,
++static void nfs_initiate_read(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg,
+ struct rpc_task_setup *task_setup_data, int how)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+ int swap_flags = IS_SWAPFILE(inode) ? NFS_RPC_SWAPFLAGS : 0;
+
+ task_setup_data->flags |= swap_flags;
+- NFS_PROTO(inode)->read_setup(data, msg);
++ NFS_PROTO(inode)->read_setup(hdr, msg);
+ }
+
+ static void
+@@ -203,14 +204,15 @@ static const struct nfs_pgio_completion_ops nfs_async_read_completion_ops = {
+ * This is the callback from RPC telling us whether a reply was
+ * received or some error occurred (timeout or socket shutdown).
+ */
+-static int nfs_readpage_done(struct rpc_task *task, struct nfs_pgio_data *data,
++static int nfs_readpage_done(struct rpc_task *task,
++ struct nfs_pgio_header *hdr,
+ struct inode *inode)
+ {
+- int status = NFS_PROTO(inode)->read_done(task, data);
++ int status = NFS_PROTO(inode)->read_done(task, hdr);
+ if (status != 0)
+ return status;
+
+- nfs_add_stats(inode, NFSIOS_SERVERREADBYTES, data->res.count);
++ nfs_add_stats(inode, NFSIOS_SERVERREADBYTES, hdr->res.count);
+
+ if (task->tk_status == -ESTALE) {
+ set_bit(NFS_INO_STALE, &NFS_I(inode)->flags);
+@@ -219,34 +221,34 @@ static int nfs_readpage_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ return 0;
+ }
+
+-static void nfs_readpage_retry(struct rpc_task *task, struct nfs_pgio_data *data)
++static void nfs_readpage_retry(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_args *argp = &data->args;
+- struct nfs_pgio_res *resp = &data->res;
++ struct nfs_pgio_args *argp = &hdr->args;
++ struct nfs_pgio_res *resp = &hdr->res;
+
+ /* This is a short read! */
+- nfs_inc_stats(data->header->inode, NFSIOS_SHORTREAD);
++ nfs_inc_stats(hdr->inode, NFSIOS_SHORTREAD);
+ /* Has the server at least made some progress? */
+ if (resp->count == 0) {
+- nfs_set_pgio_error(data->header, -EIO, argp->offset);
++ nfs_set_pgio_error(hdr, -EIO, argp->offset);
+ return;
+ }
+- /* Yes, so retry the read at the end of the data */
+- data->mds_offset += resp->count;
++ /* Yes, so retry the read at the end of the data */
++ hdr->mds_offset += resp->count;
+ argp->offset += resp->count;
+ argp->pgbase += resp->count;
+ argp->count -= resp->count;
+ rpc_restart_call_prepare(task);
+ }
+
+-static void nfs_readpage_result(struct rpc_task *task, struct nfs_pgio_data *data)
++static void nfs_readpage_result(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+-
+- if (data->res.eof) {
++ if (hdr->res.eof) {
+ loff_t bound;
+
+- bound = data->args.offset + data->res.count;
++ bound = hdr->args.offset + hdr->res.count;
+ spin_lock(&hdr->lock);
+ if (bound < hdr->io_start + hdr->good_bytes) {
+ set_bit(NFS_IOHDR_EOF, &hdr->flags);
+@@ -254,8 +256,8 @@ static void nfs_readpage_result(struct rpc_task *task, struct nfs_pgio_data *dat
+ hdr->good_bytes = bound - hdr->io_start;
+ }
+ spin_unlock(&hdr->lock);
+- } else if (data->res.count != data->args.count)
+- nfs_readpage_retry(task, data);
++ } else if (hdr->res.count != hdr->args.count)
++ nfs_readpage_retry(task, hdr);
+ }
+
+ /*
+@@ -404,7 +406,7 @@ out:
+ int __init nfs_init_readpagecache(void)
+ {
+ nfs_rdata_cachep = kmem_cache_create("nfs_read_data",
+- sizeof(struct nfs_rw_header),
++ sizeof(struct nfs_pgio_header),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL);
+ if (nfs_rdata_cachep == NULL)
+diff --git a/fs/nfs/write.c b/fs/nfs/write.c
+index 5e2f10304548..ecb0f9fd5632 100644
+--- a/fs/nfs/write.c
++++ b/fs/nfs/write.c
+@@ -71,18 +71,18 @@ void nfs_commit_free(struct nfs_commit_data *p)
+ }
+ EXPORT_SYMBOL_GPL(nfs_commit_free);
+
+-static struct nfs_rw_header *nfs_writehdr_alloc(void)
++static struct nfs_pgio_header *nfs_writehdr_alloc(void)
+ {
+- struct nfs_rw_header *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO);
++ struct nfs_pgio_header *p = mempool_alloc(nfs_wdata_mempool, GFP_NOIO);
+
+ if (p)
+ memset(p, 0, sizeof(*p));
+ return p;
+ }
+
+-static void nfs_writehdr_free(struct nfs_rw_header *whdr)
++static void nfs_writehdr_free(struct nfs_pgio_header *hdr)
+ {
+- mempool_free(whdr, nfs_wdata_mempool);
++ mempool_free(hdr, nfs_wdata_mempool);
+ }
+
+ static void nfs_context_set_write_error(struct nfs_open_context *ctx, int error)
+@@ -216,7 +216,7 @@ static bool nfs_page_group_covers_page(struct nfs_page *req)
+ unsigned int pos = 0;
+ unsigned int len = nfs_page_length(req->wb_page);
+
+- nfs_page_group_lock(req);
++ nfs_page_group_lock(req, false);
+
+ do {
+ tmp = nfs_page_group_search_locked(req->wb_head, pos);
+@@ -379,8 +379,6 @@ nfs_destroy_unlinked_subrequests(struct nfs_page *destroy_list,
+ subreq->wb_head = subreq;
+ subreq->wb_this_page = subreq;
+
+- nfs_clear_request_commit(subreq);
+-
+ /* subreq is now totally disconnected from page group or any
+ * write / commit lists. last chance to wake any waiters */
+ nfs_unlock_request(subreq);
+@@ -455,8 +453,23 @@ try_again:
+ return NULL;
+ }
+
++ /* holding inode lock, so always make a non-blocking call to try the
++ * page group lock */
++ ret = nfs_page_group_lock(head, true);
++ if (ret < 0) {
++ spin_unlock(&inode->i_lock);
++
++ if (!nonblock && ret == -EAGAIN) {
++ nfs_page_group_lock_wait(head);
++ nfs_release_request(head);
++ goto try_again;
++ }
++
++ nfs_release_request(head);
++ return ERR_PTR(ret);
++ }
++
+ /* lock each request in the page group */
+- nfs_page_group_lock(head);
+ subreq = head;
+ do {
+ /*
+@@ -488,7 +501,7 @@ try_again:
+ * Commit list removal accounting is done after locks are dropped */
+ subreq = head;
+ do {
+- nfs_list_remove_request(subreq);
++ nfs_clear_request_commit(subreq);
+ subreq = subreq->wb_this_page;
+ } while (subreq != head);
+
+@@ -518,15 +531,11 @@ try_again:
+
+ nfs_page_group_unlock(head);
+
+- /* drop lock to clear_request_commit the head req and clean up
+- * requests on destroy list */
++ /* drop lock to clean up requests on destroy list */
+ spin_unlock(&inode->i_lock);
+
+ nfs_destroy_unlinked_subrequests(destroy_list, head);
+
+- /* clean up commit list state */
+- nfs_clear_request_commit(head);
+-
+ /* still holds ref on head from nfs_page_find_head_request_locked
+ * and still has lock on head from lock loop */
+ return head;
+@@ -808,6 +817,7 @@ nfs_clear_page_commit(struct page *page)
+ dec_bdi_stat(page_file_mapping(page)->backing_dev_info, BDI_RECLAIMABLE);
+ }
+
++/* Called holding inode (/cinfo) lock */
+ static void
+ nfs_clear_request_commit(struct nfs_page *req)
+ {
+@@ -817,20 +827,18 @@ nfs_clear_request_commit(struct nfs_page *req)
+
+ nfs_init_cinfo_from_inode(&cinfo, inode);
+ if (!pnfs_clear_request_commit(req, &cinfo)) {
+- spin_lock(cinfo.lock);
+ nfs_request_remove_commit_list(req, &cinfo);
+- spin_unlock(cinfo.lock);
+ }
+ nfs_clear_page_commit(req->wb_page);
+ }
+ }
+
+ static inline
+-int nfs_write_need_commit(struct nfs_pgio_data *data)
++int nfs_write_need_commit(struct nfs_pgio_header *hdr)
+ {
+- if (data->verf.committed == NFS_DATA_SYNC)
+- return data->header->lseg == NULL;
+- return data->verf.committed != NFS_FILE_SYNC;
++ if (hdr->writeverf.committed == NFS_DATA_SYNC)
++ return hdr->lseg == NULL;
++ return hdr->writeverf.committed != NFS_FILE_SYNC;
+ }
+
+ #else
+@@ -857,7 +865,7 @@ nfs_clear_request_commit(struct nfs_page *req)
+ }
+
+ static inline
+-int nfs_write_need_commit(struct nfs_pgio_data *data)
++int nfs_write_need_commit(struct nfs_pgio_header *hdr)
+ {
+ return 0;
+ }
+@@ -1038,9 +1046,9 @@ static struct nfs_page *nfs_try_to_update_request(struct inode *inode,
+ else
+ req->wb_bytes = rqend - req->wb_offset;
+ out_unlock:
+- spin_unlock(&inode->i_lock);
+ if (req)
+ nfs_clear_request_commit(req);
++ spin_unlock(&inode->i_lock);
+ return req;
+ out_flushme:
+ spin_unlock(&inode->i_lock);
+@@ -1241,17 +1249,18 @@ static int flush_task_priority(int how)
+ return RPC_PRIORITY_NORMAL;
+ }
+
+-static void nfs_initiate_write(struct nfs_pgio_data *data, struct rpc_message *msg,
++static void nfs_initiate_write(struct nfs_pgio_header *hdr,
++ struct rpc_message *msg,
+ struct rpc_task_setup *task_setup_data, int how)
+ {
+- struct inode *inode = data->header->inode;
++ struct inode *inode = hdr->inode;
+ int priority = flush_task_priority(how);
+
+ task_setup_data->priority = priority;
+- NFS_PROTO(inode)->write_setup(data, msg);
++ NFS_PROTO(inode)->write_setup(hdr, msg);
+
+ nfs4_state_protect_write(NFS_SERVER(inode)->nfs_client,
+- &task_setup_data->rpc_client, msg, data);
++ &task_setup_data->rpc_client, msg, hdr);
+ }
+
+ /* If a nfs_flush_* function fails, it should remove reqs from @head and
+@@ -1313,18 +1322,17 @@ void nfs_commit_prepare(struct rpc_task *task, void *calldata)
+ NFS_PROTO(data->inode)->commit_rpc_prepare(task, data);
+ }
+
+-static void nfs_writeback_release_common(struct nfs_pgio_data *data)
++static void nfs_writeback_release_common(struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_header *hdr = data->header;
+- int status = data->task.tk_status;
++ int status = hdr->task.tk_status;
+
+- if ((status >= 0) && nfs_write_need_commit(data)) {
++ if ((status >= 0) && nfs_write_need_commit(hdr)) {
+ spin_lock(&hdr->lock);
+ if (test_bit(NFS_IOHDR_NEED_RESCHED, &hdr->flags))
+ ; /* Do nothing */
+ else if (!test_and_set_bit(NFS_IOHDR_NEED_COMMIT, &hdr->flags))
+- memcpy(&hdr->verf, &data->verf, sizeof(hdr->verf));
+- else if (memcmp(&hdr->verf, &data->verf, sizeof(hdr->verf)))
++ memcpy(&hdr->verf, &hdr->writeverf, sizeof(hdr->verf));
++ else if (memcmp(&hdr->verf, &hdr->writeverf, sizeof(hdr->verf)))
+ set_bit(NFS_IOHDR_NEED_RESCHED, &hdr->flags);
+ spin_unlock(&hdr->lock);
+ }
+@@ -1358,7 +1366,8 @@ static int nfs_should_remove_suid(const struct inode *inode)
+ /*
+ * This function is called when the WRITE call is complete.
+ */
+-static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
++static int nfs_writeback_done(struct rpc_task *task,
++ struct nfs_pgio_header *hdr,
+ struct inode *inode)
+ {
+ int status;
+@@ -1370,13 +1379,14 @@ static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ * another writer had changed the file, but some applications
+ * depend on tighter cache coherency when writing.
+ */
+- status = NFS_PROTO(inode)->write_done(task, data);
++ status = NFS_PROTO(inode)->write_done(task, hdr);
+ if (status != 0)
+ return status;
+- nfs_add_stats(inode, NFSIOS_SERVERWRITTENBYTES, data->res.count);
++ nfs_add_stats(inode, NFSIOS_SERVERWRITTENBYTES, hdr->res.count);
+
+ #if IS_ENABLED(CONFIG_NFS_V3) || IS_ENABLED(CONFIG_NFS_V4)
+- if (data->res.verf->committed < data->args.stable && task->tk_status >= 0) {
++ if (hdr->res.verf->committed < hdr->args.stable &&
++ task->tk_status >= 0) {
+ /* We tried a write call, but the server did not
+ * commit data to stable storage even though we
+ * requested it.
+@@ -1392,7 +1402,7 @@ static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ dprintk("NFS: faulty NFS server %s:"
+ " (committed = %d) != (stable = %d)\n",
+ NFS_SERVER(inode)->nfs_client->cl_hostname,
+- data->res.verf->committed, data->args.stable);
++ hdr->res.verf->committed, hdr->args.stable);
+ complain = jiffies + 300 * HZ;
+ }
+ }
+@@ -1407,16 +1417,17 @@ static int nfs_writeback_done(struct rpc_task *task, struct nfs_pgio_data *data,
+ /*
+ * This function is called when the WRITE call is complete.
+ */
+-static void nfs_writeback_result(struct rpc_task *task, struct nfs_pgio_data *data)
++static void nfs_writeback_result(struct rpc_task *task,
++ struct nfs_pgio_header *hdr)
+ {
+- struct nfs_pgio_args *argp = &data->args;
+- struct nfs_pgio_res *resp = &data->res;
++ struct nfs_pgio_args *argp = &hdr->args;
++ struct nfs_pgio_res *resp = &hdr->res;
+
+ if (resp->count < argp->count) {
+ static unsigned long complain;
+
+ /* This a short write! */
+- nfs_inc_stats(data->header->inode, NFSIOS_SHORTWRITE);
++ nfs_inc_stats(hdr->inode, NFSIOS_SHORTWRITE);
+
+ /* Has the server at least made some progress? */
+ if (resp->count == 0) {
+@@ -1426,14 +1437,14 @@ static void nfs_writeback_result(struct rpc_task *task, struct nfs_pgio_data *da
+ argp->count);
+ complain = jiffies + 300 * HZ;
+ }
+- nfs_set_pgio_error(data->header, -EIO, argp->offset);
++ nfs_set_pgio_error(hdr, -EIO, argp->offset);
+ task->tk_status = -EIO;
+ return;
+ }
+ /* Was this an NFSv2 write or an NFSv3 stable write? */
+ if (resp->verf->committed != NFS_UNSTABLE) {
+ /* Resend from where the server left off */
+- data->mds_offset += resp->count;
++ hdr->mds_offset += resp->count;
+ argp->offset += resp->count;
+ argp->pgbase += resp->count;
+ argp->count -= resp->count;
+@@ -1884,7 +1895,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
+ int __init nfs_init_writepagecache(void)
+ {
+ nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
+- sizeof(struct nfs_rw_header),
++ sizeof(struct nfs_pgio_header),
+ 0, SLAB_HWCACHE_ALIGN,
+ NULL);
+ if (nfs_wdata_cachep == NULL)
+diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
+index 944275c8f56d..1d5103dfc203 100644
+--- a/fs/nfsd/nfs4xdr.c
++++ b/fs/nfsd/nfs4xdr.c
+@@ -2662,6 +2662,7 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
+ struct xdr_stream *xdr = cd->xdr;
+ int start_offset = xdr->buf->len;
+ int cookie_offset;
++ u32 name_and_cookie;
+ int entry_bytes;
+ __be32 nfserr = nfserr_toosmall;
+ __be64 wire_offset;
+@@ -2723,7 +2724,14 @@ nfsd4_encode_dirent(void *ccdv, const char *name, int namlen,
+ cd->rd_maxcount -= entry_bytes;
+ if (!cd->rd_dircount)
+ goto fail;
+- cd->rd_dircount--;
++ /*
++ * RFC 3530 14.2.24 describes rd_dircount as only a "hint", so
++ * let's always let through the first entry, at least:
++ */
++ name_and_cookie = 4 * XDR_QUADLEN(namlen) + 8;
++ if (name_and_cookie > cd->rd_dircount && cd->cookie_offset)
++ goto fail;
++ cd->rd_dircount -= min(cd->rd_dircount, name_and_cookie);
+ cd->cookie_offset = cookie_offset;
+ skip_entry:
+ cd->common.err = nfs_ok;
+@@ -3104,7 +3112,8 @@ static __be32 nfsd4_encode_splice_read(
+
+ buf->page_len = maxcount;
+ buf->len += maxcount;
+- xdr->page_ptr += (maxcount + PAGE_SIZE - 1) / PAGE_SIZE;
++ xdr->page_ptr += (buf->page_base + maxcount + PAGE_SIZE - 1)
++ / PAGE_SIZE;
+
+ /* Use rest of head for padding and remaining ops: */
+ buf->tail[0].iov_base = xdr->p;
+@@ -3333,6 +3342,10 @@ nfsd4_encode_readdir(struct nfsd4_compoundres *resp, __be32 nfserr, struct nfsd4
+ }
+ maxcount = min_t(int, maxcount-16, bytes_left);
+
++ /* RFC 3530 14.2.24 allows us to ignore dircount when it's 0: */
++ if (!readdir->rd_dircount)
++ readdir->rd_dircount = INT_MAX;
++
+ readdir->xdr = xdr;
+ readdir->rd_maxcount = maxcount;
+ readdir->common.err = 0;
+diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
+index 6252b173a465..d071e7f23de2 100644
+--- a/fs/nilfs2/inode.c
++++ b/fs/nilfs2/inode.c
+@@ -24,6 +24,7 @@
+ #include <linux/buffer_head.h>
+ #include <linux/gfp.h>
+ #include <linux/mpage.h>
++#include <linux/pagemap.h>
+ #include <linux/writeback.h>
+ #include <linux/aio.h>
+ #include "nilfs.h"
+@@ -219,10 +220,10 @@ static int nilfs_writepage(struct page *page, struct writeback_control *wbc)
+
+ static int nilfs_set_page_dirty(struct page *page)
+ {
++ struct inode *inode = page->mapping->host;
+ int ret = __set_page_dirty_nobuffers(page);
+
+ if (page_has_buffers(page)) {
+- struct inode *inode = page->mapping->host;
+ unsigned nr_dirty = 0;
+ struct buffer_head *bh, *head;
+
+@@ -245,6 +246,10 @@ static int nilfs_set_page_dirty(struct page *page)
+
+ if (nr_dirty)
+ nilfs_set_file_dirty(inode, nr_dirty);
++ } else if (ret) {
++ unsigned nr_dirty = 1 << (PAGE_CACHE_SHIFT - inode->i_blkbits);
++
++ nilfs_set_file_dirty(inode, nr_dirty);
+ }
+ return ret;
+ }
+diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
+index 238a5930cb3c..9d7e2b9659cb 100644
+--- a/fs/notify/fdinfo.c
++++ b/fs/notify/fdinfo.c
+@@ -42,7 +42,7 @@ static int show_mark_fhandle(struct seq_file *m, struct inode *inode)
+ {
+ struct {
+ struct file_handle handle;
+- u8 pad[64];
++ u8 pad[MAX_HANDLE_SZ];
+ } f;
+ int size, ret, i;
+
+@@ -50,7 +50,7 @@ static int show_mark_fhandle(struct seq_file *m, struct inode *inode)
+ size = f.handle.handle_bytes >> 2;
+
+ ret = exportfs_encode_inode_fh(inode, (struct fid *)f.handle.f_handle, &size, 0);
+- if ((ret == 255) || (ret == -ENOSPC)) {
++ if ((ret == FILEID_INVALID) || (ret < 0)) {
+ WARN_ONCE(1, "Can't encode file handler for inotify: %d\n", ret);
+ return 0;
+ }
+diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
+index 82abf0cc9a12..9d405d6d2504 100644
+--- a/fs/ocfs2/dlm/dlmmaster.c
++++ b/fs/ocfs2/dlm/dlmmaster.c
+@@ -655,12 +655,9 @@ void dlm_lockres_clear_refmap_bit(struct dlm_ctxt *dlm,
+ clear_bit(bit, res->refmap);
+ }
+
+-
+-void dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
++static void __dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
+ struct dlm_lock_resource *res)
+ {
+- assert_spin_locked(&res->spinlock);
+-
+ res->inflight_locks++;
+
+ mlog(0, "%s: res %.*s, inflight++: now %u, %ps()\n", dlm->name,
+@@ -668,6 +665,13 @@ void dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
+ __builtin_return_address(0));
+ }
+
++void dlm_lockres_grab_inflight_ref(struct dlm_ctxt *dlm,
++ struct dlm_lock_resource *res)
++{
++ assert_spin_locked(&res->spinlock);
++ __dlm_lockres_grab_inflight_ref(dlm, res);
++}
++
+ void dlm_lockres_drop_inflight_ref(struct dlm_ctxt *dlm,
+ struct dlm_lock_resource *res)
+ {
+@@ -894,10 +898,8 @@ lookup:
+ /* finally add the lockres to its hash bucket */
+ __dlm_insert_lockres(dlm, res);
+
+- /* Grab inflight ref to pin the resource */
+- spin_lock(&res->spinlock);
+- dlm_lockres_grab_inflight_ref(dlm, res);
+- spin_unlock(&res->spinlock);
++ /* since this lockres is new it does not require the spinlock */
++ __dlm_lockres_grab_inflight_ref(dlm, res);
+
+ /* get an extra ref on the mle in case this is a BLOCK
+ * if so, the creator of the BLOCK may try to put the last
+diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
+index 61e8a9b021dd..42234a871b22 100644
+--- a/fs/ufs/inode.c
++++ b/fs/ufs/inode.c
+@@ -902,9 +902,6 @@ void ufs_evict_inode(struct inode * inode)
+ invalidate_inode_buffers(inode);
+ clear_inode(inode);
+
+- if (want_delete) {
+- lock_ufs(inode->i_sb);
+- ufs_free_inode (inode);
+- unlock_ufs(inode->i_sb);
+- }
++ if (want_delete)
++ ufs_free_inode(inode);
+ }
+diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
+index 90d74b8f8eba..2df62a73f20c 100644
+--- a/fs/ufs/namei.c
++++ b/fs/ufs/namei.c
+@@ -126,12 +126,12 @@ static int ufs_symlink (struct inode * dir, struct dentry * dentry,
+ if (l > sb->s_blocksize)
+ goto out_notlocked;
+
+- lock_ufs(dir->i_sb);
+ inode = ufs_new_inode(dir, S_IFLNK | S_IRWXUGO);
+ err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+- goto out;
++ goto out_notlocked;
+
++ lock_ufs(dir->i_sb);
+ if (l > UFS_SB(sb)->s_uspi->s_maxsymlinklen) {
+ /* slow symlink */
+ inode->i_op = &ufs_symlink_inode_operations;
+@@ -181,13 +181,9 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
+ struct inode * inode;
+ int err;
+
+- lock_ufs(dir->i_sb);
+- inode_inc_link_count(dir);
+-
+ inode = ufs_new_inode(dir, S_IFDIR|mode);
+- err = PTR_ERR(inode);
+ if (IS_ERR(inode))
+- goto out_dir;
++ return PTR_ERR(inode);
+
+ inode->i_op = &ufs_dir_inode_operations;
+ inode->i_fop = &ufs_dir_operations;
+@@ -195,6 +191,9 @@ static int ufs_mkdir(struct inode * dir, struct dentry * dentry, umode_t mode)
+
+ inode_inc_link_count(inode);
+
++ lock_ufs(dir->i_sb);
++ inode_inc_link_count(dir);
++
+ err = ufs_make_empty(inode, dir);
+ if (err)
+ goto out_fail;
+@@ -212,7 +211,6 @@ out_fail:
+ inode_dec_link_count(inode);
+ inode_dec_link_count(inode);
+ iput (inode);
+-out_dir:
+ inode_dec_link_count(dir);
+ unlock_ufs(dir->i_sb);
+ goto out;
+diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
+index 0826a4407e8e..d07aa9b7fb99 100644
+--- a/include/acpi/acpi_bus.h
++++ b/include/acpi/acpi_bus.h
+@@ -118,6 +118,7 @@ struct acpi_device;
+ struct acpi_hotplug_profile {
+ struct kobject kobj;
+ int (*scan_dependent)(struct acpi_device *adev);
++ void (*notify_online)(struct acpi_device *adev);
+ bool enabled:1;
+ bool demand_offline:1;
+ };
+diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
+index a5183da3ef92..f2fcd3ed5676 100644
+--- a/include/drm/ttm/ttm_bo_driver.h
++++ b/include/drm/ttm/ttm_bo_driver.h
+@@ -182,6 +182,7 @@ struct ttm_mem_type_manager_func {
+ * @man: Pointer to a memory type manager.
+ * @bo: Pointer to the buffer object we're allocating space for.
+ * @placement: Placement details.
++ * @flags: Additional placement flags.
+ * @mem: Pointer to a struct ttm_mem_reg to be filled in.
+ *
+ * This function should allocate space in the memory type managed
+@@ -206,6 +207,7 @@ struct ttm_mem_type_manager_func {
+ int (*get_node)(struct ttm_mem_type_manager *man,
+ struct ttm_buffer_object *bo,
+ struct ttm_placement *placement,
++ uint32_t flags,
+ struct ttm_mem_reg *mem);
+
+ /**
+diff --git a/include/linux/ccp.h b/include/linux/ccp.h
+index ebcc9d146219..7f437036baa4 100644
+--- a/include/linux/ccp.h
++++ b/include/linux/ccp.h
+@@ -27,6 +27,13 @@ struct ccp_cmd;
+ defined(CONFIG_CRYPTO_DEV_CCP_DD_MODULE)
+
+ /**
++ * ccp_present - check if a CCP device is present
++ *
++ * Returns zero if a CCP device is present, -ENODEV otherwise.
++ */
++int ccp_present(void);
++
++/**
+ * ccp_enqueue_cmd - queue an operation for processing by the CCP
+ *
+ * @cmd: ccp_cmd struct to be processed
+@@ -53,6 +60,11 @@ int ccp_enqueue_cmd(struct ccp_cmd *cmd);
+
+ #else /* CONFIG_CRYPTO_DEV_CCP_DD is not enabled */
+
++static inline int ccp_present(void)
++{
++ return -ENODEV;
++}
++
+ static inline int ccp_enqueue_cmd(struct ccp_cmd *cmd)
+ {
+ return -ENODEV;
+diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
+index 404a686a3644..721de254ba7a 100644
+--- a/include/linux/ftrace.h
++++ b/include/linux/ftrace.h
+@@ -103,6 +103,15 @@ enum {
+ FTRACE_OPS_FL_DELETED = 1 << 8,
+ };
+
++#ifdef CONFIG_DYNAMIC_FTRACE
++/* The hash used to know what functions callbacks trace */
++struct ftrace_ops_hash {
++ struct ftrace_hash *notrace_hash;
++ struct ftrace_hash *filter_hash;
++ struct mutex regex_lock;
++};
++#endif
++
+ /*
+ * Note, ftrace_ops can be referenced outside of RCU protection.
+ * (Although, for perf, the control ops prevent that). If ftrace_ops is
+@@ -121,8 +130,8 @@ struct ftrace_ops {
+ int __percpu *disabled;
+ void *private;
+ #ifdef CONFIG_DYNAMIC_FTRACE
+- struct ftrace_hash *notrace_hash;
+- struct ftrace_hash *filter_hash;
++ struct ftrace_ops_hash local_hash;
++ struct ftrace_ops_hash *func_hash;
+ struct mutex regex_lock;
+ #endif
+ };
+diff --git a/include/linux/iio/trigger.h b/include/linux/iio/trigger.h
+index 369cf2cd5144..68f46cd5d514 100644
+--- a/include/linux/iio/trigger.h
++++ b/include/linux/iio/trigger.h
+@@ -84,10 +84,12 @@ static inline void iio_trigger_put(struct iio_trigger *trig)
+ put_device(&trig->dev);
+ }
+
+-static inline void iio_trigger_get(struct iio_trigger *trig)
++static inline struct iio_trigger *iio_trigger_get(struct iio_trigger *trig)
+ {
+ get_device(&trig->dev);
+ __module_get(trig->ops->owner);
++
++ return trig;
+ }
+
+ /**
+diff --git a/include/linux/nfs_page.h b/include/linux/nfs_page.h
+index 7d9096d95d4a..55a486421fdd 100644
+--- a/include/linux/nfs_page.h
++++ b/include/linux/nfs_page.h
+@@ -62,12 +62,13 @@ struct nfs_pageio_ops {
+
+ struct nfs_rw_ops {
+ const fmode_t rw_mode;
+- struct nfs_rw_header *(*rw_alloc_header)(void);
+- void (*rw_free_header)(struct nfs_rw_header *);
+- void (*rw_release)(struct nfs_pgio_data *);
+- int (*rw_done)(struct rpc_task *, struct nfs_pgio_data *, struct inode *);
+- void (*rw_result)(struct rpc_task *, struct nfs_pgio_data *);
+- void (*rw_initiate)(struct nfs_pgio_data *, struct rpc_message *,
++ struct nfs_pgio_header *(*rw_alloc_header)(void);
++ void (*rw_free_header)(struct nfs_pgio_header *);
++ void (*rw_release)(struct nfs_pgio_header *);
++ int (*rw_done)(struct rpc_task *, struct nfs_pgio_header *,
++ struct inode *);
++ void (*rw_result)(struct rpc_task *, struct nfs_pgio_header *);
++ void (*rw_initiate)(struct nfs_pgio_header *, struct rpc_message *,
+ struct rpc_task_setup *, int);
+ };
+
+@@ -119,7 +120,8 @@ extern size_t nfs_generic_pg_test(struct nfs_pageio_descriptor *desc,
+ extern int nfs_wait_on_request(struct nfs_page *);
+ extern void nfs_unlock_request(struct nfs_page *req);
+ extern void nfs_unlock_and_release_request(struct nfs_page *);
+-extern void nfs_page_group_lock(struct nfs_page *);
++extern int nfs_page_group_lock(struct nfs_page *, bool);
++extern void nfs_page_group_lock_wait(struct nfs_page *);
+ extern void nfs_page_group_unlock(struct nfs_page *);
+ extern bool nfs_page_group_sync_on_bit(struct nfs_page *, unsigned int);
+
+diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
+index 9a1396e70310..2c35d524ffc6 100644
+--- a/include/linux/nfs_xdr.h
++++ b/include/linux/nfs_xdr.h
+@@ -1257,14 +1257,10 @@ enum {
+ NFS_IOHDR_NEED_RESCHED,
+ };
+
+-struct nfs_pgio_data;
+-
+ struct nfs_pgio_header {
+ struct inode *inode;
+ struct rpc_cred *cred;
+ struct list_head pages;
+- struct nfs_pgio_data *data;
+- atomic_t refcnt;
+ struct nfs_page *req;
+ struct nfs_writeverf verf; /* Used for writes */
+ struct pnfs_layout_segment *lseg;
+@@ -1281,28 +1277,23 @@ struct nfs_pgio_header {
+ int error; /* merge with pnfs_error */
+ unsigned long good_bytes; /* boundary of good data */
+ unsigned long flags;
+-};
+
+-struct nfs_pgio_data {
+- struct nfs_pgio_header *header;
++ /*
++ * rpc data
++ */
+ struct rpc_task task;
+ struct nfs_fattr fattr;
+- struct nfs_writeverf verf; /* Used for writes */
++ struct nfs_writeverf writeverf; /* Used for writes */
+ struct nfs_pgio_args args; /* argument struct */
+ struct nfs_pgio_res res; /* result struct */
+ unsigned long timestamp; /* For lease renewal */
+- int (*pgio_done_cb) (struct rpc_task *task, struct nfs_pgio_data *data);
++ int (*pgio_done_cb)(struct rpc_task *, struct nfs_pgio_header *);
+ __u64 mds_offset; /* Filelayout dense stripe */
+- struct nfs_page_array pages;
++ struct nfs_page_array page_array;
+ struct nfs_client *ds_clp; /* pNFS data server */
+ int ds_idx; /* ds index if ds_clp is set */
+ };
+
+-struct nfs_rw_header {
+- struct nfs_pgio_header header;
+- struct nfs_pgio_data rpc_data;
+-};
+-
+ struct nfs_mds_commit_info {
+ atomic_t rpcs_out;
+ unsigned long ncommit;
+@@ -1432,11 +1423,12 @@ struct nfs_rpc_ops {
+ struct nfs_pathconf *);
+ int (*set_capabilities)(struct nfs_server *, struct nfs_fh *);
+ int (*decode_dirent)(struct xdr_stream *, struct nfs_entry *, int);
+- int (*pgio_rpc_prepare)(struct rpc_task *, struct nfs_pgio_data *);
+- void (*read_setup) (struct nfs_pgio_data *, struct rpc_message *);
+- int (*read_done) (struct rpc_task *, struct nfs_pgio_data *);
+- void (*write_setup) (struct nfs_pgio_data *, struct rpc_message *);
+- int (*write_done) (struct rpc_task *, struct nfs_pgio_data *);
++ int (*pgio_rpc_prepare)(struct rpc_task *,
++ struct nfs_pgio_header *);
++ void (*read_setup)(struct nfs_pgio_header *, struct rpc_message *);
++ int (*read_done)(struct rpc_task *, struct nfs_pgio_header *);
++ void (*write_setup)(struct nfs_pgio_header *, struct rpc_message *);
++ int (*write_done)(struct rpc_task *, struct nfs_pgio_header *);
+ void (*commit_setup) (struct nfs_commit_data *, struct rpc_message *);
+ void (*commit_rpc_prepare)(struct rpc_task *, struct nfs_commit_data *);
+ int (*commit_done) (struct rpc_task *, struct nfs_commit_data *);
+diff --git a/include/linux/pci.h b/include/linux/pci.h
+index 466bcd111d85..97fe7ebf2e25 100644
+--- a/include/linux/pci.h
++++ b/include/linux/pci.h
+@@ -303,6 +303,7 @@ struct pci_dev {
+ D3cold, not set for devices
+ powered on/off by the
+ corresponding bridge */
++ unsigned int ignore_hotplug:1; /* Ignore hotplug events */
+ unsigned int d3_delay; /* D3->D0 transition time in ms */
+ unsigned int d3cold_delay; /* D3cold->D0 transition time in ms */
+
+@@ -1019,6 +1020,11 @@ bool pci_dev_run_wake(struct pci_dev *dev);
+ bool pci_check_pme_status(struct pci_dev *dev);
+ void pci_pme_wakeup_bus(struct pci_bus *bus);
+
++static inline void pci_ignore_hotplug(struct pci_dev *dev)
++{
++ dev->ignore_hotplug = 1;
++}
++
+ static inline int pci_enable_wake(struct pci_dev *dev, pci_power_t state,
+ bool enable)
+ {
+diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
+index 535f158977b9..8cf350325dc6 100644
+--- a/include/linux/seqlock.h
++++ b/include/linux/seqlock.h
+@@ -164,8 +164,6 @@ static inline unsigned read_seqcount_begin(const seqcount_t *s)
+ static inline unsigned raw_seqcount_begin(const seqcount_t *s)
+ {
+ unsigned ret = ACCESS_ONCE(s->sequence);
+-
+- seqcount_lockdep_reader_access(s);
+ smp_rmb();
+ return ret & ~1;
+ }
+diff --git a/include/linux/vga_switcheroo.h b/include/linux/vga_switcheroo.h
+index 502073a53dd3..b483abd34493 100644
+--- a/include/linux/vga_switcheroo.h
++++ b/include/linux/vga_switcheroo.h
+@@ -64,6 +64,7 @@ int vga_switcheroo_get_client_state(struct pci_dev *dev);
+ void vga_switcheroo_set_dynamic_switch(struct pci_dev *pdev, enum vga_switcheroo_state dynamic);
+
+ int vga_switcheroo_init_domain_pm_ops(struct device *dev, struct dev_pm_domain *domain);
++void vga_switcheroo_fini_domain_pm_ops(struct device *dev);
+ int vga_switcheroo_init_domain_pm_optimus_hdmi_audio(struct device *dev, struct dev_pm_domain *domain);
+ #else
+
+@@ -82,6 +83,7 @@ static inline int vga_switcheroo_get_client_state(struct pci_dev *dev) { return
+ static inline void vga_switcheroo_set_dynamic_switch(struct pci_dev *pdev, enum vga_switcheroo_state dynamic) {}
+
+ static inline int vga_switcheroo_init_domain_pm_ops(struct device *dev, struct dev_pm_domain *domain) { return -EINVAL; }
++static inline void vga_switcheroo_fini_domain_pm_ops(struct device *dev) {}
+ static inline int vga_switcheroo_init_domain_pm_optimus_hdmi_audio(struct device *dev, struct dev_pm_domain *domain) { return -EINVAL; }
+
+ #endif
+diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
+index a0cc2e95ed1b..b996e6cde6bb 100644
+--- a/include/linux/workqueue.h
++++ b/include/linux/workqueue.h
+@@ -419,7 +419,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,
+ alloc_workqueue("%s", WQ_FREEZABLE | WQ_UNBOUND | WQ_MEM_RECLAIM, \
+ 1, (name))
+ #define create_singlethread_workqueue(name) \
+- alloc_workqueue("%s", WQ_UNBOUND | WQ_MEM_RECLAIM, 1, (name))
++ alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM, name)
+
+ extern void destroy_workqueue(struct workqueue_struct *wq);
+
+diff --git a/include/net/regulatory.h b/include/net/regulatory.h
+index 259992444e80..dad7ab20a8cb 100644
+--- a/include/net/regulatory.h
++++ b/include/net/regulatory.h
+@@ -167,7 +167,7 @@ struct ieee80211_reg_rule {
+ struct ieee80211_regdomain {
+ struct rcu_head rcu_head;
+ u32 n_reg_rules;
+- char alpha2[2];
++ char alpha2[3];
+ enum nl80211_dfs_regions dfs_region;
+ struct ieee80211_reg_rule reg_rules[];
+ };
+diff --git a/include/uapi/drm/radeon_drm.h b/include/uapi/drm/radeon_drm.h
+index 1cc0b610f162..79719f940ea4 100644
+--- a/include/uapi/drm/radeon_drm.h
++++ b/include/uapi/drm/radeon_drm.h
+@@ -942,6 +942,7 @@ struct drm_radeon_cs_chunk {
+ };
+
+ /* drm_radeon_cs_reloc.flags */
++#define RADEON_RELOC_PRIO_MASK (0xf << 0)
+
+ struct drm_radeon_cs_reloc {
+ uint32_t handle;
+diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
+index c38355c1f3c9..1590c49cae57 100644
+--- a/include/uapi/linux/xattr.h
++++ b/include/uapi/linux/xattr.h
+@@ -13,7 +13,7 @@
+ #ifndef _UAPI_LINUX_XATTR_H
+ #define _UAPI_LINUX_XATTR_H
+
+-#ifdef __UAPI_DEF_XATTR
++#if __UAPI_DEF_XATTR
+ #define __USE_KERNEL_XATTR_DEFS
+
+ #define XATTR_CREATE 0x1 /* set value, fail if attr already exists */
+diff --git a/kernel/cgroup.c b/kernel/cgroup.c
+index 70776aec2562..0a46b2aa9dfb 100644
+--- a/kernel/cgroup.c
++++ b/kernel/cgroup.c
+@@ -1031,6 +1031,11 @@ static void cgroup_get(struct cgroup *cgrp)
+ css_get(&cgrp->self);
+ }
+
++static bool cgroup_tryget(struct cgroup *cgrp)
++{
++ return css_tryget(&cgrp->self);
++}
++
+ static void cgroup_put(struct cgroup *cgrp)
+ {
+ css_put(&cgrp->self);
+@@ -1091,7 +1096,8 @@ static struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn)
+ * protection against removal. Ensure @cgrp stays accessible and
+ * break the active_ref protection.
+ */
+- cgroup_get(cgrp);
++ if (!cgroup_tryget(cgrp))
++ return NULL;
+ kernfs_break_active_protection(kn);
+
+ mutex_lock(&cgroup_mutex);
+@@ -3827,7 +3833,6 @@ static int pidlist_array_load(struct cgroup *cgrp, enum cgroup_filetype type,
+
+ l = cgroup_pidlist_find_create(cgrp, type);
+ if (!l) {
+- mutex_unlock(&cgrp->pidlist_mutex);
+ pidlist_free(array);
+ return -ENOMEM;
+ }
+@@ -4236,6 +4241,15 @@ static void css_release_work_fn(struct work_struct *work)
+ /* cgroup release path */
+ cgroup_idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+ cgrp->id = -1;
++
++ /*
++ * There are two control paths which try to determine
++ * cgroup from dentry without going through kernfs -
++ * cgroupstats_build() and css_tryget_online_from_dir().
++ * Those are supported by RCU protecting clearing of
++ * cgrp->kn->priv backpointer.
++ */
++ RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv, NULL);
+ }
+
+ mutex_unlock(&cgroup_mutex);
+@@ -4387,6 +4401,11 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
+ struct kernfs_node *kn;
+ int ssid, ret;
+
++ /* Do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable.
++ */
++ if (strchr(name, '\n'))
++ return -EINVAL;
++
+ parent = cgroup_kn_lock_live(parent_kn);
+ if (!parent)
+ return -ENODEV;
+@@ -4656,16 +4675,6 @@ static int cgroup_rmdir(struct kernfs_node *kn)
+
+ cgroup_kn_unlock(kn);
+
+- /*
+- * There are two control paths which try to determine cgroup from
+- * dentry without going through kernfs - cgroupstats_build() and
+- * css_tryget_online_from_dir(). Those are supported by RCU
+- * protecting clearing of cgrp->kn->priv backpointer, which should
+- * happen after all files under it have been removed.
+- */
+- if (!ret)
+- RCU_INIT_POINTER(*(void __rcu __force **)&kn->priv, NULL);
+-
+ cgroup_put(cgrp);
+ return ret;
+ }
+@@ -5231,7 +5240,7 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
+ /*
+ * This path doesn't originate from kernfs and @kn could already
+ * have been or be removed at any point. @kn->priv is RCU
+- * protected for this access. See cgroup_rmdir() for details.
++ * protected for this access. See css_release_work_fn() for details.
+ */
+ cgrp = rcu_dereference(kn->priv);
+ if (cgrp)
+diff --git a/kernel/events/core.c b/kernel/events/core.c
+index 6b17ac1b0c2a..f626c9f1f3c0 100644
+--- a/kernel/events/core.c
++++ b/kernel/events/core.c
+@@ -1523,6 +1523,11 @@ retry:
+ */
+ if (ctx->is_active) {
+ raw_spin_unlock_irq(&ctx->lock);
++ /*
++ * Reload the task pointer, it might have been changed by
++ * a concurrent perf_event_context_sched_out().
++ */
++ task = ctx->task;
+ goto retry;
+ }
+
+@@ -1966,6 +1971,11 @@ retry:
+ */
+ if (ctx->is_active) {
+ raw_spin_unlock_irq(&ctx->lock);
++ /*
++ * Reload the task pointer, it might have been changed by
++ * a concurrent perf_event_context_sched_out().
++ */
++ task = ctx->task;
+ goto retry;
+ }
+
+diff --git a/kernel/futex.c b/kernel/futex.c
+index b632b5f3f094..c20fb395a672 100644
+--- a/kernel/futex.c
++++ b/kernel/futex.c
+@@ -2628,6 +2628,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
+ * shared futexes. We need to compare the keys:
+ */
+ if (match_futex(&q.key, &key2)) {
++ queue_unlock(hb);
+ ret = -EINVAL;
+ goto out_put_keys;
+ }
+diff --git a/kernel/kcmp.c b/kernel/kcmp.c
+index e30ac0fe61c3..0aa69ea1d8fd 100644
+--- a/kernel/kcmp.c
++++ b/kernel/kcmp.c
+@@ -44,11 +44,12 @@ static long kptr_obfuscate(long v, int type)
+ */
+ static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
+ {
+- long ret;
++ long t1, t2;
+
+- ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
++ t1 = kptr_obfuscate((long)v1, type);
++ t2 = kptr_obfuscate((long)v2, type);
+
+- return (ret < 0) | ((ret > 0) << 1);
++ return (t1 < t2) | ((t1 > t2) << 1);
+ }
+
+ /* The caller must have pinned the task */
+diff --git a/kernel/module.c b/kernel/module.c
+index 81e727cf6df9..673aeb0c25dc 100644
+--- a/kernel/module.c
++++ b/kernel/module.c
+@@ -3308,6 +3308,11 @@ static int load_module(struct load_info *info, const char __user *uargs,
+ mutex_lock(&module_mutex);
+ module_bug_cleanup(mod);
+ mutex_unlock(&module_mutex);
++
++ /* we can't deallocate the module until we clear memory protection */
++ unset_module_init_ro_nx(mod);
++ unset_module_core_ro_nx(mod);
++
+ ddebug_cleanup:
+ dynamic_debug_remove(info->debug);
+ synchronize_sched();
+diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
+index 13e839dbca07..971285d5b7a0 100644
+--- a/kernel/printk/printk.c
++++ b/kernel/printk/printk.c
+@@ -1617,15 +1617,15 @@ asmlinkage int vprintk_emit(int facility, int level,
+ raw_spin_lock(&logbuf_lock);
+ logbuf_cpu = this_cpu;
+
+- if (recursion_bug) {
++ if (unlikely(recursion_bug)) {
+ static const char recursion_msg[] =
+ "BUG: recent printk recursion!";
+
+ recursion_bug = 0;
+- text_len = strlen(recursion_msg);
+ /* emit KERN_CRIT message */
+ printed_len += log_store(0, 2, LOG_PREFIX|LOG_NEWLINE, 0,
+- NULL, 0, recursion_msg, text_len);
++ NULL, 0, recursion_msg,
++ strlen(recursion_msg));
+ }
+
+ /*
+diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
+index fe75444ae7ec..cd45a0727a16 100644
+--- a/kernel/time/alarmtimer.c
++++ b/kernel/time/alarmtimer.c
+@@ -464,18 +464,26 @@ static enum alarmtimer_type clock2alarm(clockid_t clockid)
+ static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
+ ktime_t now)
+ {
++ unsigned long flags;
+ struct k_itimer *ptr = container_of(alarm, struct k_itimer,
+ it.alarm.alarmtimer);
+- if (posix_timer_event(ptr, 0) != 0)
+- ptr->it_overrun++;
++ enum alarmtimer_restart result = ALARMTIMER_NORESTART;
++
++ spin_lock_irqsave(&ptr->it_lock, flags);
++ if ((ptr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) {
++ if (posix_timer_event(ptr, 0) != 0)
++ ptr->it_overrun++;
++ }
+
+ /* Re-add periodic timers */
+ if (ptr->it.alarm.interval.tv64) {
+ ptr->it_overrun += alarm_forward(alarm, now,
+ ptr->it.alarm.interval);
+- return ALARMTIMER_RESTART;
++ result = ALARMTIMER_RESTART;
+ }
+- return ALARMTIMER_NORESTART;
++ spin_unlock_irqrestore(&ptr->it_lock, flags);
++
++ return result;
+ }
+
+ /**
+@@ -541,18 +549,22 @@ static int alarm_timer_create(struct k_itimer *new_timer)
+ * @new_timer: k_itimer pointer
+ * @cur_setting: itimerspec data to fill
+ *
+- * Copies the itimerspec data out from the k_itimer
++ * Copies out the current itimerspec data
+ */
+ static void alarm_timer_get(struct k_itimer *timr,
+ struct itimerspec *cur_setting)
+ {
+- memset(cur_setting, 0, sizeof(struct itimerspec));
++ ktime_t relative_expiry_time =
++ alarm_expires_remaining(&(timr->it.alarm.alarmtimer));
++
++ if (ktime_to_ns(relative_expiry_time) > 0) {
++ cur_setting->it_value = ktime_to_timespec(relative_expiry_time);
++ } else {
++ cur_setting->it_value.tv_sec = 0;
++ cur_setting->it_value.tv_nsec = 0;
++ }
+
+- cur_setting->it_interval =
+- ktime_to_timespec(timr->it.alarm.interval);
+- cur_setting->it_value =
+- ktime_to_timespec(timr->it.alarm.alarmtimer.node.expires);
+- return;
++ cur_setting->it_interval = ktime_to_timespec(timr->it.alarm.interval);
+ }
+
+ /**
+diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
+index ac9d1dad630b..ca167e660e10 100644
+--- a/kernel/trace/ftrace.c
++++ b/kernel/trace/ftrace.c
+@@ -65,15 +65,21 @@
+ #define FL_GLOBAL_CONTROL_MASK (FTRACE_OPS_FL_CONTROL)
+
+ #ifdef CONFIG_DYNAMIC_FTRACE
+-#define INIT_REGEX_LOCK(opsname) \
+- .regex_lock = __MUTEX_INITIALIZER(opsname.regex_lock),
++#define INIT_OPS_HASH(opsname) \
++ .func_hash = &opsname.local_hash, \
++ .local_hash.regex_lock = __MUTEX_INITIALIZER(opsname.local_hash.regex_lock),
++#define ASSIGN_OPS_HASH(opsname, val) \
++ .func_hash = val, \
++ .local_hash.regex_lock = __MUTEX_INITIALIZER(opsname.local_hash.regex_lock),
+ #else
+-#define INIT_REGEX_LOCK(opsname)
++#define INIT_OPS_HASH(opsname)
++#define ASSIGN_OPS_HASH(opsname, val)
+ #endif
+
+ static struct ftrace_ops ftrace_list_end __read_mostly = {
+ .func = ftrace_stub,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_STUB,
++ INIT_OPS_HASH(ftrace_list_end)
+ };
+
+ /* ftrace_enabled is a method to turn ftrace on or off */
+@@ -108,6 +114,7 @@ static struct ftrace_ops *ftrace_ops_list __read_mostly = &ftrace_list_end;
+ ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub;
+ ftrace_func_t ftrace_pid_function __read_mostly = ftrace_stub;
+ static struct ftrace_ops global_ops;
++static struct ftrace_ops graph_ops;
+ static struct ftrace_ops control_ops;
+
+ #if ARCH_SUPPORTS_FTRACE_OPS
+@@ -143,7 +150,8 @@ static inline void ftrace_ops_init(struct ftrace_ops *ops)
+ {
+ #ifdef CONFIG_DYNAMIC_FTRACE
+ if (!(ops->flags & FTRACE_OPS_FL_INITIALIZED)) {
+- mutex_init(&ops->regex_lock);
++ mutex_init(&ops->local_hash.regex_lock);
++ ops->func_hash = &ops->local_hash;
+ ops->flags |= FTRACE_OPS_FL_INITIALIZED;
+ }
+ #endif
+@@ -902,7 +910,7 @@ static void unregister_ftrace_profiler(void)
+ static struct ftrace_ops ftrace_profile_ops __read_mostly = {
+ .func = function_profile_call,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(ftrace_profile_ops)
++ INIT_OPS_HASH(ftrace_profile_ops)
+ };
+
+ static int register_ftrace_profiler(void)
+@@ -1082,11 +1090,12 @@ static const struct ftrace_hash empty_hash = {
+ #define EMPTY_HASH ((struct ftrace_hash *)&empty_hash)
+
+ static struct ftrace_ops global_ops = {
+- .func = ftrace_stub,
+- .notrace_hash = EMPTY_HASH,
+- .filter_hash = EMPTY_HASH,
+- .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(global_ops)
++ .func = ftrace_stub,
++ .local_hash.notrace_hash = EMPTY_HASH,
++ .local_hash.filter_hash = EMPTY_HASH,
++ INIT_OPS_HASH(global_ops)
++ .flags = FTRACE_OPS_FL_RECURSION_SAFE |
++ FTRACE_OPS_FL_INITIALIZED,
+ };
+
+ struct ftrace_page {
+@@ -1227,8 +1236,8 @@ static void free_ftrace_hash_rcu(struct ftrace_hash *hash)
+ void ftrace_free_filter(struct ftrace_ops *ops)
+ {
+ ftrace_ops_init(ops);
+- free_ftrace_hash(ops->filter_hash);
+- free_ftrace_hash(ops->notrace_hash);
++ free_ftrace_hash(ops->func_hash->filter_hash);
++ free_ftrace_hash(ops->func_hash->notrace_hash);
+ }
+
+ static struct ftrace_hash *alloc_ftrace_hash(int size_bits)
+@@ -1289,9 +1298,9 @@ alloc_and_copy_ftrace_hash(int size_bits, struct ftrace_hash *hash)
+ }
+
+ static void
+-ftrace_hash_rec_disable(struct ftrace_ops *ops, int filter_hash);
++ftrace_hash_rec_disable_modify(struct ftrace_ops *ops, int filter_hash);
+ static void
+-ftrace_hash_rec_enable(struct ftrace_ops *ops, int filter_hash);
++ftrace_hash_rec_enable_modify(struct ftrace_ops *ops, int filter_hash);
+
+ static int
+ ftrace_hash_move(struct ftrace_ops *ops, int enable,
+@@ -1311,7 +1320,7 @@ ftrace_hash_move(struct ftrace_ops *ops, int enable,
+ * Remove the current set, update the hash and add
+ * them back.
+ */
+- ftrace_hash_rec_disable(ops, enable);
++ ftrace_hash_rec_disable_modify(ops, enable);
+
+ /*
+ * If the new source is empty, just free dst and assign it
+@@ -1360,7 +1369,7 @@ ftrace_hash_move(struct ftrace_ops *ops, int enable,
+ * On success, we enable the new hash.
+ * On failure, we re-enable the original hash.
+ */
+- ftrace_hash_rec_enable(ops, enable);
++ ftrace_hash_rec_enable_modify(ops, enable);
+
+ return ret;
+ }
+@@ -1394,8 +1403,8 @@ ftrace_ops_test(struct ftrace_ops *ops, unsigned long ip, void *regs)
+ return 0;
+ #endif
+
+- filter_hash = rcu_dereference_raw_notrace(ops->filter_hash);
+- notrace_hash = rcu_dereference_raw_notrace(ops->notrace_hash);
++ filter_hash = rcu_dereference_raw_notrace(ops->func_hash->filter_hash);
++ notrace_hash = rcu_dereference_raw_notrace(ops->func_hash->notrace_hash);
+
+ if ((ftrace_hash_empty(filter_hash) ||
+ ftrace_lookup_ip(filter_hash, ip)) &&
+@@ -1519,14 +1528,14 @@ static void __ftrace_hash_rec_update(struct ftrace_ops *ops,
+ * gets inversed.
+ */
+ if (filter_hash) {
+- hash = ops->filter_hash;
+- other_hash = ops->notrace_hash;
++ hash = ops->func_hash->filter_hash;
++ other_hash = ops->func_hash->notrace_hash;
+ if (ftrace_hash_empty(hash))
+ all = 1;
+ } else {
+ inc = !inc;
+- hash = ops->notrace_hash;
+- other_hash = ops->filter_hash;
++ hash = ops->func_hash->notrace_hash;
++ other_hash = ops->func_hash->filter_hash;
+ /*
+ * If the notrace hash has no items,
+ * then there's nothing to do.
+@@ -1604,6 +1613,41 @@ static void ftrace_hash_rec_enable(struct ftrace_ops *ops,
+ __ftrace_hash_rec_update(ops, filter_hash, 1);
+ }
+
++static void ftrace_hash_rec_update_modify(struct ftrace_ops *ops,
++ int filter_hash, int inc)
++{
++ struct ftrace_ops *op;
++
++ __ftrace_hash_rec_update(ops, filter_hash, inc);
++
++ if (ops->func_hash != &global_ops.local_hash)
++ return;
++
++ /*
++ * If the ops shares the global_ops hash, then we need to update
++ * all ops that are enabled and use this hash.
++ */
++ do_for_each_ftrace_op(op, ftrace_ops_list) {
++ /* Already done */
++ if (op == ops)
++ continue;
++ if (op->func_hash == &global_ops.local_hash)
++ __ftrace_hash_rec_update(op, filter_hash, inc);
++ } while_for_each_ftrace_op(op);
++}
++
++static void ftrace_hash_rec_disable_modify(struct ftrace_ops *ops,
++ int filter_hash)
++{
++ ftrace_hash_rec_update_modify(ops, filter_hash, 0);
++}
++
++static void ftrace_hash_rec_enable_modify(struct ftrace_ops *ops,
++ int filter_hash)
++{
++ ftrace_hash_rec_update_modify(ops, filter_hash, 1);
++}
++
+ static void print_ip_ins(const char *fmt, unsigned char *p)
+ {
+ int i;
+@@ -1809,7 +1853,7 @@ __ftrace_replace_code(struct dyn_ftrace *rec, int enable)
+ return ftrace_make_call(rec, ftrace_addr);
+
+ case FTRACE_UPDATE_MAKE_NOP:
+- return ftrace_make_nop(NULL, rec, ftrace_addr);
++ return ftrace_make_nop(NULL, rec, ftrace_old_addr);
+
+ case FTRACE_UPDATE_MODIFY_CALL:
+ return ftrace_modify_call(rec, ftrace_old_addr, ftrace_addr);
+@@ -2196,8 +2240,8 @@ static inline int ops_traces_mod(struct ftrace_ops *ops)
+ * Filter_hash being empty will default to trace module.
+ * But notrace hash requires a test of individual module functions.
+ */
+- return ftrace_hash_empty(ops->filter_hash) &&
+- ftrace_hash_empty(ops->notrace_hash);
++ return ftrace_hash_empty(ops->func_hash->filter_hash) &&
++ ftrace_hash_empty(ops->func_hash->notrace_hash);
+ }
+
+ /*
+@@ -2219,12 +2263,12 @@ ops_references_rec(struct ftrace_ops *ops, struct dyn_ftrace *rec)
+ return 0;
+
+ /* The function must be in the filter */
+- if (!ftrace_hash_empty(ops->filter_hash) &&
+- !ftrace_lookup_ip(ops->filter_hash, rec->ip))
++ if (!ftrace_hash_empty(ops->func_hash->filter_hash) &&
++ !ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip))
+ return 0;
+
+ /* If in notrace hash, we ignore it too */
+- if (ftrace_lookup_ip(ops->notrace_hash, rec->ip))
++ if (ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip))
+ return 0;
+
+ return 1;
+@@ -2544,10 +2588,10 @@ t_next(struct seq_file *m, void *v, loff_t *pos)
+ } else {
+ rec = &iter->pg->records[iter->idx++];
+ if (((iter->flags & FTRACE_ITER_FILTER) &&
+- !(ftrace_lookup_ip(ops->filter_hash, rec->ip))) ||
++ !(ftrace_lookup_ip(ops->func_hash->filter_hash, rec->ip))) ||
+
+ ((iter->flags & FTRACE_ITER_NOTRACE) &&
+- !ftrace_lookup_ip(ops->notrace_hash, rec->ip)) ||
++ !ftrace_lookup_ip(ops->func_hash->notrace_hash, rec->ip)) ||
+
+ ((iter->flags & FTRACE_ITER_ENABLED) &&
+ !(rec->flags & FTRACE_FL_ENABLED))) {
+@@ -2596,7 +2640,7 @@ static void *t_start(struct seq_file *m, loff_t *pos)
+ * functions are enabled.
+ */
+ if (iter->flags & FTRACE_ITER_FILTER &&
+- ftrace_hash_empty(ops->filter_hash)) {
++ ftrace_hash_empty(ops->func_hash->filter_hash)) {
+ if (*pos > 0)
+ return t_hash_start(m, pos);
+ iter->flags |= FTRACE_ITER_PRINTALL;
+@@ -2750,12 +2794,12 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
+ iter->ops = ops;
+ iter->flags = flag;
+
+- mutex_lock(&ops->regex_lock);
++ mutex_lock(&ops->func_hash->regex_lock);
+
+ if (flag & FTRACE_ITER_NOTRACE)
+- hash = ops->notrace_hash;
++ hash = ops->func_hash->notrace_hash;
+ else
+- hash = ops->filter_hash;
++ hash = ops->func_hash->filter_hash;
+
+ if (file->f_mode & FMODE_WRITE) {
+ iter->hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, hash);
+@@ -2788,7 +2832,7 @@ ftrace_regex_open(struct ftrace_ops *ops, int flag,
+ file->private_data = iter;
+
+ out_unlock:
+- mutex_unlock(&ops->regex_lock);
++ mutex_unlock(&ops->func_hash->regex_lock);
+
+ return ret;
+ }
+@@ -3026,7 +3070,7 @@ static struct ftrace_ops trace_probe_ops __read_mostly =
+ {
+ .func = function_trace_probe_call,
+ .flags = FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(trace_probe_ops)
++ INIT_OPS_HASH(trace_probe_ops)
+ };
+
+ static int ftrace_probe_registered;
+@@ -3089,7 +3133,7 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ void *data)
+ {
+ struct ftrace_func_probe *entry;
+- struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
++ struct ftrace_hash **orig_hash = &trace_probe_ops.func_hash->filter_hash;
+ struct ftrace_hash *hash;
+ struct ftrace_page *pg;
+ struct dyn_ftrace *rec;
+@@ -3106,7 +3150,7 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ if (WARN_ON(not))
+ return -EINVAL;
+
+- mutex_lock(&trace_probe_ops.regex_lock);
++ mutex_lock(&trace_probe_ops.func_hash->regex_lock);
+
+ hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+ if (!hash) {
+@@ -3175,7 +3219,7 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ out_unlock:
+ mutex_unlock(&ftrace_lock);
+ out:
+- mutex_unlock(&trace_probe_ops.regex_lock);
++ mutex_unlock(&trace_probe_ops.func_hash->regex_lock);
+ free_ftrace_hash(hash);
+
+ return count;
+@@ -3193,7 +3237,7 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ struct ftrace_func_entry *rec_entry;
+ struct ftrace_func_probe *entry;
+ struct ftrace_func_probe *p;
+- struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
++ struct ftrace_hash **orig_hash = &trace_probe_ops.func_hash->filter_hash;
+ struct list_head free_list;
+ struct ftrace_hash *hash;
+ struct hlist_node *tmp;
+@@ -3215,7 +3259,7 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ return;
+ }
+
+- mutex_lock(&trace_probe_ops.regex_lock);
++ mutex_lock(&trace_probe_ops.func_hash->regex_lock);
+
+ hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+ if (!hash)
+@@ -3268,7 +3312,7 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
+ mutex_unlock(&ftrace_lock);
+
+ out_unlock:
+- mutex_unlock(&trace_probe_ops.regex_lock);
++ mutex_unlock(&trace_probe_ops.func_hash->regex_lock);
+ free_ftrace_hash(hash);
+ }
+
+@@ -3464,12 +3508,12 @@ ftrace_set_hash(struct ftrace_ops *ops, unsigned char *buf, int len,
+ if (unlikely(ftrace_disabled))
+ return -ENODEV;
+
+- mutex_lock(&ops->regex_lock);
++ mutex_lock(&ops->func_hash->regex_lock);
+
+ if (enable)
+- orig_hash = &ops->filter_hash;
++ orig_hash = &ops->func_hash->filter_hash;
+ else
+- orig_hash = &ops->notrace_hash;
++ orig_hash = &ops->func_hash->notrace_hash;
+
+ hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
+ if (!hash) {
+@@ -3497,7 +3541,7 @@ ftrace_set_hash(struct ftrace_ops *ops, unsigned char *buf, int len,
+ mutex_unlock(&ftrace_lock);
+
+ out_regex_unlock:
+- mutex_unlock(&ops->regex_lock);
++ mutex_unlock(&ops->func_hash->regex_lock);
+
+ free_ftrace_hash(hash);
+ return ret;
+@@ -3704,15 +3748,15 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
+
+ trace_parser_put(parser);
+
+- mutex_lock(&iter->ops->regex_lock);
++ mutex_lock(&iter->ops->func_hash->regex_lock);
+
+ if (file->f_mode & FMODE_WRITE) {
+ filter_hash = !!(iter->flags & FTRACE_ITER_FILTER);
+
+ if (filter_hash)
+- orig_hash = &iter->ops->filter_hash;
++ orig_hash = &iter->ops->func_hash->filter_hash;
+ else
+- orig_hash = &iter->ops->notrace_hash;
++ orig_hash = &iter->ops->func_hash->notrace_hash;
+
+ mutex_lock(&ftrace_lock);
+ ret = ftrace_hash_move(iter->ops, filter_hash,
+@@ -3723,7 +3767,7 @@ int ftrace_regex_release(struct inode *inode, struct file *file)
+ mutex_unlock(&ftrace_lock);
+ }
+
+- mutex_unlock(&iter->ops->regex_lock);
++ mutex_unlock(&iter->ops->func_hash->regex_lock);
+ free_ftrace_hash(iter->hash);
+ kfree(iter);
+
+@@ -4335,7 +4379,6 @@ void __init ftrace_init(void)
+ static struct ftrace_ops global_ops = {
+ .func = ftrace_stub,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(global_ops)
+ };
+
+ static int __init ftrace_nodyn_init(void)
+@@ -4437,7 +4480,7 @@ ftrace_ops_control_func(unsigned long ip, unsigned long parent_ip,
+ static struct ftrace_ops control_ops = {
+ .func = ftrace_ops_control_func,
+ .flags = FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED,
+- INIT_REGEX_LOCK(control_ops)
++ INIT_OPS_HASH(control_ops)
+ };
+
+ static inline void
+@@ -4873,6 +4916,14 @@ ftrace_enable_sysctl(struct ctl_table *table, int write,
+
+ #ifdef CONFIG_FUNCTION_GRAPH_TRACER
+
++static struct ftrace_ops graph_ops = {
++ .func = ftrace_stub,
++ .flags = FTRACE_OPS_FL_RECURSION_SAFE |
++ FTRACE_OPS_FL_INITIALIZED |
++ FTRACE_OPS_FL_STUB,
++ ASSIGN_OPS_HASH(graph_ops, &global_ops.local_hash)
++};
++
+ static int ftrace_graph_active;
+
+ int ftrace_graph_entry_stub(struct ftrace_graph_ent *trace)
+@@ -5035,12 +5086,28 @@ static int ftrace_graph_entry_test(struct ftrace_graph_ent *trace)
+ */
+ static void update_function_graph_func(void)
+ {
+- if (ftrace_ops_list == &ftrace_list_end ||
+- (ftrace_ops_list == &global_ops &&
+- global_ops.next == &ftrace_list_end))
+- ftrace_graph_entry = __ftrace_graph_entry;
+- else
++ struct ftrace_ops *op;
++ bool do_test = false;
++
++ /*
++ * The graph and global ops share the same set of functions
++ * to test. If any other ops is on the list, then
++ * the graph tracing needs to test if it's the function
++ * it should call.
++ */
++ do_for_each_ftrace_op(op, ftrace_ops_list) {
++ if (op != &global_ops && op != &graph_ops &&
++ op != &ftrace_list_end) {
++ do_test = true;
++ /* in double loop, break out with goto */
++ goto out;
++ }
++ } while_for_each_ftrace_op(op);
++ out:
++ if (do_test)
+ ftrace_graph_entry = ftrace_graph_entry_test;
++ else
++ ftrace_graph_entry = __ftrace_graph_entry;
+ }
+
+ static struct notifier_block ftrace_suspend_notifier = {
+@@ -5081,11 +5148,7 @@ int register_ftrace_graph(trace_func_graph_ret_t retfunc,
+ ftrace_graph_entry = ftrace_graph_entry_test;
+ update_function_graph_func();
+
+- /* Function graph doesn't use the .func field of global_ops */
+- global_ops.flags |= FTRACE_OPS_FL_STUB;
+-
+- ret = ftrace_startup(&global_ops, FTRACE_START_FUNC_RET);
+-
++ ret = ftrace_startup(&graph_ops, FTRACE_START_FUNC_RET);
+ out:
+ mutex_unlock(&ftrace_lock);
+ return ret;
+@@ -5102,8 +5165,7 @@ void unregister_ftrace_graph(void)
+ ftrace_graph_return = (trace_func_graph_ret_t)ftrace_stub;
+ ftrace_graph_entry = ftrace_graph_entry_stub;
+ __ftrace_graph_entry = ftrace_graph_entry_stub;
+- ftrace_shutdown(&global_ops, FTRACE_STOP_FUNC_RET);
+- global_ops.flags &= ~FTRACE_OPS_FL_STUB;
++ ftrace_shutdown(&graph_ops, FTRACE_STOP_FUNC_RET);
+ unregister_pm_notifier(&ftrace_suspend_notifier);
+ unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);
+
+diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
+index b95381ebdd5e..2ff0580d3dcd 100644
+--- a/kernel/trace/ring_buffer.c
++++ b/kernel/trace/ring_buffer.c
+@@ -626,8 +626,22 @@ int ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
+ work = &cpu_buffer->irq_work;
+ }
+
+- work->waiters_pending = true;
+ poll_wait(filp, &work->waiters, poll_table);
++ work->waiters_pending = true;
++ /*
++ * There's a tight race between setting the waiters_pending and
++ * checking if the ring buffer is empty. Once the waiters_pending bit
++ * is set, the next event will wake the task up, but we can get stuck
++ * if there's only a single event in.
++ *
++ * FIXME: Ideally, we need a memory barrier on the writer side as well,
++ * but adding a memory barrier to all events will cause too much of a
++ * performance hit in the fast path. We only need a memory barrier when
++ * the buffer goes from empty to having content. But as this race is
++ * extremely small, and it's not a problem if another event comes in, we
++ * will fix it later.
++ */
++ smp_mb();
+
+ if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
+ (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu)))
+diff --git a/mm/dmapool.c b/mm/dmapool.c
+index 306baa594f95..ba8019b063e1 100644
+--- a/mm/dmapool.c
++++ b/mm/dmapool.c
+@@ -176,7 +176,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
+ if (list_empty(&dev->dma_pools) &&
+ device_create_file(dev, &dev_attr_pools)) {
+ kfree(retval);
+- return NULL;
++ retval = NULL;
+ } else
+ list_add(&retval->pools, &dev->dma_pools);
+ mutex_unlock(&pools_lock);
+diff --git a/mm/memblock.c b/mm/memblock.c
+index 6d2f219a48b0..70fad0c0dafb 100644
+--- a/mm/memblock.c
++++ b/mm/memblock.c
+@@ -192,8 +192,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t size,
+ phys_addr_t align, phys_addr_t start,
+ phys_addr_t end, int nid)
+ {
+- int ret;
+- phys_addr_t kernel_end;
++ phys_addr_t kernel_end, ret;
+
+ /* pump up @end */
+ if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
+diff --git a/mm/memory.c b/mm/memory.c
+index 0a21f3d162ae..533023da2faa 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -1125,7 +1125,7 @@ again:
+ addr) != page->index) {
+ pte_t ptfile = pgoff_to_pte(page->index);
+ if (pte_soft_dirty(ptent))
+- pte_file_mksoft_dirty(ptfile);
++ ptfile = pte_file_mksoft_dirty(ptfile);
+ set_pte_at(mm, addr, pte, ptfile);
+ }
+ if (PageAnon(page))
+diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
+index 3707c71ae4cd..51108165f829 100644
+--- a/mm/percpu-vm.c
++++ b/mm/percpu-vm.c
+@@ -108,7 +108,7 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
+ int page_start, int page_end)
+ {
+ const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_COLD;
+- unsigned int cpu;
++ unsigned int cpu, tcpu;
+ int i;
+
+ for_each_possible_cpu(cpu) {
+@@ -116,14 +116,23 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
+ struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
+
+ *pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+- if (!*pagep) {
+- pcpu_free_pages(chunk, pages, populated,
+- page_start, page_end);
+- return -ENOMEM;
+- }
++ if (!*pagep)
++ goto err;
+ }
+ }
+ return 0;
++
++err:
++ while (--i >= page_start)
++ __free_page(pages[pcpu_page_idx(cpu, i)]);
++
++ for_each_possible_cpu(tcpu) {
++ if (tcpu == cpu)
++ break;
++ for (i = page_start; i < page_end; i++)
++ __free_page(pages[pcpu_page_idx(tcpu, i)]);
++ }
++ return -ENOMEM;
+ }
+
+ /**
+@@ -263,6 +272,7 @@ err:
+ __pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
+ page_end - page_start);
+ }
++ pcpu_post_unmap_tlb_flush(chunk, page_start, page_end);
+ return err;
+ }
+
+diff --git a/mm/percpu.c b/mm/percpu.c
+index 2ddf9a990dbd..492f601df473 100644
+--- a/mm/percpu.c
++++ b/mm/percpu.c
+@@ -1933,6 +1933,8 @@ void __init setup_per_cpu_areas(void)
+
+ if (pcpu_setup_first_chunk(ai, fc) < 0)
+ panic("Failed to initialize percpu areas.");
++
++ pcpu_free_alloc_info(ai);
+ }
+
+ #endif /* CONFIG_SMP */
+diff --git a/mm/shmem.c b/mm/shmem.c
+index af68b15a8fc1..e53ab3a8a8d3 100644
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -2064,8 +2064,10 @@ static int shmem_rename(struct inode *old_dir, struct dentry *old_dentry, struct
+
+ if (new_dentry->d_inode) {
+ (void) shmem_unlink(new_dir, new_dentry);
+- if (they_are_dirs)
++ if (they_are_dirs) {
++ drop_nlink(new_dentry->d_inode);
+ drop_nlink(old_dir);
++ }
+ } else if (they_are_dirs) {
+ drop_nlink(old_dir);
+ inc_nlink(new_dir);
+diff --git a/mm/slab.c b/mm/slab.c
+index 3070b929a1bf..c9103e4cf2c2 100644
+--- a/mm/slab.c
++++ b/mm/slab.c
+@@ -2224,7 +2224,8 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
+ int
+ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
+ {
+- size_t left_over, freelist_size, ralign;
++ size_t left_over, freelist_size;
++ size_t ralign = BYTES_PER_WORD;
+ gfp_t gfp;
+ int err;
+ size_t size = cachep->size;
+@@ -2257,14 +2258,6 @@ __kmem_cache_create (struct kmem_cache *cachep, unsigned long flags)
+ size &= ~(BYTES_PER_WORD - 1);
+ }
+
+- /*
+- * Redzoning and user store require word alignment or possibly larger.
+- * Note this will be overridden by architecture or caller mandated
+- * alignment if either is greater than BYTES_PER_WORD.
+- */
+- if (flags & SLAB_STORE_USER)
+- ralign = BYTES_PER_WORD;
+-
+ if (flags & SLAB_RED_ZONE) {
+ ralign = REDZONE_ALIGN;
+ /* If redzoning, ensure that the second redzone is suitably
+diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
+index 3345401be1b3..c8779f316d30 100644
+--- a/net/mac80211/mlme.c
++++ b/net/mac80211/mlme.c
+@@ -4355,8 +4355,7 @@ int ieee80211_mgd_assoc(struct ieee80211_sub_if_data *sdata,
+ rcu_read_unlock();
+
+ if (bss->wmm_used && bss->uapsd_supported &&
+- (sdata->local->hw.flags & IEEE80211_HW_SUPPORTS_UAPSD) &&
+- sdata->wmm_acm != 0xff) {
++ (sdata->local->hw.flags & IEEE80211_HW_SUPPORTS_UAPSD)) {
+ assoc_data->uapsd = true;
+ ifmgd->flags |= IEEE80211_STA_UAPSD_ENABLED;
+ } else {
+diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
+index e6836755c45d..5c34e8d42e01 100644
+--- a/net/netfilter/ipvs/ip_vs_core.c
++++ b/net/netfilter/ipvs/ip_vs_core.c
+@@ -1906,7 +1906,7 @@ static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
+ {
+ .hook = ip_vs_local_reply6,
+ .owner = THIS_MODULE,
+- .pf = NFPROTO_IPV4,
++ .pf = NFPROTO_IPV6,
+ .hooknum = NF_INET_LOCAL_OUT,
+ .priority = NF_IP6_PRI_NAT_DST + 1,
+ },
+diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
+index 73ba1cc7a88d..6f70bdd3a90a 100644
+--- a/net/netfilter/ipvs/ip_vs_xmit.c
++++ b/net/netfilter/ipvs/ip_vs_xmit.c
+@@ -967,8 +967,8 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
+ iph->nexthdr = IPPROTO_IPV6;
+ iph->payload_len = old_iph->payload_len;
+ be16_add_cpu(&iph->payload_len, sizeof(*old_iph));
+- iph->priority = old_iph->priority;
+ memset(&iph->flow_lbl, 0, sizeof(iph->flow_lbl));
++ ipv6_change_dsfield(iph, 0, ipv6_get_dsfield(old_iph));
+ iph->daddr = cp->daddr.in6;
+ iph->saddr = saddr;
+ iph->hop_limit = old_iph->hop_limit;
+diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
+index 8746ff9a8357..62101ed0d2af 100644
+--- a/net/netfilter/nf_tables_api.c
++++ b/net/netfilter/nf_tables_api.c
+@@ -899,6 +899,9 @@ static struct nft_stats __percpu *nft_stats_alloc(const struct nlattr *attr)
+ static void nft_chain_stats_replace(struct nft_base_chain *chain,
+ struct nft_stats __percpu *newstats)
+ {
++ if (newstats == NULL)
++ return;
++
+ if (chain->stats) {
+ struct nft_stats __percpu *oldstats =
+ nft_dereference(chain->stats);
+diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
+index f4e833005320..7198d660b4de 100644
+--- a/net/netfilter/xt_cgroup.c
++++ b/net/netfilter/xt_cgroup.c
+@@ -31,7 +31,7 @@ static int cgroup_mt_check(const struct xt_mtchk_param *par)
+ if (info->invert & ~1)
+ return -EINVAL;
+
+- return info->id ? 0 : -EINVAL;
++ return 0;
+ }
+
+ static bool
+diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
+index a3910fc2122b..47dc6836830a 100644
+--- a/net/netfilter/xt_hashlimit.c
++++ b/net/netfilter/xt_hashlimit.c
+@@ -104,7 +104,7 @@ struct xt_hashlimit_htable {
+ spinlock_t lock; /* lock for list_head */
+ u_int32_t rnd; /* random seed for hash */
+ unsigned int count; /* number entries in table */
+- struct timer_list timer; /* timer for gc */
++ struct delayed_work gc_work;
+
+ /* seq_file stuff */
+ struct proc_dir_entry *pde;
+@@ -213,7 +213,7 @@ dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent)
+ call_rcu_bh(&ent->rcu, dsthash_free_rcu);
+ ht->count--;
+ }
+-static void htable_gc(unsigned long htlong);
++static void htable_gc(struct work_struct *work);
+
+ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
+ u_int8_t family)
+@@ -273,9 +273,9 @@ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo,
+ }
+ hinfo->net = net;
+
+- setup_timer(&hinfo->timer, htable_gc, (unsigned long)hinfo);
+- hinfo->timer.expires = jiffies + msecs_to_jiffies(hinfo->cfg.gc_interval);
+- add_timer(&hinfo->timer);
++ INIT_DEFERRABLE_WORK(&hinfo->gc_work, htable_gc);
++ queue_delayed_work(system_power_efficient_wq, &hinfo->gc_work,
++ msecs_to_jiffies(hinfo->cfg.gc_interval));
+
+ hlist_add_head(&hinfo->node, &hashlimit_net->htables);
+
+@@ -300,29 +300,30 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht,
+ {
+ unsigned int i;
+
+- /* lock hash table and iterate over it */
+- spin_lock_bh(&ht->lock);
+ for (i = 0; i < ht->cfg.size; i++) {
+ struct dsthash_ent *dh;
+ struct hlist_node *n;
++
++ spin_lock_bh(&ht->lock);
+ hlist_for_each_entry_safe(dh, n, &ht->hash[i], node) {
+ if ((*select)(ht, dh))
+ dsthash_free(ht, dh);
+ }
++ spin_unlock_bh(&ht->lock);
++ cond_resched();
+ }
+- spin_unlock_bh(&ht->lock);
+ }
+
+-/* hash table garbage collector, run by timer */
+-static void htable_gc(unsigned long htlong)
++static void htable_gc(struct work_struct *work)
+ {
+- struct xt_hashlimit_htable *ht = (struct xt_hashlimit_htable *)htlong;
++ struct xt_hashlimit_htable *ht;
++
++ ht = container_of(work, struct xt_hashlimit_htable, gc_work.work);
+
+ htable_selective_cleanup(ht, select_gc);
+
+- /* re-add the timer accordingly */
+- ht->timer.expires = jiffies + msecs_to_jiffies(ht->cfg.gc_interval);
+- add_timer(&ht->timer);
++ queue_delayed_work(system_power_efficient_wq,
++ &ht->gc_work, msecs_to_jiffies(ht->cfg.gc_interval));
+ }
+
+ static void htable_remove_proc_entry(struct xt_hashlimit_htable *hinfo)
+@@ -341,7 +342,7 @@ static void htable_remove_proc_entry(struct xt_hashlimit_htable *hinfo)
+
+ static void htable_destroy(struct xt_hashlimit_htable *hinfo)
+ {
+- del_timer_sync(&hinfo->timer);
++ cancel_delayed_work_sync(&hinfo->gc_work);
+ htable_remove_proc_entry(hinfo);
+ htable_selective_cleanup(hinfo, select_all);
+ kfree(hinfo->name);
+diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
+index 6668daf69326..d702af40ddea 100644
+--- a/net/wireless/nl80211.c
++++ b/net/wireless/nl80211.c
+@@ -6978,6 +6978,9 @@ void __cfg80211_send_event_skb(struct sk_buff *skb, gfp_t gfp)
+ struct nlattr *data = ((void **)skb->cb)[2];
+ enum nl80211_multicast_groups mcgrp = NL80211_MCGRP_TESTMODE;
+
++ /* clear CB data for netlink core to own from now on */
++ memset(skb->cb, 0, sizeof(skb->cb));
++
+ nla_nest_end(skb, data);
+ genlmsg_end(skb, hdr);
+
+@@ -9300,6 +9303,9 @@ int cfg80211_vendor_cmd_reply(struct sk_buff *skb)
+ void *hdr = ((void **)skb->cb)[1];
+ struct nlattr *data = ((void **)skb->cb)[2];
+
++ /* clear CB data for netlink core to own from now on */
++ memset(skb->cb, 0, sizeof(skb->cb));
++
+ if (WARN_ON(!rdev->cur_cmd_info)) {
+ kfree_skb(skb);
+ return -EINVAL;
+diff --git a/sound/core/info.c b/sound/core/info.c
+index 051d55b05521..9f404e965ea2 100644
+--- a/sound/core/info.c
++++ b/sound/core/info.c
+@@ -684,7 +684,7 @@ int snd_info_card_free(struct snd_card *card)
+ * snd_info_get_line - read one line from the procfs buffer
+ * @buffer: the procfs buffer
+ * @line: the buffer to store
+- * @len: the max. buffer size - 1
++ * @len: the max. buffer size
+ *
+ * Reads one line from the buffer and stores the string.
+ *
+@@ -704,7 +704,7 @@ int snd_info_get_line(struct snd_info_buffer *buffer, char *line, int len)
+ buffer->stop = 1;
+ if (c == '\n')
+ break;
+- if (len) {
++ if (len > 1) {
+ len--;
+ *line++ = c;
+ }
+diff --git a/sound/core/pcm_lib.c b/sound/core/pcm_lib.c
+index 9acc77eae487..0032278567ad 100644
+--- a/sound/core/pcm_lib.c
++++ b/sound/core/pcm_lib.c
+@@ -1782,14 +1782,16 @@ static int snd_pcm_lib_ioctl_fifo_size(struct snd_pcm_substream *substream,
+ {
+ struct snd_pcm_hw_params *params = arg;
+ snd_pcm_format_t format;
+- int channels, width;
++ int channels;
++ ssize_t frame_size;
+
+ params->fifo_size = substream->runtime->hw.fifo_size;
+ if (!(substream->runtime->hw.info & SNDRV_PCM_INFO_FIFO_IN_FRAMES)) {
+ format = params_format(params);
+ channels = params_channels(params);
+- width = snd_pcm_format_physical_width(format);
+- params->fifo_size /= width * channels;
++ frame_size = snd_pcm_format_size(format, channels);
++ if (frame_size > 0)
++ params->fifo_size /= (unsigned)frame_size;
+ }
+ return 0;
+ }
+diff --git a/sound/firewire/amdtp.c b/sound/firewire/amdtp.c
+index f96bf4c7c232..95fc2eaf11dc 100644
+--- a/sound/firewire/amdtp.c
++++ b/sound/firewire/amdtp.c
+@@ -507,7 +507,16 @@ static void amdtp_pull_midi(struct amdtp_stream *s,
+ static void update_pcm_pointers(struct amdtp_stream *s,
+ struct snd_pcm_substream *pcm,
+ unsigned int frames)
+-{ unsigned int ptr;
++{
++ unsigned int ptr;
++
++ /*
++ * In IEC 61883-6, one data block represents one event. In ALSA, one
++ * event equals to one PCM frame. But Dice has a quirk to transfer
++ * two PCM frames in one data block.
++ */
++ if (s->double_pcm_frames)
++ frames *= 2;
+
+ ptr = s->pcm_buffer_pointer + frames;
+ if (ptr >= pcm->runtime->buffer_size)
+diff --git a/sound/firewire/amdtp.h b/sound/firewire/amdtp.h
+index d8ee7b0e9386..4823c08196ac 100644
+--- a/sound/firewire/amdtp.h
++++ b/sound/firewire/amdtp.h
+@@ -125,6 +125,7 @@ struct amdtp_stream {
+ unsigned int pcm_buffer_pointer;
+ unsigned int pcm_period_pointer;
+ bool pointer_flush;
++ bool double_pcm_frames;
+
+ struct snd_rawmidi_substream *midi[AMDTP_MAX_CHANNELS_FOR_MIDI * 8];
+
+diff --git a/sound/firewire/dice.c b/sound/firewire/dice.c
+index a9a30c0161f1..e3a04d69c853 100644
+--- a/sound/firewire/dice.c
++++ b/sound/firewire/dice.c
+@@ -567,10 +567,14 @@ static int dice_hw_params(struct snd_pcm_substream *substream,
+ return err;
+
+ /*
+- * At rates above 96 kHz, pretend that the stream runs at half the
+- * actual sample rate with twice the number of channels; two samples
+- * of a channel are stored consecutively in the packet. Requires
+- * blocking mode and PCM buffer size should be aligned to SYT_INTERVAL.
++ * At 176.4/192.0 kHz, Dice has a quirk to transfer two PCM frames in
++ * one data block of AMDTP packet. Thus sampling transfer frequency is
++ * a half of PCM sampling frequency, i.e. PCM frames at 192.0 kHz are
++ * transferred on AMDTP packets at 96 kHz. Two successive samples of a
++ * channel are stored consecutively in the packet. This quirk is called
++ * as 'Dual Wire'.
++ * For this quirk, blocking mode is required and PCM buffer size should
++ * be aligned to SYT_INTERVAL.
+ */
+ channels = params_channels(hw_params);
+ if (rate_index > 4) {
+@@ -579,18 +583,25 @@ static int dice_hw_params(struct snd_pcm_substream *substream,
+ return err;
+ }
+
+- for (i = 0; i < channels; i++) {
+- dice->stream.pcm_positions[i * 2] = i;
+- dice->stream.pcm_positions[i * 2 + 1] = i + channels;
+- }
+-
+ rate /= 2;
+ channels *= 2;
++ dice->stream.double_pcm_frames = true;
++ } else {
++ dice->stream.double_pcm_frames = false;
+ }
+
+ mode = rate_index_to_mode(rate_index);
+ amdtp_stream_set_parameters(&dice->stream, rate, channels,
+ dice->rx_midi_ports[mode]);
++ if (rate_index > 4) {
++ channels /= 2;
++
++ for (i = 0; i < channels; i++) {
++ dice->stream.pcm_positions[i] = i * 2;
++ dice->stream.pcm_positions[i + channels] = i * 2 + 1;
++ }
++ }
++
+ amdtp_stream_set_pcm_format(&dice->stream,
+ params_format(hw_params));
+
+diff --git a/sound/pci/hda/patch_conexant.c b/sound/pci/hda/patch_conexant.c
+index 1dc7e974f3b1..d5792653e77b 100644
+--- a/sound/pci/hda/patch_conexant.c
++++ b/sound/pci/hda/patch_conexant.c
+@@ -2822,6 +2822,7 @@ enum {
+ CXT_FIXUP_HEADPHONE_MIC_PIN,
+ CXT_FIXUP_HEADPHONE_MIC,
+ CXT_FIXUP_GPIO1,
++ CXT_FIXUP_ASPIRE_DMIC,
+ CXT_FIXUP_THINKPAD_ACPI,
+ CXT_FIXUP_OLPC_XO,
+ CXT_FIXUP_CAP_MIX_AMP,
+@@ -3269,6 +3270,12 @@ static const struct hda_fixup cxt_fixups[] = {
+ { }
+ },
+ },
++ [CXT_FIXUP_ASPIRE_DMIC] = {
++ .type = HDA_FIXUP_FUNC,
++ .v.func = cxt_fixup_stereo_dmic,
++ .chained = true,
++ .chain_id = CXT_FIXUP_GPIO1,
++ },
+ [CXT_FIXUP_THINKPAD_ACPI] = {
+ .type = HDA_FIXUP_FUNC,
+ .v.func = hda_fixup_thinkpad_acpi,
+@@ -3349,7 +3356,7 @@ static const struct hda_model_fixup cxt5051_fixup_models[] = {
+
+ static const struct snd_pci_quirk cxt5066_fixups[] = {
+ SND_PCI_QUIRK(0x1025, 0x0543, "Acer Aspire One 522", CXT_FIXUP_STEREO_DMIC),
+- SND_PCI_QUIRK(0x1025, 0x054c, "Acer Aspire 3830TG", CXT_FIXUP_GPIO1),
++ SND_PCI_QUIRK(0x1025, 0x054c, "Acer Aspire 3830TG", CXT_FIXUP_ASPIRE_DMIC),
+ SND_PCI_QUIRK(0x1043, 0x138d, "Asus", CXT_FIXUP_HEADPHONE_MIC_PIN),
+ SND_PCI_QUIRK(0x152d, 0x0833, "OLPC XO-1.5", CXT_FIXUP_OLPC_XO),
+ SND_PCI_QUIRK(0x17aa, 0x20f2, "Lenovo T400", CXT_PINCFG_LENOVO_TP410),
+@@ -3375,6 +3382,7 @@ static const struct hda_model_fixup cxt5066_fixup_models[] = {
+ { .id = CXT_PINCFG_LENOVO_TP410, .name = "tp410" },
+ { .id = CXT_FIXUP_THINKPAD_ACPI, .name = "thinkpad" },
+ { .id = CXT_PINCFG_LEMOTE_A1004, .name = "lemote-a1004" },
++ { .id = CXT_PINCFG_LEMOTE_A1205, .name = "lemote-a1205" },
+ { .id = CXT_FIXUP_OLPC_XO, .name = "olpc-xo" },
+ {}
+ };
+diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
+index 25728aaacc26..88e4623d4f97 100644
+--- a/sound/pci/hda/patch_realtek.c
++++ b/sound/pci/hda/patch_realtek.c
+@@ -327,6 +327,7 @@ static void alc_auto_init_amp(struct hda_codec *codec, int type)
+ case 0x10ec0885:
+ case 0x10ec0887:
+ /*case 0x10ec0889:*/ /* this causes an SPDIF problem */
++ case 0x10ec0900:
+ alc889_coef_init(codec);
+ break;
+ case 0x10ec0888:
+@@ -2349,6 +2350,7 @@ static int patch_alc882(struct hda_codec *codec)
+ switch (codec->vendor_id) {
+ case 0x10ec0882:
+ case 0x10ec0885:
++ case 0x10ec0900:
+ break;
+ default:
+ /* ALC883 and variants */
+diff --git a/sound/pci/hda/patch_sigmatel.c b/sound/pci/hda/patch_sigmatel.c
+index 4d3a3b932690..619aec71b1e2 100644
+--- a/sound/pci/hda/patch_sigmatel.c
++++ b/sound/pci/hda/patch_sigmatel.c
+@@ -565,8 +565,8 @@ static void stac_init_power_map(struct hda_codec *codec)
+ if (snd_hda_jack_tbl_get(codec, nid))
+ continue;
+ if (def_conf == AC_JACK_PORT_COMPLEX &&
+- !(spec->vref_mute_led_nid == nid ||
+- is_jack_detectable(codec, nid))) {
++ spec->vref_mute_led_nid != nid &&
++ is_jack_detectable(codec, nid)) {
+ snd_hda_jack_detect_enable_callback(codec, nid,
+ STAC_PWR_EVENT,
+ jack_update_power);
+@@ -4263,11 +4263,18 @@ static int stac_parse_auto_config(struct hda_codec *codec)
+ return err;
+ }
+
+- stac_init_power_map(codec);
+-
+ return 0;
+ }
+
++static int stac_build_controls(struct hda_codec *codec)
++{
++ int err = snd_hda_gen_build_controls(codec);
++
++ if (err < 0)
++ return err;
++ stac_init_power_map(codec);
++ return 0;
++}
+
+ static int stac_init(struct hda_codec *codec)
+ {
+@@ -4379,7 +4386,7 @@ static int stac_suspend(struct hda_codec *codec)
+ #endif /* CONFIG_PM */
+
+ static const struct hda_codec_ops stac_patch_ops = {
+- .build_controls = snd_hda_gen_build_controls,
++ .build_controls = stac_build_controls,
+ .build_pcms = snd_hda_gen_build_pcms,
+ .init = stac_init,
+ .free = stac_free,
+diff --git a/sound/soc/davinci/davinci-mcasp.c b/sound/soc/davinci/davinci-mcasp.c
+index 9afb14629a17..b7559bc49426 100644
+--- a/sound/soc/davinci/davinci-mcasp.c
++++ b/sound/soc/davinci/davinci-mcasp.c
+@@ -455,8 +455,17 @@ static int davinci_config_channel_size(struct davinci_mcasp *mcasp,
+ {
+ u32 fmt;
+ u32 tx_rotate = (word_length / 4) & 0x7;
+- u32 rx_rotate = (32 - word_length) / 4;
+ u32 mask = (1ULL << word_length) - 1;
++ /*
++ * For captured data we should not rotate, inversion and masking is
++ * enough to get the data to the right position:
++ * Format data from bus after reverse (XRBUF)
++ * S16_LE: |LSB|MSB|xxx|xxx| |xxx|xxx|MSB|LSB|
++ * S24_3LE: |LSB|DAT|MSB|xxx| |xxx|MSB|DAT|LSB|
++ * S24_LE: |LSB|DAT|MSB|xxx| |xxx|MSB|DAT|LSB|
++ * S32_LE: |LSB|DAT|DAT|MSB| |MSB|DAT|DAT|LSB|
++ */
++ u32 rx_rotate = 0;
+
+ /*
+ * if s BCLK-to-LRCLK ratio has been configured via the set_clkdiv()
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:16 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-10-06 11:16 UTC (permalink / raw
To: gentoo-commits
commit: 06a07c7f7ebb2c26793e4bf990975df43e6c9bf6
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:01:54 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:01:54 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=06a07c7f
Remove duplicate of multipath-tcp patch.
---
5010_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 --------------------------
1 file changed, 19230 deletions(-)
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
deleted file mode 100644
index 3000da3..0000000
--- a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+++ /dev/null
@@ -1,19230 +0,0 @@
-diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
-index 768a0fb67dd6..5a46d91a8df9 100644
---- a/drivers/infiniband/hw/cxgb4/cm.c
-+++ b/drivers/infiniband/hw/cxgb4/cm.c
-@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
- */
- memset(&tmp_opt, 0, sizeof(tmp_opt));
- tcp_clear_options(&tmp_opt);
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
-
- req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
- memset(req, 0, sizeof(*req));
-diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
-index 2faef339d8f2..d86c853ffaad 100644
---- a/include/linux/ipv6.h
-+++ b/include/linux/ipv6.h
-@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return inet_sk(__sk)->pinet6;
- }
-
--static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
--{
-- struct request_sock *req = reqsk_alloc(ops);
--
-- if (req)
-- inet_rsk(req)->pktopts = NULL;
--
-- return req;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return (struct raw6_sock *)sk;
-@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return NULL;
- }
-
--static inline struct inet6_request_sock *
-- inet6_rsk(const struct request_sock *rsk)
--{
-- return NULL;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return NULL;
-diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
-index ec89301ada41..99ea4b0e3693 100644
---- a/include/linux/skbuff.h
-+++ b/include/linux/skbuff.h
-@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
- bool zero_okay,
- __sum16 check)
- {
-- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
-- skb->csum_valid = 1;
-+ if (skb_csum_unnecessary(skb)) {
-+ return false;
-+ } else if (zero_okay && !check) {
-+ skb->ip_summed = CHECKSUM_UNNECESSARY;
- return false;
- }
-
-diff --git a/include/linux/tcp.h b/include/linux/tcp.h
-index a0513210798f..7bc2e078d6ca 100644
---- a/include/linux/tcp.h
-+++ b/include/linux/tcp.h
-@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
- /* TCP Fast Open */
- #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
- #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
--#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
-+#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
-
- /* TCP Fast Open Cookie as stored in memory */
- struct tcp_fastopen_cookie {
-@@ -72,6 +72,51 @@ struct tcp_sack_block {
- u32 end_seq;
- };
-
-+struct tcp_out_options {
-+ u16 options; /* bit field of OPTION_* */
-+ u8 ws; /* window scale, 0 to disable */
-+ u8 num_sack_blocks;/* number of SACK blocks to include */
-+ u8 hash_size; /* bytes in hash_location */
-+ u16 mss; /* 0 to disable */
-+ __u8 *hash_location; /* temporary pointer, overloaded */
-+ __u32 tsval, tsecr; /* need to include OPTION_TS */
-+ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
-+#ifdef CONFIG_MPTCP
-+ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
-+ u8 dss_csum:1,
-+ add_addr_v4:1,
-+ add_addr_v6:1; /* dss-checksum required? */
-+
-+ union {
-+ struct {
-+ __u64 sender_key; /* sender's key for mptcp */
-+ __u64 receiver_key; /* receiver's key for mptcp */
-+ } mp_capable;
-+
-+ struct {
-+ __u64 sender_truncated_mac;
-+ __u32 sender_nonce;
-+ /* random number of the sender */
-+ __u32 token; /* token for mptcp */
-+ u8 low_prio:1;
-+ } mp_join_syns;
-+ };
-+
-+ struct {
-+ struct in_addr addr;
-+ u8 addr_id;
-+ } add_addr4;
-+
-+ struct {
-+ struct in6_addr addr;
-+ u8 addr_id;
-+ } add_addr6;
-+
-+ u16 remove_addrs; /* list of address id */
-+ u8 addr_id; /* address id (mp_join or add_address) */
-+#endif /* CONFIG_MPTCP */
-+};
-+
- /*These are used to set the sack_ok field in struct tcp_options_received */
- #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
- #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
-@@ -95,6 +140,9 @@ struct tcp_options_received {
- u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
- };
-
-+struct mptcp_cb;
-+struct mptcp_tcp_sock;
-+
- static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
- {
- rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
-@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
-
- struct tcp_request_sock {
- struct inet_request_sock req;
--#ifdef CONFIG_TCP_MD5SIG
-- /* Only used by TCP MD5 Signature so far. */
- const struct tcp_request_sock_ops *af_specific;
--#endif
- struct sock *listener; /* needed for TFO */
- u32 rcv_isn;
- u32 snt_isn;
-@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
- return (struct tcp_request_sock *)req;
- }
-
-+struct tcp_md5sig_key;
-+
- struct tcp_sock {
- /* inet_connection_sock has to be the first member of tcp_sock */
- struct inet_connection_sock inet_conn;
-@@ -326,6 +373,37 @@ struct tcp_sock {
- * socket. Used to retransmit SYNACKs etc.
- */
- struct request_sock *fastopen_rsk;
-+
-+ /* MPTCP/TCP-specific callbacks */
-+ const struct tcp_sock_ops *ops;
-+
-+ struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ /* We keep these flags even if CONFIG_MPTCP is not checked, because
-+ * it allows checking MPTCP capability just by checking the mpc flag,
-+ * rather than adding ifdefs everywhere.
-+ */
-+ u16 mpc:1, /* Other end is multipath capable */
-+ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
-+ send_mp_fclose:1,
-+ request_mptcp:1, /* Did we send out an MP_CAPABLE?
-+ * (this speeds up mptcp_doit() in tcp_recvmsg)
-+ */
-+	   mptcp_enabled:1, /* Is MPTCP enabled from the application? */
-+ pf:1, /* Potentially Failed state: when this flag is set, we
-+ * stop using the subflow
-+ */
-+ mp_killed:1, /* Killed with a tcp_done in mptcp? */
-+ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
-+	   is_master_sk:1,
-+ close_it:1, /* Must close socket in mptcp_data_ready? */
-+ closing:1;
-+ struct mptcp_tcp_sock *mptcp;
-+#ifdef CONFIG_MPTCP
-+ struct hlist_nulls_node tk_table;
-+ u32 mptcp_loc_token;
-+ u64 mptcp_loc_key;
-+#endif /* CONFIG_MPTCP */
- };
-
- enum tsq_flags {
-@@ -337,6 +415,8 @@ enum tsq_flags {
- TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
- * tcp_v{4|6}_mtu_reduced()
- */
-+ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
-+ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
- };
-
- static inline struct tcp_sock *tcp_sk(const struct sock *sk)
-@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *tw_md5_key;
- #endif
-+ struct mptcp_tw *mptcp_tw;
- };
-
- static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
-diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
-index 74af137304be..83f63033897a 100644
---- a/include/net/inet6_connection_sock.h
-+++ b/include/net/inet6_connection_sock.h
-@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
-
- struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
- const struct request_sock *req);
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize);
-
- struct request_sock *inet6_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
-diff --git a/include/net/inet_common.h b/include/net/inet_common.h
-index fe7994c48b75..780f229f46a8 100644
---- a/include/net/inet_common.h
-+++ b/include/net/inet_common.h
-@@ -1,6 +1,8 @@
- #ifndef _INET_COMMON_H
- #define _INET_COMMON_H
-
-+#include <net/sock.h>
-+
- extern const struct proto_ops inet_stream_ops;
- extern const struct proto_ops inet_dgram_ops;
-
-@@ -13,6 +15,8 @@ struct sock;
- struct sockaddr;
- struct socket;
-
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
- int inet_release(struct socket *sock);
- int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
- int addr_len, int flags);
-diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
-index 7a4313887568..f62159e39839 100644
---- a/include/net/inet_connection_sock.h
-+++ b/include/net/inet_connection_sock.h
-@@ -30,6 +30,7 @@
-
- struct inet_bind_bucket;
- struct tcp_congestion_ops;
-+struct tcp_options_received;
-
- /*
- * Pointers to address related TCP functions
-@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
-
- struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
-
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize);
-+
- struct request_sock *inet_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
- const __be16 rport,
-diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
-index b1edf17bec01..6a32d8d6b85e 100644
---- a/include/net/inet_sock.h
-+++ b/include/net/inet_sock.h
-@@ -86,10 +86,14 @@ struct inet_request_sock {
- wscale_ok : 1,
- ecn_ok : 1,
- acked : 1,
-- no_srccheck: 1;
-+ no_srccheck: 1,
-+ mptcp_rqsk : 1,
-+ saw_mpc : 1;
- kmemcheck_bitfield_end(flags);
-- struct ip_options_rcu *opt;
-- struct sk_buff *pktopts;
-+ union {
-+ struct ip_options_rcu *opt;
-+ struct sk_buff *pktopts;
-+ };
- u32 ir_mark;
- };
-
-diff --git a/include/net/mptcp.h b/include/net/mptcp.h
-new file mode 100644
-index 000000000000..712780fc39e4
---- /dev/null
-+++ b/include/net/mptcp.h
-@@ -0,0 +1,1439 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_H
-+#define _MPTCP_H
-+
-+#include <linux/inetdevice.h>
-+#include <linux/ipv6.h>
-+#include <linux/list.h>
-+#include <linux/net.h>
-+#include <linux/netpoll.h>
-+#include <linux/skbuff.h>
-+#include <linux/socket.h>
-+#include <linux/tcp.h>
-+#include <linux/kernel.h>
-+
-+#include <asm/byteorder.h>
-+#include <asm/unaligned.h>
-+#include <crypto/hash.h>
-+#include <net/tcp.h>
-+
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ #define ntohll(x) be64_to_cpu(x)
-+ #define htonll(x) cpu_to_be64(x)
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ #define ntohll(x) (x)
-+ #define htonll(x) (x)
-+#endif
-+
-+struct mptcp_loc4 {
-+ u8 loc4_id;
-+ u8 low_prio:1;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_rem4 {
-+ u8 rem4_id;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_loc6 {
-+ u8 loc6_id;
-+ u8 low_prio:1;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_rem6 {
-+ u8 rem6_id;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_request_sock {
-+ struct tcp_request_sock req;
-+	/* hlist-nulls entry to the hash-table. Depending on whether this is
-+	 * a new MPTCP connection or an additional subflow, the request-socket
-+ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
-+ */
-+ struct hlist_nulls_node hash_entry;
-+
-+ union {
-+ struct {
-+ /* Only on initial subflows */
-+ u64 mptcp_loc_key;
-+ u64 mptcp_rem_key;
-+ u32 mptcp_loc_token;
-+ };
-+
-+ struct {
-+ /* Only on additional subflows */
-+ struct mptcp_cb *mptcp_mpcb;
-+ u32 mptcp_rem_nonce;
-+ u32 mptcp_loc_nonce;
-+ u64 mptcp_hash_tmac;
-+ };
-+ };
-+
-+ u8 loc_id;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 dss_csum:1,
-+ is_sub:1, /* Is this a new subflow? */
-+ low_prio:1, /* Interface set to low-prio? */
-+ rcv_low_prio:1;
-+};
-+
-+struct mptcp_options_received {
-+ u16 saw_mpc:1,
-+ dss_csum:1,
-+ drop_me:1,
-+
-+ is_mp_join:1,
-+ join_ack:1,
-+
-+ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
-+ * 0x2 - low-prio set for another subflow
-+ */
-+ low_prio:1,
-+
-+ saw_add_addr:2, /* Saw at least one add_addr option:
-+ * 0x1: IPv4 - 0x2: IPv6
-+ */
-+ more_add_addr:1, /* Saw one more add-addr. */
-+
-+ saw_rem_addr:1, /* Saw at least one rem_addr option */
-+ more_rem_addr:1, /* Saw one more rem-addr. */
-+
-+ mp_fail:1,
-+ mp_fclose:1;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 prio_addr_id; /* Address-id in the MP_PRIO */
-+
-+ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
-+ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
-+
-+ u32 data_ack;
-+ u32 data_seq;
-+ u16 data_len;
-+
-+ u32 mptcp_rem_token;/* Remote token */
-+
-+ /* Key inside the option (from mp_capable or fast_close) */
-+ u64 mptcp_key;
-+
-+ u32 mptcp_recv_nonce;
-+ u64 mptcp_recv_tmac;
-+ u8 mptcp_recv_mac[20];
-+};
-+
-+struct mptcp_tcp_sock {
-+ struct tcp_sock *next; /* Next subflow socket */
-+ struct hlist_node cb_list;
-+ struct mptcp_options_received rx_opt;
-+
-+ /* Those three fields record the current mapping */
-+ u64 map_data_seq;
-+ u32 map_subseq;
-+ u16 map_data_len;
-+ u16 slave_sk:1,
-+ fully_established:1,
-+ establish_increased:1,
-+ second_packet:1,
-+ attached:1,
-+ send_mp_fail:1,
-+ include_mpc:1,
-+ mapping_present:1,
-+ map_data_fin:1,
-+ low_prio:1, /* use this socket as backup */
-+ rcv_low_prio:1, /* Peer sent low-prio option to us */
-+ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
-+ pre_established:1; /* State between sending 3rd ACK and
-+ * receiving the fourth ack of new subflows.
-+ */
-+
-+ /* isn: needed to translate abs to relative subflow seqnums */
-+ u32 snt_isn;
-+ u32 rcv_isn;
-+ u8 path_index;
-+ u8 loc_id;
-+ u8 rem_id;
-+
-+#define MPTCP_SCHED_SIZE 4
-+ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
-+
-+ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
-+ * skb in the ofo-queue.
-+ */
-+
-+ int init_rcv_wnd;
-+ u32 infinite_cutoff_seq;
-+ struct delayed_work work;
-+ u32 mptcp_loc_nonce;
-+	struct tcp_sock	*tp; /* Back-pointer to the owning tcp_sock */
-+ u32 last_end_data_seq;
-+
-+ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
-+ struct timer_list mptcp_ack_timer;
-+
-+ /* HMAC of the third ack */
-+ char sender_mac[20];
-+};
-+
-+struct mptcp_tw {
-+ struct list_head list;
-+ u64 loc_key;
-+ u64 rcv_nxt;
-+ struct mptcp_cb __rcu *mpcb;
-+ u8 meta_tw:1,
-+ in_list:1;
-+};
-+
-+#define MPTCP_PM_NAME_MAX 16
-+struct mptcp_pm_ops {
-+ struct list_head list;
-+
-+ /* Signal the creation of a new MPTCP-session. */
-+ void (*new_session)(const struct sock *meta_sk);
-+ void (*release_sock)(struct sock *meta_sk);
-+ void (*fully_established)(struct sock *meta_sk);
-+ void (*new_remote_address)(struct sock *meta_sk);
-+ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio);
-+ void (*addr_signal)(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts, struct sk_buff *skb);
-+ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id);
-+ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
-+ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
-+ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
-+
-+ char name[MPTCP_PM_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+#define MPTCP_SCHED_NAME_MAX 16
-+struct mptcp_sched_ops {
-+ struct list_head list;
-+
-+ struct sock * (*get_subflow)(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test);
-+ struct sk_buff * (*next_segment)(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit);
-+ void (*init)(struct sock *sk);
-+
-+ char name[MPTCP_SCHED_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+struct mptcp_cb {
-+ /* list of sockets in this multipath connection */
-+ struct tcp_sock *connection_list;
-+ /* list of sockets that need a call to release_cb */
-+ struct hlist_head callback_list;
-+
-+ /* High-order bits of 64-bit sequence numbers */
-+ u32 snd_high_order[2];
-+ u32 rcv_high_order[2];
-+
-+ u16 send_infinite_mapping:1,
-+ in_time_wait:1,
-+ list_rcvd:1, /* XXX TO REMOVE */
-+ addr_signal:1, /* Path-manager wants us to call addr_signal */
-+ dss_csum:1,
-+ server_side:1,
-+ infinite_mapping_rcv:1,
-+ infinite_mapping_snd:1,
-+ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
-+ passive_close:1,
-+ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
-+ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
-+
-+ /* socket count in this connection */
-+ u8 cnt_subflows;
-+ u8 cnt_established;
-+
-+ struct mptcp_sched_ops *sched_ops;
-+
-+ struct sk_buff_head reinject_queue;
-+ /* First cache-line boundary is here minus 8 bytes. But from the
-+ * reinject-queue only the next and prev pointers are regularly
-+ * accessed. Thus, the whole data-path is on a single cache-line.
-+ */
-+
-+ u64 csum_cutoff_seq;
-+
-+ /***** Start of fields, used for connection closure */
-+ spinlock_t tw_lock;
-+ unsigned char mptw_state;
-+ u8 dfin_path_index;
-+
-+ struct list_head tw_list;
-+
-+ /***** Start of fields, used for subflow establishment and closure */
-+ atomic_t mpcb_refcnt;
-+
-+ /* Mutex needed, because otherwise mptcp_close will complain that the
-+ * socket is owned by the user.
-+ * E.g., mptcp_sub_close_wq is taking the meta-lock.
-+ */
-+ struct mutex mpcb_mutex;
-+
-+ /***** Start of fields, used for subflow establishment */
-+ struct sock *meta_sk;
-+
-+ /* Master socket, also part of the connection_list, this
-+ * socket is the one that the application sees.
-+ */
-+ struct sock *master_sk;
-+
-+ __u64 mptcp_loc_key;
-+ __u64 mptcp_rem_key;
-+ __u32 mptcp_loc_token;
-+ __u32 mptcp_rem_token;
-+
-+#define MPTCP_PM_SIZE 608
-+ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
-+ struct mptcp_pm_ops *pm_ops;
-+
-+ u32 path_index_bits;
-+ /* Next pi to pick up in case a new path becomes available */
-+ u8 next_path_index;
-+
-+ /* Original snd/rcvbuf of the initial subflow.
-+ * Used for the new subflows on the server-side to allow correct
-+ * autotuning
-+ */
-+ int orig_sk_rcvbuf;
-+ int orig_sk_sndbuf;
-+ u32 orig_window_clamp;
-+
-+ /* Timer for retransmitting SYN/ACK+MP_JOIN */
-+ struct timer_list synack_timer;
-+};
-+
-+#define MPTCP_SUB_CAPABLE 0
-+#define MPTCP_SUB_LEN_CAPABLE_SYN 12
-+#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_CAPABLE_ACK 20
-+#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
-+
-+#define MPTCP_SUB_JOIN 1
-+#define MPTCP_SUB_LEN_JOIN_SYN 12
-+#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_JOIN_SYNACK 16
-+#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
-+#define MPTCP_SUB_LEN_JOIN_ACK 24
-+#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
-+
-+#define MPTCP_SUB_DSS 2
-+#define MPTCP_SUB_LEN_DSS 4
-+#define MPTCP_SUB_LEN_DSS_ALIGN 4
-+
-+/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
-+ * as they are part of the DSS-option.
-+ * To get the total length, just add the different options together.
-+ */
-+#define MPTCP_SUB_LEN_SEQ 10
-+#define MPTCP_SUB_LEN_SEQ_CSUM 12
-+#define MPTCP_SUB_LEN_SEQ_ALIGN 12
-+
-+#define MPTCP_SUB_LEN_SEQ_64 14
-+#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
-+#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
-+
-+#define MPTCP_SUB_LEN_ACK 4
-+#define MPTCP_SUB_LEN_ACK_ALIGN 4
-+
-+#define MPTCP_SUB_LEN_ACK_64 8
-+#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
-+
-+/* This is the "default" option-length we will send out most often.
-+ * MPTCP DSS-header
-+ * 32-bit data sequence number
-+ * 32-bit data ack
-+ *
-+ * It is necessary to calculate the effective MSS we will be using when
-+ * sending data.
-+ */
-+#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
-+ MPTCP_SUB_LEN_SEQ_ALIGN + \
-+ MPTCP_SUB_LEN_ACK_ALIGN)
-+
-+#define MPTCP_SUB_ADD_ADDR 3
-+#define MPTCP_SUB_LEN_ADD_ADDR4 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6 20
-+#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
-+
-+#define MPTCP_SUB_REMOVE_ADDR 4
-+#define MPTCP_SUB_LEN_REMOVE_ADDR 4
-+
-+#define MPTCP_SUB_PRIO 5
-+#define MPTCP_SUB_LEN_PRIO 3
-+#define MPTCP_SUB_LEN_PRIO_ADDR 4
-+#define MPTCP_SUB_LEN_PRIO_ALIGN 4
-+
-+#define MPTCP_SUB_FAIL 6
-+#define MPTCP_SUB_LEN_FAIL 12
-+#define MPTCP_SUB_LEN_FAIL_ALIGN 12
-+
-+#define MPTCP_SUB_FCLOSE 7
-+#define MPTCP_SUB_LEN_FCLOSE 12
-+#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
-+
-+
-+#define OPTION_MPTCP (1 << 5)
-+
-+#ifdef CONFIG_MPTCP
-+
-+/* Used for checking if the mptcp initialization has been successful */
-+extern bool mptcp_init_failed;
-+
-+/* MPTCP options */
-+#define OPTION_TYPE_SYN (1 << 0)
-+#define OPTION_TYPE_SYNACK (1 << 1)
-+#define OPTION_TYPE_ACK (1 << 2)
-+#define OPTION_MP_CAPABLE (1 << 3)
-+#define OPTION_DATA_ACK (1 << 4)
-+#define OPTION_ADD_ADDR (1 << 5)
-+#define OPTION_MP_JOIN (1 << 6)
-+#define OPTION_MP_FAIL (1 << 7)
-+#define OPTION_MP_FCLOSE (1 << 8)
-+#define OPTION_REMOVE_ADDR (1 << 9)
-+#define OPTION_MP_PRIO (1 << 10)
-+
-+/* MPTCP flags: both TX and RX */
-+#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
-+#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
-+#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
-+/* MPTCP flags: RX only */
-+#define MPTCPHDR_ACK 0x08
-+#define MPTCPHDR_SEQ64_SET	0x10 /* Did we receive a 64-bit seq number? */
-+#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
-+#define MPTCPHDR_DSS_CSUM 0x40
-+#define MPTCPHDR_JOIN 0x80
-+/* MPTCP flags: TX only */
-+#define MPTCPHDR_INF 0x08
-+
-+struct mptcp_option {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_capable {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+ __u8 h:1,
-+ rsv:5,
-+ b:1,
-+ a:1;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+ __u8 a:1,
-+ b:1,
-+ rsv:5,
-+ h:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 sender_key;
-+ __u64 receiver_key;
-+} __attribute__((__packed__));
-+
-+struct mp_join {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ u32 token;
-+ u32 nonce;
-+ } syn;
-+ struct {
-+ __u64 mac;
-+ u32 nonce;
-+ } synack;
-+ struct {
-+ __u8 mac[20];
-+ } ack;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_dss {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ A:1,
-+ a:1,
-+ M:1,
-+ m:1,
-+ F:1,
-+ rsv2:3;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:3,
-+ F:1,
-+ m:1,
-+ M:1,
-+ a:1,
-+ A:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_add_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ipver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ipver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ struct in_addr addr;
-+ __be16 port;
-+ } v4;
-+ struct {
-+ struct in6_addr addr;
-+ __be16 port;
-+ } v6;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_remove_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 rsv:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ /* list of addr_id */
-+ __u8 addrs_id;
-+};
-+
-+struct mp_fail {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __be64 data_seq;
-+} __attribute__((__packed__));
-+
-+struct mp_fclose {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 key;
-+} __attribute__((__packed__));
-+
-+struct mp_prio {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+} __attribute__((__packed__));
-+
-+static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
-+{
-+ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
-+}
-+
-+#define MPTCP_APP 2
-+
-+extern int sysctl_mptcp_enabled;
-+extern int sysctl_mptcp_checksum;
-+extern int sysctl_mptcp_debug;
-+extern int sysctl_mptcp_syn_retries;
-+
-+extern struct workqueue_struct *mptcp_wq;
-+
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ if (unlikely(sysctl_mptcp_debug)) \
-+ pr_err(__FILE__ ": " fmt, ##args); \
-+ } while (0)
-+
-+/* Iterates over all subflows */
-+#define mptcp_for_each_tp(mpcb, tp) \
-+ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
-+
-+#define mptcp_for_each_sk(mpcb, sk) \
-+ for ((sk) = (struct sock *)(mpcb)->connection_list; \
-+ sk; \
-+ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
-+
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
-+ for (__sk = (struct sock *)(__mpcb)->connection_list, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
-+ __sk; \
-+ __sk = __temp, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
-+
-+/* Iterates over all bit set to 1 in a bitset */
-+#define mptcp_for_each_bit_set(b, i) \
-+ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
-+
-+#define mptcp_for_each_bit_unset(b, i) \
-+ mptcp_for_each_bit_set(~b, i)
-+
-+extern struct lock_class_key meta_key;
-+extern struct lock_class_key meta_slock_key;
-+extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
-+
-+/* This is needed to ensure that two subsequent key/nonce-generation result in
-+ * different keys/nonces if the IPs and ports are the same.
-+ */
-+extern u32 mptcp_seed;
-+
-+#define MPTCP_HASH_SIZE 1024
-+
-+extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* Lock, protecting the two hash-tables that hold the token. Namely,
-+ * mptcp_reqsk_tk_htb and tk_hashtable
-+ */
-+extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+/* Request-sockets can be hashed in the tk_htb for collision-detection or in
-+ * the regular htb for join-connections. We need to define different NULLS
-+ * values so that we can correctly detect a request-socket that has been
-+ * recycled. See also c25eb3bfb9729.
-+ */
-+#define MPTCP_REQSK_NULLS_BASE (1U << 29)
-+
-+
-+void mptcp_data_ready(struct sock *sk);
-+void mptcp_write_space(struct sock *sk);
-+
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk);
-+void mptcp_ofo_queue(struct sock *meta_sk);
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags);
-+void mptcp_del_sock(struct sock *sk);
-+void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
-+void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
-+void mptcp_update_sndbuf(const struct tcp_sock *tp);
-+void mptcp_send_fin(struct sock *meta_sk);
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
-+bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt);
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size);
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb);
-+void mptcp_close(struct sock *meta_sk, long timeout);
-+int mptcp_doit(struct sock *sk);
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev);
-+struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt);
-+u32 __mptcp_select_window(struct sock *sk);
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+unsigned int mptcp_current_mss(struct sock *meta_sk);
-+int mptcp_select_size(const struct sock *meta_sk, bool sg);
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out);
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
-+void mptcp_fin(struct sock *meta_sk);
-+void mptcp_retransmit_timer(struct sock *meta_sk);
-+int mptcp_write_wakeup(struct sock *meta_sk);
-+void mptcp_sub_close_wq(struct work_struct *work);
-+void mptcp_sub_close(struct sock *sk, unsigned long delay);
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
-+void mptcp_fallback_meta_sk(struct sock *meta_sk);
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_ack_handler(unsigned long);
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time);
-+int mptcp_check_snd_buf(const struct tcp_sock *tp);
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb);
-+void __init mptcp_init(void);
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
-+void mptcp_destroy_sock(struct sock *sk);
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt);
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed);
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
-+void mptcp_time_wait(struct sock *sk, int state, int timeo);
-+void mptcp_disconnect(struct sock *sk);
-+bool mptcp_should_expand_sndbuf(const struct sock *sk);
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_tsq_flags(struct sock *sk);
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk);
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb);
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
-+void mptcp_hash_remove(struct tcp_sock *meta_tp);
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token);
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net);
-+void mptcp_reqsk_destructor(struct request_sock *req);
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+int mptcp_check_req(struct sk_buff *skb, struct net *net);
-+void mptcp_connect_init(struct sock *sk);
-+void mptcp_sub_force_close(struct sock *sk);
-+int mptcp_sub_len_remove_addr_align(u16 bitfield);
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb);
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
-+void mptcp_init_congestion_control(struct sock *sk);
-+
-+/* MPTCP-path-manager registration/initialization functions */
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_fallback_default(struct mptcp_cb *mpcb);
-+void mptcp_get_default_path_manager(char *name);
-+int mptcp_set_default_path_manager(const char *name);
-+extern struct mptcp_pm_ops mptcp_pm_default;
-+
-+/* MPTCP-scheduler registration/initialization functions */
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_get_default_scheduler(char *name);
-+int mptcp_set_default_scheduler(const char *name);
-+extern struct mptcp_sched_ops mptcp_sched_default;
-+
-+static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
-+ unsigned long len)
-+{
-+ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
-+ jiffies + len);
-+}
-+
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
-+{
-+ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
-+}
-+
-+static inline bool is_mptcp_enabled(const struct sock *sk)
-+{
-+ if (!sysctl_mptcp_enabled || mptcp_init_failed)
-+ return false;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return false;
-+
-+ return true;
-+}
-+
-+static inline int mptcp_pi_to_flag(int pi)
-+{
-+ return 1 << (pi - 1);
-+}
-+
-+static inline
-+struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
-+{
-+ return (struct mptcp_request_sock *)req;
-+}
-+
-+static inline
-+struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
-+{
-+ return (struct request_sock *)req;
-+}
-+
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ struct sock *sk_it;
-+
-+ if (tcp_sk(sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
-+ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
-+ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
-+ return false;
-+ }
-+
-+ return true;
-+}
-+
-+static inline void mptcp_push_pending_frames(struct sock *meta_sk)
-+{
-+	/* We check packets_out and the send-head here. TCP only checks the
-+	 * send-head. But, MPTCP also checks packets_out, as this is an
-+	 * indication that we might want to do opportunistic reinjection.
-+ */
-+ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
-+
-+ /* We don't care about the MSS, because it will be set in
-+ * mptcp_write_xmit.
-+ */
-+ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
-+ }
-+}
-+
-+static inline void mptcp_send_reset(struct sock *sk)
-+{
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk);
-+}
-+
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
-+}
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
-+}
-+
-+/* Is it a data-fin while in infinite mapping mode?
-+ * In infinite mode, a subflow-fin is in fact a data-fin.
-+ */
-+static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
-+ const struct tcp_sock *tp)
-+{
-+ return mptcp_is_data_fin(skb) ||
-+ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
-+}
-+
-+static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
-+{
-+ u64 data_seq_high = (u32)(data_seq >> 32);
-+
-+ if (mpcb->rcv_high_order[0] == data_seq_high)
-+ return 0;
-+ else if (mpcb->rcv_high_order[1] == data_seq_high)
-+ return MPTCPHDR_SEQ64_INDEX;
-+ else
-+ return MPTCPHDR_SEQ64_OFO;
-+}
-+
-+/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
-+ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
-+ */
-+static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
-+ u32 *data_seq,
-+ struct mptcp_cb *mpcb)
-+{
-+ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
-+
-+ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ if (mpcb)
-+ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
-+
-+ *data_seq = (u32)data_seq64;
-+ ptr++;
-+ } else {
-+ *data_seq = get_unaligned_be32(ptr);
-+ }
-+
-+ return ptr;
-+}
-+
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return tcp_sk(sk)->meta_sk;
-+}
-+
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tcp_sk(tp->meta_sk);
-+}
-+
-+static inline int is_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tp->mpcb && mptcp_meta_tp(tp) == tp;
-+}
-+
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
-+ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
-+}
-+
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
-+}
-+
-+static inline void mptcp_hash_request_remove(struct request_sock *req)
-+{
-+ int in_softirq = 0;
-+
-+ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
-+ return;
-+
-+ if (in_softirq()) {
-+ spin_lock(&mptcp_reqsk_hlock);
-+ in_softirq = 1;
-+ } else {
-+ spin_lock_bh(&mptcp_reqsk_hlock);
-+ }
-+
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+
-+ if (in_softirq)
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ else
-+ spin_unlock_bh(&mptcp_reqsk_hlock);
-+}
-+
-+static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
-+{
-+ mopt->saw_mpc = 0;
-+ mopt->dss_csum = 0;
-+ mopt->drop_me = 0;
-+
-+ mopt->is_mp_join = 0;
-+ mopt->join_ack = 0;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->low_prio = 0;
-+
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp)
-+{
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->join_ack = 0;
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
-+ const struct mptcp_cb *mpcb)
-+{
-+ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
-+ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
-+}
-+
-+static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
-+ u32 data_seq_32)
-+{
-+ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
-+}
-+
-+static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
-+{
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_nxt);
-+}
-+
-+static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
-+{
-+ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
-+ }
-+}
-+
-+static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
-+ u32 old_rcv_nxt)
-+{
-+ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
-+ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
-+ }
-+}
-+
-+static inline int mptcp_sk_can_send(const struct sock *sk)
-+{
-+ return tcp_passive_fastopen(sk) ||
-+ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
-+ !tcp_sk(sk)->mptcp->pre_established);
-+}
-+
-+static inline int mptcp_sk_can_recv(const struct sock *sk)
-+{
-+ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
-+}
-+
-+static inline int mptcp_sk_can_send_ack(const struct sock *sk)
-+{
-+ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
-+ TCPF_CLOSE | TCPF_LISTEN)) &&
-+ !tcp_sk(sk)->mptcp->pre_established;
-+}
-+
-+/* Only support GSO if all subflows support it */
-+static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!sk_can_gso(sk))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!(sk->sk_route_caps & NETIF_F_SG))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline void mptcp_set_rto(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *sk_it;
-+ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
-+ __u32 max_rto = 0;
-+
-+ /* We are in recovery-phase on the MPTCP-level. Do not update the
-+ * RTO, because this would kill exponential backoff.
-+ */
-+ if (micsk->icsk_retransmits)
-+ return;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send(sk_it) &&
-+ inet_csk(sk_it)->icsk_rto > max_rto)
-+ max_rto = inet_csk(sk_it)->icsk_rto;
-+ }
-+ if (max_rto) {
-+ micsk->icsk_rto = max_rto << 1;
-+
-+ /* A successful rto-measurement - reset backoff counter */
-+ micsk->icsk_backoff = 0;
-+ }
-+}
-+
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return sysctl_mptcp_syn_retries;
-+}
-+
-+static inline void mptcp_sub_close_passive(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
-+
-+ /* Only close, if the app did a send-shutdown (passive close), and we
-+ * received the data-ack of the data-fin.
-+ */
-+ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
-+ mptcp_sub_close(sk, 0);
-+}
-+
-+static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If data has been acknowledged on the meta-level, fully_established
-+ * will have been set before and thus we will not fall back to infinite
-+ * mapping.
-+ */
-+ if (likely(tp->mptcp->fully_established))
-+ return false;
-+
-+ if (!(flag & MPTCP_FLAG_DATA_ACKED))
-+ return false;
-+
-+ /* Don't fallback twice ;) */
-+ if (tp->mpcb->infinite_mapping_snd)
-+ return false;
-+
-+ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
-+ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
-+ __builtin_return_address(0));
-+ if (!is_master_tp(tp))
-+ return true;
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+
-+ return false;
-+}
-+
-+/* Find the first index whose bit in the bit-field == 0 */
-+static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
-+{
-+ u8 base = mpcb->next_path_index;
-+ int i;
-+
-+ /* Start at 1, because 0 is reserved for the meta-sk */
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
-+ if (i + base < 1)
-+ continue;
-+ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ i += base;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
-+ if (i >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ if (i < 1)
-+ continue;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
-+{
-+ return sk->sk_family == AF_INET6 &&
-+ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
-+}
-+
-+/* TCP and MPTCP mpc flag-depending functions */
-+u16 mptcp_select_window(struct sock *sk);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_tcp_set_rto(struct sock *sk);
-+
-+/* TCP and MPTCP flag-depending functions */
-+bool mptcp_prune_ofo_queue(struct sock *sk);
-+
-+#else /* CONFIG_MPTCP */
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ } while (0)
-+
-+/* Without MPTCP, we just do one iteration
-+ * over the only socket available. This assumes that
-+ * the sk/tp arg is the socket in that case.
-+ */
-+#define mptcp_for_each_sk(mpcb, sk)
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return NULL;
-+}
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return NULL;
-+}
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_del_sock(const struct sock *sk) {}
-+static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
-+static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
-+static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
-+static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
-+ const struct sock *sk) {}
-+static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
-+static inline void mptcp_set_rto(const struct sock *sk) {}
-+static inline void mptcp_send_fin(const struct sock *meta_sk) {}
-+static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_syn_options(const struct sock *sk,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+static inline void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+
-+static inline void mptcp_established_options(struct sock *sk,
-+ struct sk_buff *skb,
-+ struct tcp_out_options *opts,
-+ unsigned *size) {}
-+static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb) {}
-+static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
-+static inline int mptcp_doit(struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_req_fastopen(struct sock *child,
-+ struct request_sock *req)
-+{
-+ return 1;
-+}
-+static inline int mptcp_check_req_master(const struct sock *sk,
-+ const struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ return 1;
-+}
-+static inline struct sock *mptcp_check_req_child(struct sock *sk,
-+ struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return NULL;
-+}
-+static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ return 0;
-+}
-+static inline void mptcp_sub_close_passive(struct sock *sk) {}
-+static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
-+{
-+ return false;
-+}
-+static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
-+static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return 0;
-+}
-+static inline void mptcp_send_reset(const struct sock *sk) {}
-+static inline int mptcp_handle_options(struct sock *sk,
-+ const struct tcphdr *th,
-+ struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
-+static inline void __init mptcp_init(void) {}
-+static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_sk_can_gso(const struct sock *sk)
-+{
-+ return false;
-+}
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ return false;
-+}
-+static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
-+ u32 mss_now, int large_allowed)
-+{
-+ return 0;
-+}
-+static inline void mptcp_destroy_sock(struct sock *sk) {}
-+static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
-+ struct sock **skptr,
-+ struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ return false;
-+}
-+static inline int mptcp_init_tw_sock(struct sock *sk,
-+ struct tcp_timewait_sock *tw)
-+{
-+ return 0;
-+}
-+static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
-+static inline void mptcp_disconnect(struct sock *sk) {}
-+static inline void mptcp_tsq_flags(struct sock *sk) {}
-+static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
-+static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct tcp_options_received *rx_opt,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_H */
-diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
-new file mode 100644
-index 000000000000..93ad97c77c5a
---- /dev/null
-+++ b/include/net/mptcp_v4.h
-@@ -0,0 +1,67 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef MPTCP_V4_H_
-+#define MPTCP_V4_H_
-+
-+
-+#include <linux/in.h>
-+#include <linux/skbuff.h>
-+#include <net/mptcp.h>
-+#include <net/request_sock.h>
-+#include <net/sock.h>
-+
-+extern struct request_sock_ops mptcp_request_sock_ops;
-+extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+#ifdef CONFIG_MPTCP
-+
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net);
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem);
-+int mptcp_pm_v4_init(void);
-+void mptcp_pm_v4_undo(void);
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+
-+#else
-+
-+static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
-+ const struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* MPTCP_V4_H_ */
-diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
-new file mode 100644
-index 000000000000..49a4f30ccd4d
---- /dev/null
-+++ b/include/net/mptcp_v6.h
-@@ -0,0 +1,69 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_V6_H
-+#define _MPTCP_V6_H
-+
-+#include <linux/in6.h>
-+#include <net/if_inet6.h>
-+
-+#include <net/mptcp.h>
-+
-+
-+#ifdef CONFIG_MPTCP
-+extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
-+extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
-+extern struct request_sock_ops mptcp6_request_sock_ops;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net);
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem);
-+int mptcp_pm_v6_init(void);
-+void mptcp_pm_v6_undo(void);
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+
-+#else /* CONFIG_MPTCP */
-+
-+#define mptcp_v6_mapped ipv6_mapped
-+
-+static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_V6_H */
-diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
-index 361d26077196..bae95a11c531 100644
---- a/include/net/net_namespace.h
-+++ b/include/net/net_namespace.h
-@@ -16,6 +16,7 @@
- #include <net/netns/packet.h>
- #include <net/netns/ipv4.h>
- #include <net/netns/ipv6.h>
-+#include <net/netns/mptcp.h>
- #include <net/netns/ieee802154_6lowpan.h>
- #include <net/netns/sctp.h>
- #include <net/netns/dccp.h>
-@@ -92,6 +93,9 @@ struct net {
- #if IS_ENABLED(CONFIG_IPV6)
- struct netns_ipv6 ipv6;
- #endif
-+#if IS_ENABLED(CONFIG_MPTCP)
-+ struct netns_mptcp mptcp;
-+#endif
- #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
- struct netns_ieee802154_lowpan ieee802154_lowpan;
- #endif
-diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
-new file mode 100644
-index 000000000000..bad418b04cc8
---- /dev/null
-+++ b/include/net/netns/mptcp.h
-@@ -0,0 +1,44 @@
-+/*
-+ * MPTCP implementation - MPTCP namespace
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef __NETNS_MPTCP_H__
-+#define __NETNS_MPTCP_H__
-+
-+#include <linux/compiler.h>
-+
-+enum {
-+ MPTCP_PM_FULLMESH = 0,
-+ MPTCP_PM_MAX
-+};
-+
-+struct netns_mptcp {
-+ void *path_managers[MPTCP_PM_MAX];
-+};
-+
-+#endif /* __NETNS_MPTCP_H__ */
-diff --git a/include/net/request_sock.h b/include/net/request_sock.h
-index 7f830ff67f08..e79e87a8e1a6 100644
---- a/include/net/request_sock.h
-+++ b/include/net/request_sock.h
-@@ -164,7 +164,7 @@ struct request_sock_queue {
- };
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries);
-+ unsigned int nr_table_entries, gfp_t flags);
-
- void __reqsk_queue_destroy(struct request_sock_queue *queue);
- void reqsk_queue_destroy(struct request_sock_queue *queue);
-diff --git a/include/net/sock.h b/include/net/sock.h
-index 156350745700..0e23cae8861f 100644
---- a/include/net/sock.h
-+++ b/include/net/sock.h
-@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
-
- int sk_wait_data(struct sock *sk, long *timeo);
-
-+/* START - needed for MPTCP */
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
-+void sock_lock_init(struct sock *sk);
-+
-+extern struct lock_class_key af_callback_keys[AF_MAX];
-+extern char *const af_family_clock_key_strings[AF_MAX+1];
-+
-+#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
-+/* END - needed for MPTCP */
-+
- struct request_sock_ops;
- struct timewait_sock_ops;
- struct inet_hashinfo;
-diff --git a/include/net/tcp.h b/include/net/tcp.h
-index 7286db80e8b8..ff92e74cd684 100644
---- a/include/net/tcp.h
-+++ b/include/net/tcp.h
-@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TCPOPT_SACK 5 /* SACK Block */
- #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
- #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
-+#define TCPOPT_MPTCP 30
- #define TCPOPT_EXP 254 /* Experimental */
- /* Magic number to be after the option value for sharing TCP
- * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
-@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TFO_SERVER_WO_SOCKOPT1 0x400
- #define TFO_SERVER_WO_SOCKOPT2 0x800
-
-+/* Flags from tcp_input.c for tcp_ack */
-+#define FLAG_DATA 0x01 /* Incoming frame contained data. */
-+#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
-+#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
-+#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
-+#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
-+#define FLAG_DATA_SACKED 0x20 /* New SACK. */
-+#define FLAG_ECE 0x40 /* ECE in this ACK */
-+#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
-+#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
-+#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
-+#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
-+#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
-+#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
-+#define MPTCP_FLAG_DATA_ACKED 0x8000
-+
-+#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
-+#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
-+#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
-+#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
-+
- extern struct inet_timewait_death_row tcp_death_row;
-
- /* sysctl variables for tcp */
-@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
- #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
- #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
-
-+/**** START - Exports needed for MPTCP ****/
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
-+
-+struct mptcp_options_received;
-+
-+void tcp_enter_quickack_mode(struct sock *sk);
-+int tcp_close_state(struct sock *sk);
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb);
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent);
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask);
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle);
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle);
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss);
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+void __pskb_trim_head(struct sk_buff *skb, int len);
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
-+void tcp_reset(struct sock *sk);
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin);
-+bool tcp_urg_mode(const struct tcp_sock *tp);
-+void tcp_ack_probe(struct sock *sk);
-+void tcp_rearm_rto(struct sock *sk);
-+int tcp_write_timeout(struct sock *sk);
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set);
-+void tcp_write_err(struct sock *sk);
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+
-+int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc);
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
-+void tcp_v4_reqsk_destructor(struct request_sock *req);
-+
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
-+void tcp_v6_destroy_sock(struct sock *sk);
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
-+void tcp_v6_hash(struct sock *sk);
-+struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb);
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst);
-+void tcp_v6_reqsk_destructor(struct request_sock *req);
-+
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-+ int large_allowed);
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
-+
-+void skb_clone_fraglist(struct sk_buff *skb);
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
-+
-+void inet_twsk_free(struct inet_timewait_sock *tw);
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
-+/* These states need RST on ABORT according to RFC793 */
-+static inline bool tcp_need_reset(int state)
-+{
-+ return (1 << state) &
-+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-+ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
-+}
-+
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-+ int hlen);
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen);
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
-+ struct sk_buff *from, bool *fragstolen);
-+/**** END - Exports needed for MPTCP ****/
-+
- void tcp_tasklet_init(void);
-
- void tcp_v4_err(struct sk_buff *skb, u32);
-@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- size_t len, int nonblock, int flags, int *addr_len);
- void tcp_parse_options(const struct sk_buff *skb,
- struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt_rx,
- int estab, struct tcp_fastopen_cookie *foc);
- const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
-
-@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
-
- u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- u16 *mssp);
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
--#else
--static inline __u32 cookie_v4_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
- #endif
-
- __u32 cookie_init_timestamp(struct request_sock *req);
-@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
- const struct tcphdr *th, u16 *mssp);
- __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
- __u16 *mss);
--#else
--static inline __u32 cookie_v6_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
- #endif
- /* tcp_output.c */
-
-@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
- void tcp_send_loss_probe(struct sock *sk);
- bool tcp_schedule_loss_probe(struct sock *sk);
-
-+u16 tcp_select_window(struct sock *sk);
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+
- /* tcp_input.c */
- void tcp_resume_early_retransmit(struct sock *sk);
- void tcp_rearm_rto(struct sock *sk);
- void tcp_reset(struct sock *sk);
-+void tcp_set_rto(struct sock *sk);
-+bool tcp_should_expand_sndbuf(const struct sock *sk);
-+bool tcp_prune_ofo_queue(struct sock *sk);
-
- /* tcp_timer.c */
- void tcp_init_xmit_timers(struct sock *);
-@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
- */
- struct tcp_skb_cb {
- union {
-- struct inet_skb_parm h4;
-+ union {
-+ struct inet_skb_parm h4;
- #if IS_ENABLED(CONFIG_IPV6)
-- struct inet6_skb_parm h6;
-+ struct inet6_skb_parm h6;
- #endif
-- } header; /* For incoming frames */
-+ } header; /* For incoming frames */
-+#ifdef CONFIG_MPTCP
-+ union { /* For MPTCP outgoing frames */
-+ __u32 path_mask; /* paths that tried to send this skb */
-+ __u32 dss[6]; /* DSS options */
-+ };
-+#endif
-+ };
- __u32 seq; /* Starting sequence number */
- __u32 end_seq; /* SEQ + FIN + SYN + datalen */
- __u32 when; /* used to compute rtt's */
-+#ifdef CONFIG_MPTCP
-+ __u8 mptcp_flags; /* flags for the MPTCP layer */
-+ __u8 dss_off; /* Number of 4-byte words until
-+ * seq-number */
-+#endif
- __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
-
- __u8 sacked; /* State flags for SACK/FACK. */
-@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
- /* Determine a window scaling and initial window to offer. */
- void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
- __u32 *window_clamp, int wscale_ok,
-- __u8 *rcv_wscale, __u32 init_rcv_wnd);
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-
- static inline int tcp_win_from_space(int space)
- {
-@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
- space - (space>>sysctl_tcp_adv_win_scale);
- }
-
-+#ifdef CONFIG_MPTCP
-+extern struct static_key mptcp_static_key;
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return static_key_false(&mptcp_static_key) && tp->mpc;
-+}
-+#else
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+#endif
-+
- /* Note: caller must be prepared to deal with negative returns */
- static inline int tcp_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf -
- atomic_read(&sk->sk_rmem_alloc));
- }
-
- static inline int tcp_full_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf);
- }
-
-@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
- ireq->wscale_ok = rx_opt->wscale_ok;
- ireq->acked = 0;
- ireq->ecn_ok = 0;
-+ ireq->mptcp_rqsk = 0;
-+ ireq->saw_mpc = 0;
- ireq->ir_rmt_port = tcp_hdr(skb)->source;
- ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
- }
-@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
- void tcp4_proc_exit(void);
- #endif
-
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb);
-+
- /* TCP af-specific functions */
- struct tcp_sock_af_ops {
- #ifdef CONFIG_TCP_MD5SIG
-@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
- #endif
- };
-
-+/* TCP/MPTCP-specific functions */
-+struct tcp_sock_ops {
-+ u32 (*__select_window)(struct sock *sk);
-+ u16 (*select_window)(struct sock *sk);
-+ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+ void (*init_buffer_space)(struct sock *sk);
-+ void (*set_rto)(struct sock *sk);
-+ bool (*should_expand_sndbuf)(const struct sock *sk);
-+ void (*send_fin)(struct sock *sk);
-+ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+ void (*send_active_reset)(struct sock *sk, gfp_t priority);
-+ int (*write_wakeup)(struct sock *sk);
-+ bool (*prune_ofo_queue)(struct sock *sk);
-+ void (*retransmit_timer)(struct sock *sk);
-+ void (*time_wait)(struct sock *sk, int state, int timeo);
-+ void (*cleanup_rbuf)(struct sock *sk, int copied);
-+ void (*init_congestion_control)(struct sock *sk);
-+};
-+extern const struct tcp_sock_ops tcp_specific;
-+
- struct tcp_request_sock_ops {
-+ u16 mss_clamp;
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
- struct request_sock *req);
-@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
- const struct request_sock *req,
- const struct sk_buff *skb);
- #endif
-+ int (*init_req)(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb);
-+#ifdef CONFIG_SYN_COOKIES
-+ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
-+#endif
-+ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict);
-+ __u32 (*init_seq)(const struct sk_buff *skb);
-+ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
-+ const unsigned long timeout);
- };
-
-+#ifdef CONFIG_SYN_COOKIES
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return ops->cookie_init_seq(sk, skb, mss);
-+}
-+#else
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return 0;
-+}
-+#endif
-+
- int tcpv4_offload_init(void);
-
- void tcp_v4_init(void);
-diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
-index 9cf2394f0bcf..c2634b6ed854 100644
---- a/include/uapi/linux/if.h
-+++ b/include/uapi/linux/if.h
-@@ -109,6 +109,9 @@ enum net_device_flags {
- #define IFF_DORMANT IFF_DORMANT
- #define IFF_ECHO IFF_ECHO
-
-+#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
-+#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
-+
- #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
- IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
-
-diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
-index 3b9718328d8b..487475681d84 100644
---- a/include/uapi/linux/tcp.h
-+++ b/include/uapi/linux/tcp.h
-@@ -112,6 +112,7 @@ enum {
- #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
- #define TCP_TIMESTAMP 24
- #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
-+#define MPTCP_ENABLED 26
-
- struct tcp_repair_opt {
- __u32 opt_code;
-diff --git a/net/Kconfig b/net/Kconfig
-index d92afe4204d9..96b58593ad5e 100644
---- a/net/Kconfig
-+++ b/net/Kconfig
-@@ -79,6 +79,7 @@ if INET
- source "net/ipv4/Kconfig"
- source "net/ipv6/Kconfig"
- source "net/netlabel/Kconfig"
-+source "net/mptcp/Kconfig"
-
- endif # if INET
-
-diff --git a/net/Makefile b/net/Makefile
-index cbbbe6d657ca..244bac1435b1 100644
---- a/net/Makefile
-+++ b/net/Makefile
-@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
- obj-$(CONFIG_XFRM) += xfrm/
- obj-$(CONFIG_UNIX) += unix/
- obj-$(CONFIG_NET) += ipv6/
-+obj-$(CONFIG_MPTCP) += mptcp/
- obj-$(CONFIG_PACKET) += packet/
- obj-$(CONFIG_NET_KEY) += key/
- obj-$(CONFIG_BRIDGE) += bridge/
-diff --git a/net/core/dev.c b/net/core/dev.c
-index 367a586d0c8a..215d2757fbf6 100644
---- a/net/core/dev.c
-+++ b/net/core/dev.c
-@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
-
- dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
- IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
-- IFF_AUTOMEDIA)) |
-+ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
- (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
- IFF_ALLMULTI));
-
-diff --git a/net/core/request_sock.c b/net/core/request_sock.c
-index 467f326126e0..909dfa13f499 100644
---- a/net/core/request_sock.c
-+++ b/net/core/request_sock.c
-@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
- EXPORT_SYMBOL(sysctl_max_syn_backlog);
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries)
-+ unsigned int nr_table_entries,
-+ gfp_t flags)
- {
- size_t lopt_size = sizeof(struct listen_sock);
- struct listen_sock *lopt;
-@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
- nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
- lopt_size += nr_table_entries * sizeof(struct request_sock *);
- if (lopt_size > PAGE_SIZE)
-- lopt = vzalloc(lopt_size);
-+ lopt = __vmalloc(lopt_size,
-+ flags | __GFP_HIGHMEM | __GFP_ZERO,
-+ PAGE_KERNEL);
- else
-- lopt = kzalloc(lopt_size, GFP_KERNEL);
-+ lopt = kzalloc(lopt_size, flags);
- if (lopt == NULL)
- return -ENOMEM;
-
-diff --git a/net/core/skbuff.c b/net/core/skbuff.c
-index c1a33033cbe2..8abc5d60fbe3 100644
---- a/net/core/skbuff.c
-+++ b/net/core/skbuff.c
-@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
- skb_drop_list(&skb_shinfo(skb)->frag_list);
- }
-
--static void skb_clone_fraglist(struct sk_buff *skb)
-+void skb_clone_fraglist(struct sk_buff *skb)
- {
- struct sk_buff *list;
-
-@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
- skb->inner_mac_header += off;
- }
-
--static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
- {
- __copy_skb_header(new, old);
-
-diff --git a/net/core/sock.c b/net/core/sock.c
-index 026e01f70274..359295523177 100644
---- a/net/core/sock.c
-+++ b/net/core/sock.c
-@@ -136,6 +136,11 @@
-
- #include <trace/events/sock.h>
-
-+#ifdef CONFIG_MPTCP
-+#include <net/mptcp.h>
-+#include <net/inet_common.h>
-+#endif
-+
- #ifdef CONFIG_INET
- #include <net/tcp.h>
- #endif
-@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
- "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
- "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
- };
--static const char *const af_family_clock_key_strings[AF_MAX+1] = {
-+char *const af_family_clock_key_strings[AF_MAX+1] = {
- "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
- "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
- "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
-@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
- * sk_callback_lock locking rules are per-address-family,
- * so split the lock classes by using a per-AF key:
- */
--static struct lock_class_key af_callback_keys[AF_MAX];
-+struct lock_class_key af_callback_keys[AF_MAX];
-
- /* Take into consideration the size of the struct sk_buff overhead in the
- * determination of these values, since that is non-constant across
-@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
- }
- }
-
--#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
--
- static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
- {
- if (sk->sk_flags & flags) {
-@@ -1253,8 +1256,25 @@ lenout:
- *
- * (We also register the sk_lock with the lock validator.)
- */
--static inline void sock_lock_init(struct sock *sk)
--{
-+void sock_lock_init(struct sock *sk)
-+{
-+#ifdef CONFIG_MPTCP
-+ /* Reclassify the lock-class for subflows */
-+ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
-+ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
-+ &meta_slock_key,
-+ "sk_lock-AF_INET-MPTCP",
-+ &meta_key);
-+
-+ /* We don't yet have the mptcp-point.
-+ * Thus we still need inet_sock_destruct
-+ */
-+ sk->sk_destruct = inet_sock_destruct;
-+ return;
-+ }
-+#endif
-+
- sock_lock_init_class_and_name(sk,
- af_family_slock_key_strings[sk->sk_family],
- af_family_slock_keys + sk->sk_family,
-@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
- }
- EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
-
--static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
- int family)
- {
- struct sock *sk;
-diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
-index 4db3c2a1679c..04cb17d4b0ce 100644
---- a/net/dccp/ipv6.c
-+++ b/net/dccp/ipv6.c
-@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
- goto drop;
-
-- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
- if (req == NULL)
- goto drop;
-
-diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..630434db0085 100644
---- a/net/ipv4/Kconfig
-+++ b/net/ipv4/Kconfig
-@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
- For further details see:
- http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
-
-+config TCP_CONG_COUPLED
-+ tristate "MPTCP COUPLED CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Coupled Congestion Control
-+ To enable it, just put 'coupled' in tcp_congestion_control
-+
-+config TCP_CONG_OLIA
-+ tristate "MPTCP Opportunistic Linked Increase"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Opportunistic Linked Increase Congestion Control
-+ To enable it, just put 'olia' in tcp_congestion_control
-+
-+config TCP_CONG_WVEGAS
-+ tristate "MPTCP WVEGAS CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ wVegas congestion control for MPTCP
-+ To enable it, just put 'wvegas' in tcp_congestion_control
-+
- choice
- prompt "Default TCP congestion control"
- default DEFAULT_CUBIC
-@@ -584,6 +608,15 @@ choice
- config DEFAULT_WESTWOOD
- bool "Westwood" if TCP_CONG_WESTWOOD=y
-
-+ config DEFAULT_COUPLED
-+ bool "Coupled" if TCP_CONG_COUPLED=y
-+
-+ config DEFAULT_OLIA
-+ bool "Olia" if TCP_CONG_OLIA=y
-+
-+ config DEFAULT_WVEGAS
-+ bool "Wvegas" if TCP_CONG_WVEGAS=y
-+
- config DEFAULT_RENO
- bool "Reno"
-
-@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
- default "vegas" if DEFAULT_VEGAS
- default "westwood" if DEFAULT_WESTWOOD
- default "veno" if DEFAULT_VENO
-+ default "coupled" if DEFAULT_COUPLED
-+ default "wvegas" if DEFAULT_WVEGAS
- default "reno" if DEFAULT_RENO
- default "cubic"
-
-diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
-index d156b3c5f363..4afd6d8d9028 100644
---- a/net/ipv4/af_inet.c
-+++ b/net/ipv4/af_inet.c
-@@ -104,6 +104,7 @@
- #include <net/ip_fib.h>
- #include <net/inet_connection_sock.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/ping.h>
-@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
- * Create an inet socket.
- */
-
--static int inet_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct sock *sk;
- struct inet_protosw *answer;
-@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
- lock_sock(sk2);
-
- sock_rps_record_flow(sk2);
-+
-+ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
-+ struct sock *sk_it = sk2;
-+
-+ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+
-+ if (tcp_sk(sk2)->mpcb->master_sk) {
-+ sk_it = tcp_sk(sk2)->mpcb->master_sk;
-+
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_it->sk_wq = newsock->wq;
-+ sk_it->sk_socket = newsock;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+ }
-+
- WARN_ON(!((1 << sk2->sk_state) &
- (TCPF_ESTABLISHED | TCPF_SYN_RECV |
- TCPF_CLOSE_WAIT | TCPF_CLOSE)));
-@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
-
- ip_init();
-
-+ /* We must initialize MPTCP before TCP. */
-+ mptcp_init();
-+
- tcp_v4_init();
-
- /* Setup TCP slab cache for open requests. */
-diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
-index 14d02ea905b6..7d734d8af19b 100644
---- a/net/ipv4/inet_connection_sock.c
-+++ b/net/ipv4/inet_connection_sock.c
-@@ -23,6 +23,7 @@
- #include <net/route.h>
- #include <net/tcp_states.h>
- #include <net/xfrm.h>
-+#include <net/mptcp.h>
-
- #ifdef INET_CSK_DEBUG
- const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-@@ -465,8 +466,8 @@ no_route:
- }
- EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
-
--static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize)
- {
- return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
- }
-@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
-
- lopt->clock_hand = i;
-
-- if (lopt->qlen)
-+ if (lopt->qlen && !is_meta_sk(parent))
- inet_csk_reset_keepalive_timer(parent, interval);
- }
- EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
-@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
- const struct request_sock *req,
- const gfp_t priority)
- {
-- struct sock *newsk = sk_clone_lock(sk, priority);
-+ struct sock *newsk;
-+
-+ newsk = sk_clone_lock(sk, priority);
-
- if (newsk != NULL) {
- struct inet_connection_sock *newicsk = inet_csk(newsk);
-@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
- {
- struct inet_sock *inet = inet_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
-+ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
-+ GFP_KERNEL);
-
- if (rc != 0)
- return rc;
-@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
-
- while ((req = acc_req) != NULL) {
- struct sock *child = req->sk;
-+ bool mutex_taken = false;
-
- acc_req = req->dl_next;
-
-+ if (is_meta_sk(child)) {
-+ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
-+ mutex_taken = true;
-+ }
- local_bh_disable();
- bh_lock_sock(child);
- WARN_ON(sock_owned_by_user(child));
-@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
-
- bh_unlock_sock(child);
- local_bh_enable();
-+ if (mutex_taken)
-+ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
- sock_put(child);
-
- sk_acceptq_removed(sk);
-diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
-index c86624b36a62..0ff3fe004d62 100644
---- a/net/ipv4/syncookies.c
-+++ b/net/ipv4/syncookies.c
-@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- }
- EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
-
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mssp)
- {
- const struct iphdr *iph = ip_hdr(skb);
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
- /* Try to redo what tcp_v4_send_synack did. */
- req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
-
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(&rt->dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(&rt->dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
-index 9d2118e5fbc7..2cb89f886d45 100644
---- a/net/ipv4/tcp.c
-+++ b/net/ipv4/tcp.c
-@@ -271,6 +271,7 @@
-
- #include <net/icmp.h>
- #include <net/inet_common.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/xfrm.h>
- #include <net/ip.h>
-@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
- return period;
- }
-
-+const struct tcp_sock_ops tcp_specific = {
-+ .__select_window = __tcp_select_window,
-+ .select_window = tcp_select_window,
-+ .select_initial_window = tcp_select_initial_window,
-+ .init_buffer_space = tcp_init_buffer_space,
-+ .set_rto = tcp_set_rto,
-+ .should_expand_sndbuf = tcp_should_expand_sndbuf,
-+ .init_congestion_control = tcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
- /* Address-family independent initialization for a tcp_sock.
- *
- * NOTE: A lot of things set to zero explicitly by call to
-@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
- sk->sk_sndbuf = sysctl_tcp_wmem[1];
- sk->sk_rcvbuf = sysctl_tcp_rmem[1];
-
-+ tp->ops = &tcp_specific;
-+
- local_bh_disable();
- sock_update_memcg(sk);
- sk_sockets_allocated_inc(sk);
-@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
- int ret;
-
- sock_rps_record_flow(sk);
-+
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tcp_sk(sk))) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
- /*
- * We can't seek on a socket input
- */
-@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
- return NULL;
- }
-
--static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-- int large_allowed)
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 xmit_size_goal, old_size_goal;
-@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
- {
- int mss_now;
-
-- mss_now = tcp_current_mss(sk);
-- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ if (mptcp(tcp_sk(sk))) {
-+ mss_now = mptcp_current_mss(sk);
-+ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ } else {
-+ mss_now = tcp_current_mss(sk);
-+ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ }
-
- return mss_now;
- }
-@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto out_err;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+
-+ /* We must check this with socket-lock hold because we iterate
-+ * over the subflows.
-+ */
-+ if (!mptcp_can_sendpage(sk)) {
-+ ssize_t ret;
-+
-+ release_sock(sk);
-+ ret = sock_no_sendpage(sk->sk_socket, page, offset,
-+ size, flags);
-+ lock_sock(sk);
-+ return ret;
-+ }
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
-
- mss_now = tcp_send_mss(sk, &size_goal, flags);
-@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
- {
- ssize_t res;
-
-- if (!(sk->sk_route_caps & NETIF_F_SG) ||
-- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-+ /* If MPTCP is enabled, we check it later after establishment */
-+ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
-+ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
- return sock_no_sendpage(sk->sk_socket, page, offset, size,
- flags);
-
-@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
- const struct tcp_sock *tp = tcp_sk(sk);
- int tmp = tp->mss_cache;
-
-+ if (mptcp(tp))
-+ return mptcp_select_size(sk, sg);
-+
- if (sg) {
- if (sk_can_gso(sk)) {
- /* Small frames wont use a full page:
-@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto do_error;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- if (unlikely(tp->repair)) {
- if (tp->repair_queue == TCP_RECV_QUEUE) {
- copied = tcp_send_rcvq(sk, msg, size);
-@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
- goto out_err;
-
-- sg = !!(sk->sk_route_caps & NETIF_F_SG);
-+ if (mptcp(tp))
-+ sg = mptcp_can_sg(sk);
-+ else
-+ sg = !!(sk->sk_route_caps & NETIF_F_SG);
-
- while (--iovlen >= 0) {
- size_t seglen = iov->iov_len;
-@@ -1183,8 +1251,15 @@ new_segment:
-
- /*
- * Check whether we can use HW checksum.
-+ *
-+ * If dss-csum is enabled, we do not do hw-csum.
-+ * In case of non-mptcp we check the
-+ * device-capabilities.
-+ * In case of mptcp, hw-csum's will be handled
-+ * later in mptcp_write_xmit.
- */
-- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
-+ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
-+ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
- skb->ip_summed = CHECKSUM_PARTIAL;
-
- skb_entail(sk, skb);
-@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
-
- /* Optimize, __tcp_select_window() is not cheap. */
- if (2*rcv_window_now <= tp->window_clamp) {
-- __u32 new_window = __tcp_select_window(sk);
-+ __u32 new_window = tp->ops->__select_window(sk);
-
- /* Send ACK now, if this read freed lots of space
- * in our buffer. Certainly, new_window is new window.
-@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
- /* Clean up data we have read: This will do ACK frames. */
- if (copied > 0) {
- tcp_recv_skb(sk, seq, &offset);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- }
- return copied;
- }
-@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-
- lock_sock(sk);
-
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tp)) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
-+
- err = -ENOTCONN;
- if (sk->sk_state == TCP_LISTEN)
- goto out;
-@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- }
- }
-
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
- /* Install new reader */
-@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (tp->rcv_wnd == 0 &&
- !skb_queue_empty(&sk->sk_async_wait_queue)) {
- tcp_service_net_dma(sk, true);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- } else
- dma_async_issue_pending(tp->ucopy.dma_chan);
- }
-@@ -1993,7 +2076,7 @@ skip_copy:
- */
-
- /* Clean up data we have read: This will do ACK frames. */
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- release_sock(sk);
- return copied;
-@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
- /* TCP_CLOSING */ TCP_CLOSING,
- };
-
--static int tcp_close_state(struct sock *sk)
-+int tcp_close_state(struct sock *sk)
- {
- int next = (int)new_state[sk->sk_state];
- int ns = next & TCP_STATE_MASK;
-@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
- TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
- /* Clear out any half completed packets. FIN if needed. */
- if (tcp_close_state(sk))
-- tcp_send_fin(sk);
-+ tcp_sk(sk)->ops->send_fin(sk);
- }
- }
- EXPORT_SYMBOL(tcp_shutdown);
-@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
- int data_was_unread = 0;
- int state;
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_close(sk, timeout);
-+ return;
-+ }
-+
- lock_sock(sk);
- sk->sk_shutdown = SHUTDOWN_MASK;
-
-@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
- /* Unread data was tossed, zap the connection. */
- NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, sk->sk_allocation);
-+ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
- } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
- /* Check zero linger _after_ checking for unread data. */
- sk->sk_prot->disconnect(sk, 0);
-@@ -2247,7 +2335,7 @@ adjudge_to_death:
- struct tcp_sock *tp = tcp_sk(sk);
- if (tp->linger2 < 0) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONLINGER);
- } else {
-@@ -2257,7 +2345,8 @@ adjudge_to_death:
- inet_csk_reset_keepalive_timer(sk,
- tmo - TCP_TIMEWAIT_LEN);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
-+ tmo);
- goto out;
- }
- }
-@@ -2266,7 +2355,7 @@ adjudge_to_death:
- sk_mem_reclaim(sk);
- if (tcp_check_oom(sk, 0)) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONMEMORY);
- }
-@@ -2291,15 +2380,6 @@ out:
- }
- EXPORT_SYMBOL(tcp_close);
-
--/* These states need RST on ABORT according to RFC793 */
--
--static inline bool tcp_need_reset(int state)
--{
-- return (1 << state) &
-- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
--}
--
- int tcp_disconnect(struct sock *sk, int flags)
- {
- struct inet_sock *inet = inet_sk(sk);
-@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
- /* The last check adjusts for discrepancy of Linux wrt. RFC
- * states
- */
-- tcp_send_active_reset(sk, gfp_any());
-+ tp->ops->send_active_reset(sk, gfp_any());
- sk->sk_err = ECONNRESET;
- } else if (old_state == TCP_SYN_SENT)
- sk->sk_err = ECONNRESET;
-@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
- if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
- inet_reset_saddr(sk);
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_disconnect(sk);
-+ } else {
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove_bh(tp);
-+ }
-+
- sk->sk_shutdown = 0;
- sock_reset_flag(sk, SOCK_DONE);
- tp->srtt_us = 0;
-@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- break;
-
- case TCP_DEFER_ACCEPT:
-+ /* An established MPTCP-connection (mptcp(tp) only returns true
-+ * if the socket is established) should not use DEFER on new
-+ * subflows.
-+ */
-+ if (mptcp(tp))
-+ break;
- /* Translate value in seconds to number of retransmits */
- icsk->icsk_accept_queue.rskq_defer_accept =
- secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
-@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
- inet_csk_ack_scheduled(sk)) {
- icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
-- tcp_cleanup_rbuf(sk, 1);
-+ tp->ops->cleanup_rbuf(sk, 1);
- if (!(val & 1))
- icsk->icsk_ack.pingpong = 1;
- }
-@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- tp->notsent_lowat = val;
- sk->sk_write_space(sk);
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
-+ if (val)
-+ tp->mptcp_enabled = 1;
-+ else
-+ tp->mptcp_enabled = 0;
-+ } else {
-+ err = -EPERM;
-+ }
-+ break;
-+#endif
- default:
- err = -ENOPROTOOPT;
- break;
-@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
- case TCP_NOTSENT_LOWAT:
- val = tp->notsent_lowat;
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ val = tp->mptcp_enabled;
-+ break;
-+#endif
- default:
- return -ENOPROTOOPT;
- }
-@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
- if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
-
-+ WARN_ON(sk->sk_state == TCP_CLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-+
- tcp_clear_xmit_timers(sk);
-+
- if (req != NULL)
- reqsk_fastopen_remove(sk, req, false);
-
-diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
-index 9771563ab564..5c230d96c4c1 100644
---- a/net/ipv4/tcp_fastopen.c
-+++ b/net/ipv4/tcp_fastopen.c
-@@ -7,6 +7,7 @@
- #include <linux/rculist.h>
- #include <net/inetpeer.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-
- int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
-
-@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- {
- struct tcp_sock *tp;
- struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
-- struct sock *child;
-+ struct sock *child, *meta_sk;
-
- req->num_retrans = 0;
- req->num_timeout = 0;
-@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- /* Add the child socket directly into the accept queue */
- inet_csk_reqsk_queue_add(sk, req, child);
-
-- /* Now finish processing the fastopen child socket. */
-- inet_csk(child)->icsk_af_ops->rebuild_header(child);
-- tcp_init_congestion_control(child);
-- tcp_mtup_init(child);
-- tcp_init_metrics(child);
-- tcp_init_buffer_space(child);
--
- /* Queue the data carried in the SYN packet. We need to first
- * bump skb's refcnt because the caller will attempt to free it.
- *
-@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- tp->syn_data_acked = 1;
- }
- tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+
-+ meta_sk = child;
-+ if (!mptcp_check_req_fastopen(meta_sk, req)) {
-+ child = tcp_sk(meta_sk)->mpcb->master_sk;
-+ tp = tcp_sk(child);
-+ }
-+
-+ /* Now finish processing the fastopen child socket. */
-+ inet_csk(child)->icsk_af_ops->rebuild_header(child);
-+ tp->ops->init_congestion_control(child);
-+ tcp_mtup_init(child);
-+ tcp_init_metrics(child);
-+ tp->ops->init_buffer_space(child);
-+
- sk->sk_data_ready(sk);
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- WARN_ON(req->sk == NULL);
- return true;
-diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
-index 40639c288dc2..3273bb69f387 100644
---- a/net/ipv4/tcp_input.c
-+++ b/net/ipv4/tcp_input.c
-@@ -74,6 +74,9 @@
- #include <linux/ipsec.h>
- #include <asm/unaligned.h>
- #include <net/netdma.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-
- int sysctl_tcp_timestamps __read_mostly = 1;
- int sysctl_tcp_window_scaling __read_mostly = 1;
-@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
- int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
- int sysctl_tcp_early_retrans __read_mostly = 3;
-
--#define FLAG_DATA 0x01 /* Incoming frame contained data. */
--#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
--#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
--#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
--#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
--#define FLAG_DATA_SACKED 0x20 /* New SACK. */
--#define FLAG_ECE 0x40 /* ECE in this ACK */
--#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
--#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
--#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
--#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
--#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
--#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
--
--#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
--#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
--#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
--#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
--
- #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
- #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
-
-@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
- icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
- }
-
--static void tcp_enter_quickack_mode(struct sock *sk)
-+void tcp_enter_quickack_mode(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- tcp_incr_quickack(sk);
-@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
- per_mss = roundup_pow_of_two(per_mss) +
- SKB_DATA_ALIGN(sizeof(struct sk_buff));
-
-- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ if (mptcp(tp)) {
-+ nr_segs = mptcp_check_snd_buf(tp);
-+ } else {
-+ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-+ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ }
-
- /* Fast Recovery (RFC 5681 3.2) :
- * Cubic needs 1.7 factor, rounded to 2 to include
-@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
- */
- sndmem = 2 * nr_segs * per_mss;
-
-- if (sk->sk_sndbuf < sndmem)
-+ /* MPTCP: after this sndmem is the new contribution of the
-+ * current subflow to the aggregated sndbuf */
-+ if (sk->sk_sndbuf < sndmem) {
-+ int old_sndbuf = sk->sk_sndbuf;
- sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-+ /* MPTCP: ok, the subflow sndbuf has grown, reflect
-+ * this in the aggregate buffer. */
-+ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
-+ mptcp_update_sndbuf(tp);
-+ }
- }
-
- /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
-@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
- static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-
- /* Check #1 */
-- if (tp->rcv_ssthresh < tp->window_clamp &&
-- (int)tp->rcv_ssthresh < tcp_space(sk) &&
-+ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
-+ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
- !sk_under_memory_pressure(sk)) {
- int incr;
-
-@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- * will fit to rcvbuf in future.
- */
- if (tcp_win_from_space(skb->truesize) <= skb->len)
-- incr = 2 * tp->advmss;
-+ incr = 2 * meta_tp->advmss;
- else
-- incr = __tcp_grow_window(sk, skb);
-+ incr = __tcp_grow_window(meta_sk, skb);
-
- if (incr) {
- incr = max_t(int, incr, 2 * skb->len);
-- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
-- tp->window_clamp);
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
-+ meta_tp->window_clamp);
- inet_csk(sk)->icsk_ack.quick |= 1;
- }
- }
-@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
- int copied;
-
- time = tcp_time_stamp - tp->rcvq_space.time;
-- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
-+ if (mptcp(tp)) {
-+ if (mptcp_check_rtt(tp, time))
-+ return;
-+ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
- return;
-
- /* Number of bytes copied to user in last RTT */
-@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
- /* Calculate rto without backoff. This is the second half of Van Jacobson's
- * routine referred to above.
- */
--static void tcp_set_rto(struct sock *sk)
-+void tcp_set_rto(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- /* Old crap is replaced with new one. 8)
-@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
- int len;
- int in_sack;
-
-- if (!sk_can_gso(sk))
-+ /* For MPTCP we cannot shift skb-data and remove one skb from the
-+ * send-queue, because this will make us lose the DSS-option (which
-+ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
-+ */
-+ if (!sk_can_gso(sk) || mptcp(tp))
- goto fallback;
-
- /* Normally R but no L won't result in plain S */
-@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
- return false;
-
- tcp_rtt_estimator(sk, seq_rtt_us);
-- tcp_set_rto(sk);
-+ tp->ops->set_rto(sk);
-
- /* RFC6298: only reset backoff on valid RTT measurement. */
- inet_csk(sk)->icsk_backoff = 0;
-@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
- }
-
- /* If we get here, the whole TSO packet has not been acked. */
--static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 packets_acked;
-@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- */
- if (!(scb->tcp_flags & TCPHDR_SYN)) {
- flag |= FLAG_DATA_ACKED;
-+ if (mptcp(tp) && mptcp_is_data_seq(skb))
-+ flag |= MPTCP_FLAG_DATA_ACKED;
- } else {
- flag |= FLAG_SYN_ACKED;
- tp->retrans_stamp = 0;
-@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- return flag;
- }
-
--static void tcp_ack_probe(struct sock *sk)
-+void tcp_ack_probe(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
- /* Check that window update is acceptable.
- * The function assumes that snd_una<=ack<=snd_next.
- */
--static inline bool tcp_may_update_window(const struct tcp_sock *tp,
-- const u32 ack, const u32 ack_seq,
-- const u32 nwin)
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin)
- {
- return after(ack, tp->snd_una) ||
- after(ack_seq, tp->snd_wl1) ||
-@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
- }
-
- /* This routine deals with incoming acks, but not outgoing ones. */
--static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
-+static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
- sack_rtt_us);
- acked -= tp->packets_out;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_fallback_infinite(sk, flag)) {
-+ pr_err("%s resetting flow\n", __func__);
-+ mptcp_send_reset(sk);
-+ goto invalid_ack;
-+ }
-+
-+ mptcp_clean_rtx_infinite(skb, sk);
-+ }
-+
- /* Advance cwnd if state allows */
- if (tcp_may_raise_cwnd(sk, flag))
- tcp_cong_avoid(sk, ack, acked);
-@@ -3512,8 +3528,9 @@ old_ack:
- * the fast version below fails.
- */
- void tcp_parse_options(const struct sk_buff *skb,
-- struct tcp_options_received *opt_rx, int estab,
-- struct tcp_fastopen_cookie *foc)
-+ struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt,
-+ int estab, struct tcp_fastopen_cookie *foc)
- {
- const unsigned char *ptr;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
- */
- break;
- #endif
-+ case TCPOPT_MPTCP:
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ break;
- case TCPOPT_EXP:
- /* Fast Open option shares code 254 using a
- * 16 bits magic number. It's valid only in
-@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
- if (tcp_parse_aligned_timestamp(tp, th))
- return true;
- }
--
-- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
-+ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
-+ 1, NULL);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
- dst = __sk_dst_get(sk);
- if (!dst || !dst_metric(dst, RTAX_QUICKACK))
- inet_csk(sk)->icsk_ack.pingpong = 1;
-+ if (mptcp(tp))
-+ mptcp_sub_close_passive(sk);
- break;
-
- case TCP_CLOSE_WAIT:
-@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
- tcp_set_state(sk, TCP_CLOSING);
- break;
- case TCP_FIN_WAIT2:
-+ if (mptcp(tp)) {
-+ /* The socket will get closed by mptcp_data_ready.
-+ * We first have to process all data-sequences.
-+ */
-+ tp->close_it = 1;
-+ break;
-+ }
- /* Received a FIN -- send ACK and enter TIME_WAIT. */
- tcp_send_ack(sk);
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- break;
- default:
- /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
- if (!sock_flag(sk, SOCK_DEAD)) {
- sk->sk_state_change(sk);
-
-+ /* Don't wake up MPTCP-subflows */
-+ if (mptcp(tp))
-+ return;
-+
- /* Do not send POLL_HUP for half duplex close. */
- if (sk->sk_shutdown == SHUTDOWN_MASK ||
- sk->sk_state == TCP_CLOSE)
-@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
- tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
- }
-
-- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
-+ /* In case of MPTCP, the segment may be empty if it's a
-+ * non-data DATA_FIN. (see beginning of tcp_data_queue)
-+ */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
-+ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
- SOCK_DEBUG(sk, "ofo packet was already received\n");
- __skb_unlink(skb, &tp->out_of_order_queue);
- __kfree_skb(skb);
-@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
- }
- }
-
--static bool tcp_prune_ofo_queue(struct sock *sk);
- static int tcp_prune_queue(struct sock *sk);
-
- static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- unsigned int size)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = mptcp_meta_sk(sk);
-+
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
- !sk_rmem_schedule(sk, skb, size)) {
-
-@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size)) {
-- if (!tcp_prune_ofo_queue(sk))
-+ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size))
-@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- * Better try to coalesce them right now to avoid future collapses.
- * Returns true if caller should free @from instead of queueing it
- */
--static bool tcp_try_coalesce(struct sock *sk,
-- struct sk_buff *to,
-- struct sk_buff *from,
-- bool *fragstolen)
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
-+ bool *fragstolen)
- {
- int delta;
-
- *fragstolen = false;
-
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ return false;
-+
- if (tcp_hdr(from)->fin)
- return false;
-
-@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
-
- /* Do skb overlap to previous one? */
- if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
-+ !(mptcp(tp) && end_seq == seq)) {
- /* All the bits are present. Drop. */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
- __kfree_skb(skb);
-@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
- end_seq);
- break;
- }
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
-+ continue;
- __skb_unlink(skb1, &tp->out_of_order_queue);
- tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
- TCP_SKB_CB(skb1)->end_seq);
-@@ -4280,8 +4325,8 @@ end:
- }
- }
-
--static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-- bool *fragstolen)
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen)
- {
- int eaten;
- struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
-@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
- int eaten = -1;
- bool fragstolen = false;
-
-- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-+ /* If no data is present, but a data_fin is in the options, we still
-+ * have to call mptcp_queue_skb later on. */
-+ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
-+ !(mptcp(tp) && mptcp_is_data_fin(skb)))
- goto drop;
-
- skb_dst_drop(skb);
-@@ -4389,7 +4437,7 @@ queue_and_out:
- eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
- }
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-- if (skb->len)
-+ if (skb->len || mptcp_is_data_fin(skb))
- tcp_event_data_recv(sk, skb);
- if (th->fin)
- tcp_fin(sk);
-@@ -4411,7 +4459,11 @@ queue_and_out:
-
- if (eaten > 0)
- kfree_skb_partial(skb, fragstolen);
-- if (!sock_flag(sk, SOCK_DEAD))
-+ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
-+ /* MPTCP: we always have to call data_ready, because
-+ * we may be about to receive a data-fin, which still
-+ * must get queued.
-+ */
- sk->sk_data_ready(sk);
- return;
- }
-@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
- next = skb_queue_next(list, skb);
-
- __skb_unlink(skb, list);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
- __kfree_skb(skb);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
-
-@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
- * Purge the out-of-order queue.
- * Return true if queue was pruned.
- */
--static bool tcp_prune_ofo_queue(struct sock *sk)
-+bool tcp_prune_ofo_queue(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- bool res = false;
-@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
- /* Collapsing did not help, destructive actions follow.
- * This must not ever occur. */
-
-- tcp_prune_ofo_queue(sk);
-+ tp->ops->prune_ofo_queue(sk);
-
- if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
- return 0;
-@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
- return -1;
- }
-
--static bool tcp_should_expand_sndbuf(const struct sock *sk)
-+/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
-+ * As additional protections, we do not touch cwnd in retransmission phases,
-+ * and if application hit its sndbuf limit recently.
-+ */
-+void tcp_cwnd_application_limited(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
-+ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
-+ /* Limited by application or receiver window. */
-+ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
-+ u32 win_used = max(tp->snd_cwnd_used, init_win);
-+ if (win_used < tp->snd_cwnd) {
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
-+ }
-+ tp->snd_cwnd_used = 0;
-+ }
-+ tp->snd_cwnd_stamp = tcp_time_stamp;
-+}
-+
-+bool tcp_should_expand_sndbuf(const struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-- if (tcp_should_expand_sndbuf(sk)) {
-+ if (tp->ops->should_expand_sndbuf(sk)) {
- tcp_sndbuf_expand(sk);
- tp->snd_cwnd_stamp = tcp_time_stamp;
- }
-@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
- {
- if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
- sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
-- if (sk->sk_socket &&
-- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
-+ if (mptcp(tcp_sk(sk)) ||
-+ (sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
- tcp_new_space(sk);
- }
- }
-@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
- /* ... and right edge of window advances far enough.
- * (tcp_recvmsg() will send ACK otherwise). Or...
- */
-- __tcp_select_window(sk) >= tp->rcv_wnd) ||
-+ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
- /* We ACK each frame or... */
- tcp_in_quickack_mode(sk) ||
- /* We have out of order data. */
-@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-+ /* MPTCP urgent data is not yet supported */
-+ if (mptcp(tp))
-+ return;
-+
- /* Check if we get a new urgent pointer - normally not. */
- if (th->urg)
- tcp_check_urg(sk, th);
-@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
- }
-
- #ifdef CONFIG_NET_DMA
--static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-- int hlen)
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- int chunk = skb->len - hlen;
-@@ -5052,9 +5132,15 @@ syn_challenge:
- goto discard;
- }
-
-+ /* If valid: post process the received MPTCP options. */
-+ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
-+ goto discard;
-+
- return true;
-
- discard:
-+ if (mptcp(tp))
-+ mptcp_reset_mopt(tp);
- __kfree_skb(skb);
- return false;
- }
-@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-
- tp->rx_opt.saw_tstamp = 0;
-
-+ /* MPTCP: force slowpath. */
-+ if (mptcp(tp))
-+ goto slow_path;
-+
- /* pred_flags is 0xS?10 << 16 + snd_wnd
- * if header_prediction is to be made
- * 'S' will always be tp->tcp_header_len >> 2
-@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
- }
- if (copied_early)
-- tcp_cleanup_rbuf(sk, skb->len);
-+ tp->ops->cleanup_rbuf(sk, skb->len);
- }
- if (!eaten) {
- if (tcp_checksum_complete_user(sk, skb))
-@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
-
- tcp_init_metrics(sk);
-
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- /* Prevent spurious tcp_cwnd_restart() on first data
- * packet.
- */
- tp->lsndtime = tcp_time_stamp;
-
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
-
- if (sock_flag(sk, SOCK_KEEPOPEN))
- inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
-@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
- /* Get original SYNACK MSS value if user MSS sets mss_clamp */
- tcp_clear_options(&opt);
- opt.user_mss = opt.mss_clamp = 0;
-- tcp_parse_options(synack, &opt, 0, NULL);
-+ tcp_parse_options(synack, &opt, NULL, 0, NULL);
- mss = opt.mss_clamp;
- }
-
-@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
-
- tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
-
-- if (data) { /* Retransmit unacked data in SYN */
-+ /* In mptcp case, we do not rely on "retransmit", but instead on
-+ * "transmit", because if fastopen data is not acked, the retransmission
-+ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
-+ */
-+ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
- tcp_for_write_queue_from(data, sk) {
- if (data == tcp_send_head(sk) ||
- __tcp_retransmit_skb(sk, data))
-@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcp_sock *tp = tcp_sk(sk);
- struct tcp_fastopen_cookie foc = { .len = -1 };
- int saved_clamp = tp->rx_opt.mss_clamp;
-+ struct mptcp_options_received mopt;
-+ mptcp_init_mp_opt(&mopt);
-
-- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
-+ tcp_parse_options(skb, &tp->rx_opt,
-+ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
- tcp_ack(sk, skb, FLAG_SLOWPATH);
-
-+ if (tp->request_mptcp || mptcp(tp)) {
-+ int ret;
-+ ret = mptcp_rcv_synsent_state_process(sk, &sk,
-+ skb, &mopt);
-+
-+ /* May have changed if we support MPTCP */
-+ tp = tcp_sk(sk);
-+ icsk = inet_csk(sk);
-+
-+ if (ret == 1)
-+ goto reset_and_undo;
-+ if (ret == 2)
-+ goto discard;
-+ }
-+
-+ if (mptcp(tp) && !is_master_tp(tp)) {
-+ /* Timer for repeating the ACK until an answer
-+ * arrives. Used only when establishing an additional
-+ * subflow inside of an MPTCP connection.
-+ */
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ }
-+
- /* Ok.. it's good. Set up sequence numbers and
- * move to established.
- */
-@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- if (tcp_is_sack(tp) && sysctl_tcp_fack)
- tcp_enable_fack(tp);
-
-@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_rcv_fastopen_synack(sk, skb, &foc))
- return -1;
-
-- if (sk->sk_write_pending ||
-+ /* With MPTCP we cannot send data on the third ack due to the
-+ * lack of option-space to combine with an MP_CAPABLE.
-+ */
-+ if (!mptcp(tp) && (sk->sk_write_pending ||
- icsk->icsk_accept_queue.rskq_defer_accept ||
-- icsk->icsk_ack.pingpong) {
-+ icsk->icsk_ack.pingpong)) {
- /* Save one ACK. Data will be ready after
- * several ticks, if write_pending is set.
- *
-@@ -5536,6 +5665,7 @@ discard:
- tcp_paws_reject(&tp->rx_opt, 0))
- goto discard_and_undo;
-
-+ /* TODO - check this here for MPTCP */
- if (th->syn) {
- /* We see SYN without ACK. It is attempt of
- * simultaneous connect with crossed SYNs.
-@@ -5552,6 +5682,11 @@ discard:
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
- tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
-
-@@ -5610,6 +5745,7 @@ reset_and_undo:
-
- int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- const struct tcphdr *th, unsigned int len)
-+ __releases(&sk->sk_lock.slock)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_SYN_SENT:
- queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
-+ if (is_meta_sk(sk)) {
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ tp = tcp_sk(sk);
-+
-+ /* Need to call it here, because it will announce new
-+ * addresses, which can only be done after the third ack
-+ * of the 3-way handshake.
-+ */
-+ mptcp_update_metasocket(sk, tp->meta_sk);
-+ }
- if (queued >= 0)
- return queued;
-
-@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_urg(sk, skb, th);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
-+ if (mptcp(tp) && is_master_tp(tp))
-+ bh_unlock_sock(sk);
- return 0;
- }
-
-@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- synack_stamp = tp->lsndtime;
- /* Make sure socket is routed, for correct metrics. */
- icsk->icsk_af_ops->rebuild_header(sk);
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- tcp_mtup_init(sk);
- tp->copied_seq = tp->rcv_nxt;
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
- }
- smp_mb();
- tcp_set_state(sk, TCP_ESTABLISHED);
-@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- if (tp->rx_opt.tstamp_ok)
- tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
-+ if (mptcp(tp))
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-
- if (req) {
- /* Re-arm the timer because data may have been sent out.
-@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- tcp_initialize_rcv_mss(sk);
- tcp_fast_path_on(tp);
-+ /* Send an ACK when establishing a new
-+ * MPTCP subflow, i.e. using an MP_JOIN
-+ * subtype.
-+ */
-+ if (mptcp(tp) && !is_master_tp(tp))
-+ tcp_send_ack(sk);
- break;
-
- case TCP_FIN_WAIT1: {
-@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tmo = tcp_fin_time(sk);
- if (tmo > TCP_TIMEWAIT_LEN) {
- inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
-- } else if (th->fin || sock_owned_by_user(sk)) {
-+ } else if (th->fin || mptcp_is_data_fin(skb) ||
-+ sock_owned_by_user(sk)) {
- /* Bad case. We could lose such FIN otherwise.
- * It is not a big problem, but it looks confusing
- * and not so rare event. We still can lose it now,
-@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- inet_csk_reset_keepalive_timer(sk, tmo);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto discard;
- }
- break;
-@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_CLOSING:
- if (tp->snd_una == tp->write_seq) {
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- goto discard;
- }
- break;
-@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- goto discard;
- }
- break;
-+ case TCP_CLOSE:
-+ if (tp->mp_killed)
-+ goto discard;
- }
-
- /* step 6: check the URG bit */
-@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- if (sk->sk_shutdown & RCV_SHUTDOWN) {
- if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp(tp)) {
-+ /* In case of mptcp, the reset is handled by
-+ * mptcp_rcv_state_process
-+ */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
- tcp_reset(sk);
- return 1;
-@@ -5877,3 +6041,154 @@ discard:
- return 0;
- }
- EXPORT_SYMBOL(tcp_rcv_state_process);
-+
-+static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ if (family == AF_INET)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-+ &ireq->ir_rmt_addr, port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (family == AF_INET6)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
-+ &ireq->ir_v6_rmt_addr, port);
-+#endif
-+}
-+
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_options_received tmp_opt;
-+ struct request_sock *req;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct dst_entry *dst = NULL;
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false, fastopen;
-+ struct flowi fl;
-+ struct tcp_fastopen_cookie foc = { .len = -1 };
-+ int err;
-+
-+
-+ /* TW buckets are converted to open requests without
-+ * limitations, they conserve resources and peer is
-+ * evidently real one.
-+ */
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+
-+ /* Accept backlog is full. If we have already queued enough
-+ * of warm entries in syn queue, drop request. It is better than
-+ * clogging syn queue with openreqs with exponentially increasing
-+ * timeout.
-+ */
-+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-+ goto drop;
-+ }
-+
-+ req = inet_reqsk_alloc(rsk_ops);
-+ if (!req)
-+ goto drop;
-+
-+ tcp_rsk(req)->af_specific = af_ops;
-+
-+ tcp_clear_options(&tmp_opt);
-+ tmp_opt.mss_clamp = af_ops->mss_clamp;
-+ tmp_opt.user_mss = tp->rx_opt.user_mss;
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
-+
-+ if (want_cookie && !tmp_opt.saw_tstamp)
-+ tcp_clear_options(&tmp_opt);
-+
-+ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-+ tcp_openreq_init(req, &tmp_opt, skb);
-+
-+ if (af_ops->init_req(req, sk, skb))
-+ goto drop_and_free;
-+
-+ if (security_inet_conn_request(sk, skb, req))
-+ goto drop_and_free;
-+
-+ if (!want_cookie || tmp_opt.tstamp_ok)
-+ TCP_ECN_create_request(req, skb, sock_net(sk));
-+
-+ if (want_cookie) {
-+ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-+ req->cookie_ts = tmp_opt.tstamp_ok;
-+ } else if (!isn) {
-+ /* VJ's idea. We save last timestamp seen
-+ * from the destination in peer table, when entering
-+ * state TIME-WAIT, and check against it before
-+ * accepting new connection request.
-+ *
-+ * If "isn" is not zero, this request hit alive
-+ * timewait bucket, so that all the necessary checks
-+ * are made in the function processing timewait state.
-+ */
-+ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
-+ bool strict;
-+
-+ dst = af_ops->route_req(sk, &fl, req, &strict);
-+ if (dst && strict &&
-+ !tcp_peer_is_proven(req, dst, true)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-+ goto drop_and_release;
-+ }
-+ }
-+ /* Kill the following clause, if you dislike this way. */
-+ else if (!sysctl_tcp_syncookies &&
-+ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-+ (sysctl_max_syn_backlog >> 2)) &&
-+ !tcp_peer_is_proven(req, dst, false)) {
-+ /* Without syncookies last quarter of
-+ * backlog is filled with destinations,
-+ * proven to be alive.
-+ * It means that we continue to communicate
-+ * to destinations, already remembered
-+ * to the moment of synflood.
-+ */
-+ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
-+ rsk_ops->family);
-+ goto drop_and_release;
-+ }
-+
-+ isn = af_ops->init_seq(skb);
-+ }
-+ if (!dst) {
-+ dst = af_ops->route_req(sk, &fl, req, NULL);
-+ if (!dst)
-+ goto drop_and_free;
-+ }
-+
-+ tcp_rsk(req)->snt_isn = isn;
-+ tcp_openreq_init_rwin(req, sk, dst);
-+ fastopen = !want_cookie &&
-+ tcp_try_fastopen(sk, skb, req, &foc, dst);
-+ err = af_ops->send_synack(sk, dst, &fl, req,
-+ skb_get_queue_mapping(skb), &foc);
-+ if (!fastopen) {
-+ if (err || want_cookie)
-+ goto drop_and_free;
-+
-+ tcp_rsk(req)->listener = NULL;
-+ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-+ }
-+
-+ return 0;
-+
-+drop_and_release:
-+ dst_release(dst);
-+drop_and_free:
-+ reqsk_free(req);
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+EXPORT_SYMBOL(tcp_conn_request);
-diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
-index 77cccda1ad0c..c77017f600f1 100644
---- a/net/ipv4/tcp_ipv4.c
-+++ b/net/ipv4/tcp_ipv4.c
-@@ -67,6 +67,8 @@
- #include <net/icmp.h>
- #include <net/inet_hashtables.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/transp_v6.h>
- #include <net/ipv6.h>
- #include <net/inet_common.h>
-@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
- struct inet_hashinfo tcp_hashinfo;
- EXPORT_SYMBOL(tcp_hashinfo);
-
--static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr,
-@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- struct inet_sock *inet;
- const int type = icmp_hdr(icmp_skb)->type;
- const int code = icmp_hdr(icmp_skb)->code;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- struct sk_buff *skb;
- struct request_sock *fastopen;
- __u32 seq, snd_una;
-@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- return;
- }
-
-- bh_lock_sock(sk);
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
- /* If too many ICMPs get dropped on busy
- * servers this needs to be solved differently.
- * We do take care of PMTU discovery (RFC1191) special case :
- * we can receive locally generated ICMP messages while socket is held.
- */
-- if (sock_owned_by_user(sk)) {
-+ if (sock_owned_by_user(meta_sk)) {
- if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
- }
-@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- icsk = inet_csk(sk);
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- goto out;
-
- tp->mtu_info = info;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_v4_mtu_reduced(sk);
- } else {
- if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
- goto out;
- }
-@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- !icsk->icsk_backoff || fastopen)
- break;
-
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- break;
-
- icsk->icsk_backoff--;
-@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet_csk_search_req(sk, &prev, th->dest,
-@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
-
- sk->sk_error_report(sk);
-@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- */
-
- inet = inet_sk(sk);
-- if (!sock_owned_by_user(sk) && inet->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else { /* Only an error on timeout */
-@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
- * Exception: precedence violation. We do not implement it in any case.
- */
-
--static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -702,10 +711,10 @@ release_sk1:
- outside socket context is ugly, certainly. What can I do?
- */
-
--static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key,
-- int reply_flags, u8 tos)
-+ int reply_flags, u8 tos, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- #ifdef CONFIG_TCP_MD5SIG
- + (TCPOLEN_MD5SIG_ALIGNED >> 2)
- #endif
-+#ifdef CONFIG_MPTCP
-+ + ((MPTCP_SUB_LEN_DSS >> 2) +
-+ (MPTCP_SUB_LEN_ACK >> 2))
-+#endif
- ];
- } rep;
- struct ip_reply_arg arg;
-@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- ip_hdr(skb)->daddr, &rep.th);
- }
- #endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ int offset = (tsecr) ? 3 : 0;
-+ /* Construction of 32-bit data_ack */
-+ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ rep.opt[offset] = htonl(data_ack);
-+
-+ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+ rep.th.doff = arg.iov[0].iov_len / 4;
-+ }
-+#endif /* CONFIG_MPTCP */
-+
- arg.flags = reply_flags;
- arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr, /* XXX */
-@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-+
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
-
- tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent,
- tw->tw_bound_dev_if,
- tcp_twsk_md5_key(tcptw),
- tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- tw->tw_tos
-+ tw->tw_tos, mptcp
- );
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
-+ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
- tcp_time_stamp,
- req->ts_recent,
- 0,
- tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
- AF_INET),
- inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- ip_hdr(skb)->tos);
-+ ip_hdr(skb)->tos, 0);
- }
-
- /*
-@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
- * This still operates on a request_sock only, not on a big
- * socket.
- */
--static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- const struct inet_request_sock *ireq = inet_rsk(req);
- struct flowi4 fl4;
-@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
- return err;
- }
-
--static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
--{
-- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
--
-- if (!res) {
-- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-- }
-- return res;
--}
--
- /*
- * IPv4 request_sock destructor.
- */
--static void tcp_v4_reqsk_destructor(struct request_sock *req)
-+void tcp_v4_reqsk_destructor(struct request_sock *req)
- {
- kfree(inet_rsk(req)->opt);
- }
-@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
- /*
- * Save and compile IPv4 options into the request_sock if needed.
- */
--static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
- {
- const struct ip_options *opt = &(IPCB(skb)->opt);
- struct ip_options_rcu *dopt = NULL;
-@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
-
- #endif
-
-+static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
-+ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
-+ ireq->no_srccheck = inet_sk(sk)->transparent;
-+ ireq->opt = tcp_v4_save_options(skb);
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
-+
-+ if (strict) {
-+ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
-+ *strict = true;
-+ else
-+ *strict = false;
-+ }
-+
-+ return dst;
-+}
-+
- struct request_sock_ops tcp_request_sock_ops __read_mostly = {
- .family = PF_INET,
- .obj_size = sizeof(struct tcp_request_sock),
-- .rtx_syn_ack = tcp_v4_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v4_reqsk_send_ack,
- .destructor = tcp_v4_reqsk_destructor,
- .send_reset = tcp_v4_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
-+ .mss_clamp = TCP_MSS_DEFAULT,
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
- .md5_lookup = tcp_v4_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v4_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v4_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v4_init_sequence,
-+#endif
-+ .route_req = tcp_v4_route_req,
-+ .init_seq = tcp_v4_init_sequence,
-+ .send_synack = tcp_v4_send_synack,
-+ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
-+};
-
- int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct tcp_sock *tp = tcp_sk(sk);
-- struct dst_entry *dst = NULL;
-- __be32 saddr = ip_hdr(skb)->saddr;
-- __be32 daddr = ip_hdr(skb)->daddr;
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- bool want_cookie = false, fastopen;
-- struct flowi4 fl4;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- int err;
--
- /* Never answer to SYNs send to broadcast or multicast */
- if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
- goto drop;
-
-- /* TW buckets are converted to open requests without
-- * limitations, they conserve resources and peer is
-- * evidently real one.
-- */
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- /* Accept backlog is full. If we have already queued enough
-- * of warm entries in syn queue, drop request. It is better than
-- * clogging syn queue with openreqs with exponentially increasing
-- * timeout.
-- */
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet_reqsk_alloc(&tcp_request_sock_ops);
-- if (!req)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
--
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
-+ return tcp_conn_request(&tcp_request_sock_ops,
-+ &tcp_request_sock_ipv4_ops, sk, skb);
-
-- ireq = inet_rsk(req);
-- ireq->ir_loc_addr = daddr;
-- ireq->ir_rmt_addr = saddr;
-- ireq->no_srccheck = inet_sk(sk)->transparent;
-- ireq->opt = tcp_v4_save_options(skb);
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_free;
--
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- if (want_cookie) {
-- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- } else if (!isn) {
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
-- fl4.daddr == saddr) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-- &saddr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v4_init_sequence(skb);
-- }
-- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v4_send_synack(sk, dst, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_rsk(req)->listener = NULL;
-- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
--
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0;
-@@ -1497,7 +1433,7 @@ put_and_exit:
- }
- EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
-
--static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct tcphdr *th = tcp_hdr(skb);
- const struct iphdr *iph = ip_hdr(skb);
-@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock again the meta-sk. It has been locked
-+ * before mptcp_v4_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
-+
- }
- inet_twsk_put(inet_twsk(nsk));
- return NULL;
-@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v4_do_rcv(sk, skb);
-+
- if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
- struct dst_entry *dst = sk->sk_rx_dst;
-
-@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
- } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
- wake_up_interruptible_sync_poll(sk_sleep(sk),
- POLLIN | POLLRDNORM | POLLRDBAND);
-- if (!inet_csk_ack_scheduled(sk))
-+ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- (3 * tcp_rto_min(sk)) / 4,
- TCP_RTO_MAX);
-@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
- {
- const struct iphdr *iph;
- const struct tcphdr *th;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff * 4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1759,11 +1729,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1771,16 +1751,16 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v4_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
-
-@@ -1835,6 +1815,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
-@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
-
- tcp_cleanup_congestion_control(sk);
-
-+ if (mptcp(tp))
-+ mptcp_destroy_sock(sk);
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+
- /* Cleanup up the write buffer. */
- tcp_write_queue_purge(sk);
-
-@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
- }
- #endif /* CONFIG_PROC_FS */
-
-+#ifdef CONFIG_MPTCP
-+static void tcp_v4_clear_sk(struct sock *sk, int size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* we do not want to clear tk_table field, because of RCU lookups */
-+ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
-+
-+ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
-+}
-+#endif
-+
- struct proto tcp_prot = {
- .name = "TCP",
- .owner = THIS_MODULE,
-@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
- .destroy_cgroup = tcp_destroy_cgroup,
- .proto_cgroup = tcp_proto_cgroup,
- #endif
-+#ifdef CONFIG_MPTCP
-+ .clear_sk = tcp_v4_clear_sk,
-+#endif
- };
- EXPORT_SYMBOL(tcp_prot);
-
-diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
-index e68e0d4af6c9..ae6946857dff 100644
---- a/net/ipv4/tcp_minisocks.c
-+++ b/net/ipv4/tcp_minisocks.c
-@@ -18,11 +18,13 @@
- * Jorge Cwik, <jorge@laser.satlink.net>
- */
-
-+#include <linux/kconfig.h>
- #include <linux/mm.h>
- #include <linux/module.h>
- #include <linux/slab.h>
- #include <linux/sysctl.h>
- #include <linux/workqueue.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/inet_common.h>
- #include <net/xfrm.h>
-@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- struct tcp_options_received tmp_opt;
- struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
- bool paws_reject = false;
-+ struct mptcp_options_received mopt;
-
- tmp_opt.saw_tstamp = 0;
- if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ mptcp_init_mp_opt(&mopt);
-+
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
-@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
- paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
- }
-+
-+ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
-+ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
-+ goto kill_with_rst;
-+ }
- }
-
- if (tw->tw_substate == TCP_FIN_WAIT2) {
-@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- if (!th->ack ||
- !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
- TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
-+ /* If mptcp_is_data_fin() returns true, we are sure that
-+ * mopt has been initialized - otherwise it would not
-+ * be a DATA_FIN.
-+ */
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
-+ mptcp_is_data_fin(skb) &&
-+ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
-+ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
-+ return TCP_TW_ACK;
-+
- inet_twsk_put(tw);
- return TCP_TW_SUCCESS;
- }
-@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
- tcptw->tw_ts_offset = tp->tsoffset;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_init_tw_sock(sk, tcptw)) {
-+ inet_twsk_free(tw);
-+ goto exit;
-+ }
-+ } else {
-+ tcptw->mptcp_tw = NULL;
-+ }
-+
- #if IS_ENABLED(CONFIG_IPV6)
- if (tw->tw_family == PF_INET6) {
- struct ipv6_pinfo *np = inet6_sk(sk);
-@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
- }
-
-+exit:
- tcp_update_metrics(sk);
- tcp_done(sk);
- }
-
- void tcp_twsk_destructor(struct sock *sk)
- {
--#ifdef CONFIG_TCP_MD5SIG
- struct tcp_timewait_sock *twsk = tcp_twsk(sk);
-
-+ if (twsk->mptcp_tw)
-+ mptcp_twsk_destructor(twsk);
-+#ifdef CONFIG_TCP_MD5SIG
- if (twsk->tw_md5_key)
- kfree_rcu(twsk->tw_md5_key, rcu);
- #endif
-@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
- req->window_clamp = tcp_full_space(sk);
-
- /* tcp_full_space because it is guaranteed to be the first packet */
-- tcp_select_initial_window(tcp_full_space(sk),
-- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
-+ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
- &req->rcv_wnd,
- &req->window_clamp,
- ireq->wscale_ok,
- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ dst_metric(dst, RTAX_INITRWND), sk);
- ireq->rcv_wscale = rcv_wscale;
- }
- EXPORT_SYMBOL(tcp_openreq_init_rwin);
-@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
- newtp->rx_opt.ts_recent_stamp = 0;
- newtp->tcp_header_len = sizeof(struct tcphdr);
- }
-+ if (ireq->saw_mpc)
-+ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
- newtp->tsoffset = 0;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->md5sig_info = NULL; /*XXX*/
-@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- bool fastopen)
- {
- struct tcp_options_received tmp_opt;
-+ struct mptcp_options_received mopt;
- struct sock *child;
- const struct tcphdr *th = tcp_hdr(skb);
- __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
- bool paws_reject = false;
-
-- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
-+ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
-
- tmp_opt.saw_tstamp = 0;
-+
-+ mptcp_init_mp_opt(&mopt);
-+
- if (th->doff > (sizeof(struct tcphdr)>>2)) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.ts_recent = req->ts_recent;
-@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- *
- * Reset timer after retransmitting SYNACK, similar to
- * the idea of fast retransmit in recovery.
-+ *
-+ * Fall back to TCP if MP_CAPABLE is not set.
- */
-+
-+ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
-+ inet_rsk(req)->saw_mpc = false;
-+
-+
- if (!inet_rtx_syn_ack(sk, req))
- req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
- TCP_RTO_MAX) + jiffies;
-@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- * socket is created, wait for troubles.
- */
- child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
-+
- if (child == NULL)
- goto listen_overflow;
-
-+ if (!is_meta_sk(sk)) {
-+ int ret = mptcp_check_req_master(sk, child, req, prev);
-+ if (ret < 0)
-+ goto listen_overflow;
-+
-+ /* MPTCP-supported */
-+ if (!ret)
-+ return tcp_sk(child)->mpcb->master_sk;
-+ } else {
-+ return mptcp_check_req_child(sk, child, req, prev, &mopt);
-+ }
- inet_csk_reqsk_queue_unlink(sk, req, prev);
- inet_csk_reqsk_queue_removed(sk, req);
-
-@@ -746,7 +804,17 @@ embryonic_reset:
- tcp_reset(sk);
- }
- if (!fastopen) {
-- inet_csk_reqsk_queue_drop(sk, req, prev);
-+ if (is_meta_sk(sk)) {
-+ /* We want to avoid stoping the keepalive-timer and so
-+ * avoid ending up in inet_csk_reqsk_queue_removed ...
-+ */
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
-+ mptcp_delete_synack_timer(sk);
-+ reqsk_free(req);
-+ } else {
-+ inet_csk_reqsk_queue_drop(sk, req, prev);
-+ }
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
- }
- return NULL;
-@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- {
- int ret = 0;
- int state = child->sk_state;
-+ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
-
-- if (!sock_owned_by_user(child)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
- skb->len);
- /* Wakeup parent, send SIGIO */
-@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- * in main socket hash table and lock on listening
- * socket does not protect us more.
- */
-- __sk_add_backlog(child, skb);
-+ if (mptcp(tcp_sk(child)))
-+ skb->sk = child;
-+ __sk_add_backlog(meta_sk, skb);
- }
-
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- return ret;
- }
-diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
-index 179b51e6bda3..efd31b6c5784 100644
---- a/net/ipv4/tcp_output.c
-+++ b/net/ipv4/tcp_output.c
-@@ -36,6 +36,12 @@
-
- #define pr_fmt(fmt) "TCP: " fmt
-
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/ipv6.h>
- #include <net/tcp.h>
-
- #include <linux/compiler.h>
-@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
- unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
- EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
-
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-- int push_one, gfp_t gfp);
--
- /* Account for new data that has been sent to the network. */
--static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
- void tcp_select_initial_window(int __space, __u32 mss,
- __u32 *rcv_wnd, __u32 *window_clamp,
- int wscale_ok, __u8 *rcv_wscale,
-- __u32 init_rcv_wnd)
-+ __u32 init_rcv_wnd, const struct sock *sk)
- {
- unsigned int space = (__space < 0 ? 0 : __space);
-
-@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
- * value can be stuffed directly into th->window for an outgoing
- * frame.
- */
--static u16 tcp_select_window(struct sock *sk)
-+u16 tcp_select_window(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 old_win = tp->rcv_wnd;
-- u32 cur_win = tcp_receive_window(tp);
-- u32 new_win = __tcp_select_window(sk);
-+ /* The window must never shrink at the meta-level. At the subflow we
-+ * have to allow this. Otherwise we may announce a window too large
-+ * for the current meta-level sk_rcvbuf.
-+ */
-+ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
-+ u32 new_win = tp->ops->__select_window(sk);
-
- /* Never shrink the offered window */
- if (new_win < cur_win) {
-@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
- LINUX_MIB_TCPWANTZEROWINDOWADV);
- new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
- }
-+
- tp->rcv_wnd = new_win;
- tp->rcv_wup = tp->rcv_nxt;
-
-@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
- /* Constructs common control bits of non-data skb. If SYN/FIN is present,
- * auto increment end seqno.
- */
--static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
-@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- TCP_SKB_CB(skb)->end_seq = seq;
- }
-
--static inline bool tcp_urg_mode(const struct tcp_sock *tp)
-+bool tcp_urg_mode(const struct tcp_sock *tp)
- {
- return tp->snd_una != tp->snd_up;
- }
-@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
- #define OPTION_MD5 (1 << 2)
- #define OPTION_WSCALE (1 << 3)
- #define OPTION_FAST_OPEN_COOKIE (1 << 8)
--
--struct tcp_out_options {
-- u16 options; /* bit field of OPTION_* */
-- u16 mss; /* 0 to disable */
-- u8 ws; /* window scale, 0 to disable */
-- u8 num_sack_blocks; /* number of SACK blocks to include */
-- u8 hash_size; /* bytes in hash_location */
-- __u8 *hash_location; /* temporary pointer, overloaded */
-- __u32 tsval, tsecr; /* need to include OPTION_TS */
-- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
--};
-+/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
-
- /* Write previously computed TCP options to the packet.
- *
-@@ -430,7 +428,7 @@ struct tcp_out_options {
- * (but it may well be that other scenarios fail similarly).
- */
- static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-- struct tcp_out_options *opts)
-+ struct tcp_out_options *opts, struct sk_buff *skb)
- {
- u16 options = opts->options; /* mungable copy */
-
-@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
- }
- ptr += (foc->len + 3) >> 2;
- }
-+
-+ if (unlikely(OPTION_MPTCP & opts->options))
-+ mptcp_options_write(ptr, tp, opts, skb);
- }
-
- /* Compute TCP options for SYN packets. This is not the final
-@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
- if (unlikely(!(OPTION_TS & opts->options)))
- remaining -= TCPOLEN_SACKPERM_ALIGNED;
- }
-+ if (tp->request_mptcp || mptcp(tp))
-+ mptcp_syn_options(sk, opts, &remaining);
-
- if (fastopen && fastopen->cookie.len >= 0) {
- u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
-@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
- }
- }
-
-+ if (ireq->saw_mpc)
-+ mptcp_synack_options(req, opts, &remaining);
-+
- return MAX_TCP_OPTION_SPACE - remaining;
- }
-
-@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
- opts->tsecr = tp->rx_opt.ts_recent;
- size += TCPOLEN_TSTAMP_ALIGNED;
- }
-+ if (mptcp(tp))
-+ mptcp_established_options(sk, skb, opts, &size);
-
- eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
- if (unlikely(eff_sacks)) {
-- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-- opts->num_sack_blocks =
-- min_t(unsigned int, eff_sacks,
-- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-- TCPOLEN_SACK_PERBLOCK);
-- size += TCPOLEN_SACK_BASE_ALIGNED +
-- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
-+ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
-+ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
-+ opts->num_sack_blocks = 0;
-+ else
-+ opts->num_sack_blocks =
-+ min_t(unsigned int, eff_sacks,
-+ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-+ TCPOLEN_SACK_PERBLOCK);
-+ if (opts->num_sack_blocks)
-+ size += TCPOLEN_SACK_BASE_ALIGNED +
-+ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
- }
-
- return size;
-@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
- if ((1 << sk->sk_state) &
- (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
- TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
-- 0, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
-+ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
- }
- /*
- * One tasklet per cpu tries to send more skbs.
-@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
- unsigned long flags;
- struct list_head *q, *n;
- struct tcp_sock *tp;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
-
- local_irq_save(flags);
- list_splice_init(&tsq->head, &list);
-@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
- list_del(&tp->tsq_node);
-
- sk = (struct sock *)tp;
-- bh_lock_sock(sk);
-+ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ bh_lock_sock(meta_sk);
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_tsq_handler(sk);
-+ if (mptcp(tp))
-+ tcp_tsq_handler(meta_sk);
- } else {
-+ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
-+ goto exit;
-+
- /* defer the work to tcp_release_cb() */
- set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
-+
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+exit:
-+ bh_unlock_sock(meta_sk);
-
- clear_bit(TSQ_QUEUED, &tp->tsq_flags);
- sk_free(sk);
-@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
- #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
- (1UL << TCP_WRITE_TIMER_DEFERRED) | \
- (1UL << TCP_DELACK_TIMER_DEFERRED) | \
-- (1UL << TCP_MTU_REDUCED_DEFERRED))
-+ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
-+ (1UL << MPTCP_PATH_MANAGER) | \
-+ (1UL << MPTCP_SUB_DEFERRED))
-+
- /**
- * tcp_release_cb - tcp release_sock() callback
- * @sk: socket
-@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
- sk->sk_prot->mtu_reduced(sk);
- __sock_put(sk);
- }
-+ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
-+ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
-+ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
-+ __sock_put(sk);
-+ }
-+ if (flags & (1UL << MPTCP_SUB_DEFERRED))
-+ mptcp_tsq_sub_deferred(sk);
- }
- EXPORT_SYMBOL(tcp_release_cb);
-
-@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
- * We are working here with either a clone of the original
- * SKB, or a fresh unique copy made by the retransmit engine.
- */
--static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-- gfp_t gfp_mask)
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask)
- {
- const struct inet_connection_sock *icsk = inet_csk(sk);
- struct inet_sock *inet;
-@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- */
- th->window = htons(min(tp->rcv_wnd, 65535U));
- } else {
-- th->window = htons(tcp_select_window(sk));
-+ th->window = htons(tp->ops->select_window(sk));
- }
- th->check = 0;
- th->urg_ptr = 0;
-@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- }
- }
-
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
- TCP_ECN_send(sk, skb, tcp_header_size);
-
-@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
- * otherwise socket can stall.
- */
--static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- }
-
- /* Initialize TSO segments for a packet. */
--static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
- /* Make sure we own this skb before messing gso_size/gso_segs */
- WARN_ON_ONCE(skb_cloned(skb));
-
-- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
-+ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
-+ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
- /* Avoid the costly divide in the normal
- * non-TSO case.
- */
-@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
- /* Pcount in the middle of the write queue got changed, we need to do various
- * tweaks to fix counters
- */
--static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
- * eventually). The difference is that pulled data not copied, but
- * immediately discarded.
- */
--static void __pskb_trim_head(struct sk_buff *skb, int len)
-+void __pskb_trim_head(struct sk_buff *skb, int len)
- {
- struct skb_shared_info *shinfo;
- int i, k, eat;
-@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
- /* Remove acked data from a packet in the transmit queue. */
- int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- {
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
-+ return mptcp_trim_head(sk, skb, len);
-+
- if (skb_unclone(skb, GFP_ATOMIC))
- return -ENOMEM;
-
-@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- if (tcp_skb_pcount(skb) > 1)
- tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-
-+#ifdef CONFIG_MPTCP
-+ /* Some data got acked - we assume that the seq-number reached the dest.
-+ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
-+ * Only remove the SEQ if the call does not come from a meta retransmit.
-+ */
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
-+#endif
-+
- return 0;
- }
-
-@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
-
- return mss_now;
- }
-+EXPORT_SYMBOL(tcp_current_mss);
-
- /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
- * As additional protections, we do not touch cwnd in retransmission phases,
-@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
- * But we can avoid doing the divide again given we already have
- * skb_pcount = skb->len / mss_now
- */
--static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-- const struct sk_buff *skb)
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb)
- {
- if (skb->len < tcp_skb_pcount(skb) * mss_now)
- tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
-@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
- (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
- }
- /* Returns the portion of skb which can be sent right away */
--static unsigned int tcp_mss_split_point(const struct sock *sk,
-- const struct sk_buff *skb,
-- unsigned int mss_now,
-- unsigned int max_segs,
-- int nonagle)
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- u32 partial, needed, window, max_len;
-@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
- /* Can at least one segment of SKB be sent right now, according to the
- * congestion window rules? If so, return how many segments are allowed.
- */
--static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb)
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-+ const struct sk_buff *skb)
- {
- u32 in_flight, cwnd;
-
- /* Don't be strict about the congestion window for the final FIN. */
-- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
-+ if (skb &&
-+ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
- tcp_skb_pcount(skb) == 1)
- return 1;
-
-@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
- * This must be invoked the first time we consider transmitting
- * SKB onto the wire.
- */
--static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- int tso_segs = tcp_skb_pcount(skb);
-
-@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
- /* Return true if the Nagle test allows this packet to be
- * sent now.
- */
--static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-- unsigned int cur_mss, int nonagle)
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle)
- {
- /* Nagle rule does not apply to frames, which sit in the middle of the
- * write_queue (they have no chances to get new data).
-@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- return true;
-
- /* Don't use the nagle rule for urgent data (or for the final FIN). */
-- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
-+ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
-+ mptcp_is_data_fin(skb))
- return true;
-
- if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
-@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- }
-
- /* Does at least the first segment of SKB fit into the send window? */
--static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb,
-- unsigned int cur_mss)
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss)
- {
- u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-
-@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
- u32 send_win, cong_win, limit, in_flight;
- int win_divisor;
-
-- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
-+ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
- goto send_now;
-
- if (icsk->icsk_ca_state != TCP_CA_Open)
-@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
- * Returns true, if no segments are in flight and we have queued segments,
- * but cannot send anything now because of SWS or another problem.
- */
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
- int push_one, gfp_t gfp)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-
- sent_pkts = 0;
-
-- if (!push_one) {
-+ /* pmtu not yet supported with MPTCP. Should be possible, by early
-+ * exiting the loop inside tcp_mtu_probe, making sure that only one
-+ * single DSS-mapping gets probed.
-+ */
-+ if (!push_one && !mptcp(tp)) {
- /* Do MTU probing. */
- result = tcp_mtu_probe(sk);
- if (!result) {
-@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
- int err = -1;
-
- if (tcp_send_head(sk) != NULL) {
-- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
-+ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
-+ GFP_ATOMIC);
- goto rearm_timer;
- }
-
-@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
- if (unlikely(sk->sk_state == TCP_CLOSE))
- return;
-
-- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
-- sk_gfp_atomic(sk, GFP_ATOMIC)))
-+ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
-+ sk_gfp_atomic(sk, GFP_ATOMIC)))
- tcp_check_probe_timer(sk);
- }
-
-@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
-
- BUG_ON(!skb || skb->len < mss_now);
-
-- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
-+ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
-+ sk->sk_allocation);
- }
-
- /* This function returns the amount that we can raise the
-@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
- return;
-
-+ /* Currently not supported for MPTCP - but it should be possible */
-+ if (mptcp(tp))
-+ return;
-+
- tcp_for_write_queue_from_safe(skb, tmp, sk) {
- if (!tcp_can_collapse(sk, skb))
- break;
-@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
-
- /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
- th->window = htons(min(req->rcv_wnd, 65535U));
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- th->doff = (tcp_header_size >> 2);
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
-
-@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
- (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
- tp->window_clamp = tcp_full_space(sk);
-
-- tcp_select_initial_window(tcp_full_space(sk),
-- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-- &tp->rcv_wnd,
-- &tp->window_clamp,
-- sysctl_tcp_window_scaling,
-- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-+ &tp->rcv_wnd,
-+ &tp->window_clamp,
-+ sysctl_tcp_window_scaling,
-+ &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- tp->rx_opt.rcv_wscale = rcv_wscale;
- tp->rcv_ssthresh = tp->rcv_wnd;
-@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
- inet_csk(sk)->icsk_retransmits = 0;
- tcp_clear_retrans(tp);
-+
-+#ifdef CONFIG_MPTCP
-+ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
-+ if (is_master_tp(tp)) {
-+ tp->request_mptcp = 1;
-+ mptcp_connect_init(sk);
-+ } else if (tp->mptcp) {
-+ struct inet_sock *inet = inet_sk(sk);
-+
-+ tp->mptcp->snt_isn = tp->write_seq;
-+ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
-+
-+ /* Set nonce for new subflows */
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
-+ inet->inet_saddr,
-+ inet->inet_daddr,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
-+ inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#endif
-+ }
-+ }
-+#endif
- }
-
- static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
-@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
- TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
- }
-+EXPORT_SYMBOL(tcp_send_ack);
-
- /* This routine sends a packet with an out of date sequence
- * number. It assumes the other end will try to ack it.
-@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
- * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
- * out-of-date with SND.UNA-1 to probe window.
- */
--static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *skb;
-@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
- struct tcp_sock *tp = tcp_sk(sk);
- int err;
-
-- err = tcp_write_wakeup(sk);
-+ err = tp->ops->write_wakeup(sk);
-
- if (tp->packets_out || !tcp_send_head(sk)) {
- /* Cancel probe timer, if it is not required. */
-@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
- TCP_RTO_MAX);
- }
- }
-+
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
-+{
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
-+ int res;
-+
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
-+ if (!res) {
-+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-+ }
-+ return res;
-+}
-+EXPORT_SYMBOL(tcp_rtx_synack);
-diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
-index 286227abed10..966b873cbf3e 100644
---- a/net/ipv4/tcp_timer.c
-+++ b/net/ipv4/tcp_timer.c
-@@ -20,6 +20,7 @@
-
- #include <linux/module.h>
- #include <linux/gfp.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
-
- int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
-@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
- int sysctl_tcp_orphan_retries __read_mostly;
- int sysctl_tcp_thin_linear_timeouts __read_mostly;
-
--static void tcp_write_err(struct sock *sk)
-+void tcp_write_err(struct sock *sk)
- {
- sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
- sk->sk_error_report(sk);
-@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
- (!tp->snd_wnd && !tp->packets_out))
- do_reset = 1;
- if (do_reset)
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_done(sk);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
- return 1;
-@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
- */
--static bool retransmits_timed_out(struct sock *sk,
-- unsigned int boundary,
-- unsigned int timeout,
-- bool syn_set)
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set)
- {
- unsigned int linear_backoff_thresh, start_ts;
- unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
-@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
- }
-
- /* A write timeout has occurred. Process the after effects. */
--static int tcp_write_timeout(struct sock *sk)
-+int tcp_write_timeout(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
- }
- retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
- syn_set = true;
-+ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
-+ if (tcp_sk(sk)->request_mptcp &&
-+ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
-+ tcp_sk(sk)->request_mptcp = 0;
- } else {
- if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
- /* Black hole detection */
-@@ -251,18 +254,22 @@ out:
- static void tcp_delack_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_delack_timer_handler(sk);
- } else {
- inet_csk(sk)->icsk_ack.blocked = 1;
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -479,6 +486,10 @@ out_reset_timer:
- __sk_dst_reset(sk);
-
- out:;
-+ if (mptcp(tp)) {
-+ mptcp_reinject_data(sk, 1);
-+ mptcp_set_rto(sk);
-+ }
- }
-
- void tcp_write_timer_handler(struct sock *sk)
-@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
- break;
- case ICSK_TIME_RETRANS:
- icsk->icsk_pending = 0;
-- tcp_retransmit_timer(sk);
-+ tcp_sk(sk)->ops->retransmit_timer(sk);
- break;
- case ICSK_TIME_PROBE0:
- icsk->icsk_pending = 0;
-@@ -520,16 +531,19 @@ out:
- static void tcp_write_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_write_timer_handler(sk);
- } else {
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
- struct sock *sk = (struct sock *) data;
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
- u32 elapsed;
-
- /* Only process if socket is not in use. */
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
- /* Try again later. */
- inet_csk_reset_keepalive_timer (sk, HZ/20);
- goto out;
-@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
- goto out;
- }
-
-+ if (tp->send_mp_fclose) {
-+ /* MUST do this before tcp_write_timeout, because retrans_stamp
-+ * may have been set to 0 in another part while we are
-+ * retransmitting MP_FASTCLOSE. Then, we would crash, because
-+ * retransmits_timed_out accesses the meta-write-queue.
-+ *
-+ * We make sure that the timestamp is != 0.
-+ */
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk))
-+ goto out;
-+
-+ tcp_send_ack(sk);
-+ icsk->icsk_retransmits++;
-+
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ elapsed = icsk->icsk_rto;
-+ goto resched;
-+ }
-+
- if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
- if (tp->linger2 >= 0) {
- const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
-
- if (tmo > 0) {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto out;
- }
- }
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
-@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
- icsk->icsk_probes_out > 0) ||
- (icsk->icsk_user_timeout == 0 &&
- icsk->icsk_probes_out >= keepalive_probes(tp))) {
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_write_err(sk);
- goto out;
- }
-- if (tcp_write_wakeup(sk) <= 0) {
-+ if (tp->ops->write_wakeup(sk) <= 0) {
- icsk->icsk_probes_out++;
- elapsed = keepalive_intvl_when(tp);
- } else {
-@@ -642,7 +679,7 @@ death:
- tcp_done(sk);
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
-index 5667b3003af9..7139c2973fd2 100644
---- a/net/ipv6/addrconf.c
-+++ b/net/ipv6/addrconf.c
-@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
-
- kfree_rcu(ifp, rcu);
- }
-+EXPORT_SYMBOL(inet6_ifa_finish_destroy);
-
- static void
- ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
-diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
-index 7cb4392690dd..7057afbca4df 100644
---- a/net/ipv6/af_inet6.c
-+++ b/net/ipv6/af_inet6.c
-@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
- return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
- }
-
--static int inet6_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct inet_sock *inet;
- struct ipv6_pinfo *np;
-diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
-index a245e5ddffbd..99c892b8992d 100644
---- a/net/ipv6/inet6_connection_sock.c
-+++ b/net/ipv6/inet6_connection_sock.c
-@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
- /*
- * request_sock (formerly open request) hash tables.
- */
--static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize)
- {
- u32 c;
-
-diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
-index edb58aff4ae7..ea4d9fda0927 100644
---- a/net/ipv6/ipv6_sockglue.c
-+++ b/net/ipv6/ipv6_sockglue.c
-@@ -48,6 +48,8 @@
- #include <net/addrconf.h>
- #include <net/inet_common.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/xfrm.h>
-@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
- sock_prot_inuse_add(net, &tcp_prot, 1);
- local_bh_enable();
- sk->sk_prot = &tcp_prot;
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
- sk->sk_socket->ops = &inet_stream_ops;
- sk->sk_family = PF_INET;
- tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
-diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
-index a822b880689b..b2b38869d795 100644
---- a/net/ipv6/syncookies.c
-+++ b/net/ipv6/syncookies.c
-@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-
- ret = NULL;
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
- if (!req)
- goto out;
-
-@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
- }
-
- req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
-index 229239ad96b1..fda94d71666e 100644
---- a/net/ipv6/tcp_ipv6.c
-+++ b/net/ipv6/tcp_ipv6.c
-@@ -63,6 +63,8 @@
- #include <net/inet_common.h>
- #include <net/secure_seq.h>
- #include <net/tcp_memcontrol.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
- #include <net/busy_poll.h>
-
- #include <linux/proc_fs.h>
-@@ -71,12 +73,6 @@
- #include <linux/crypto.h>
- #include <linux/scatterlist.h>
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req);
--
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
--
- static const struct inet_connection_sock_af_ops ipv6_mapped;
- static const struct inet_connection_sock_af_ops ipv6_specific;
- #ifdef CONFIG_TCP_MD5SIG
-@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
- }
- #endif
-
--static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- {
- struct dst_entry *dst = skb_dst(skb);
- const struct rt6_info *rt = (const struct rt6_info *)dst;
-@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
- }
-
--static void tcp_v6_hash(struct sock *sk)
-+void tcp_v6_hash(struct sock *sk)
- {
- if (sk->sk_state != TCP_CLOSE) {
-- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
-+ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
-+ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
- tcp_prot.hash(sk);
- return;
- }
-@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
- }
- }
-
--static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
- ipv6_hdr(skb)->saddr.s6_addr32,
-@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- tcp_hdr(skb)->source);
- }
-
--static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- int addr_len)
- {
- struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
-@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- sin.sin_port = usin->sin6_port;
- sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
-
-- icsk->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_mapped;
- sk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-
- if (err) {
- icsk->icsk_ext_hdr_len = exthdrlen;
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
- sk->sk_backlog_rcv = tcp_v6_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_specific;
-@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
- const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
- struct ipv6_pinfo *np;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- int err;
- struct tcp_sock *tp;
- struct request_sock *fastopen;
-@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- return;
- }
-
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
-
- if (sk->sk_state == TCP_CLOSE)
-@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
-
- tp->mtu_info = ntohl(info);
-- if (!sock_owned_by_user(sk))
-+ if (!sock_owned_by_user(meta_sk))
- tcp_v6_mtu_reduced(sk);
-- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
-+ else {
-+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
- &tp->tsq_flags))
-- sock_hold(sk);
-+ sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
-+ }
- goto out;
- }
-
-@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
-@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
- sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
-
-@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- if (!sock_owned_by_user(sk) && np->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && np->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else
- sk->sk_err_soft = err;
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-
--static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct flowi6 *fl6,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- struct inet_request_sock *ireq = inet_rsk(req);
- struct ipv6_pinfo *np = inet6_sk(sk);
-+ struct flowi6 *fl6 = &fl->u.ip6;
- struct sk_buff *skb;
- int err = -ENOMEM;
-
-@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
- skb_set_queue_mapping(skb, queue_mapping);
- err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
- err = net_xmit_eval(err);
-+ if (!tcp_rsk(req)->snt_synack && !err)
-+ tcp_rsk(req)->snt_synack = tcp_time_stamp;
- }
-
- done:
- return err;
- }
-
--static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- {
-- struct flowi6 fl6;
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
- int res;
-
-- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
- if (!res) {
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- return res;
- }
-
--static void tcp_v6_reqsk_destructor(struct request_sock *req)
-+void tcp_v6_reqsk_destructor(struct request_sock *req)
- {
- kfree_skb(inet_rsk(req)->pktopts);
- }
-@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
- }
- #endif
-
-+static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+ struct ipv6_pinfo *np = inet6_sk(sk);
-+
-+ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-+ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-+
-+ ireq->ir_iif = sk->sk_bound_dev_if;
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ /* So that link locals have meaning */
-+ if (!sk->sk_bound_dev_if &&
-+ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-+ ireq->ir_iif = inet6_iif(skb);
-+
-+ if (!TCP_SKB_CB(skb)->when &&
-+ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
-+ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
-+ np->rxopt.bits.rxohlim || np->repflow)) {
-+ atomic_inc(&skb->users);
-+ ireq->pktopts = skb;
-+ }
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ if (strict)
-+ *strict = true;
-+ return inet6_csk_route_req(sk, &fl->u.ip6, req);
-+}
-+
- struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
- .family = AF_INET6,
- .obj_size = sizeof(struct tcp6_request_sock),
-- .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v6_reqsk_send_ack,
- .destructor = tcp_v6_reqsk_destructor,
- .send_reset = tcp_v6_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
-+ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
-+ sizeof(struct ipv6hdr),
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
- .md5_lookup = tcp_v6_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v6_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v6_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v6_init_sequence,
-+#endif
-+ .route_req = tcp_v6_route_req,
-+ .init_seq = tcp_v6_init_sequence,
-+ .send_synack = tcp_v6_send_synack,
-+ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
-+};
-
--static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
-- u32 tsval, u32 tsecr, int oif,
-- struct tcp_md5sig_key *key, int rst, u8 tclass,
-- u32 label)
-+static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
-+ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
-+ int oif, struct tcp_md5sig_key *key, int rst,
-+ u8 tclass, u32 label, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct tcphdr *t1;
-@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- if (key)
- tot_len += TCPOLEN_MD5SIG_ALIGNED;
- #endif
--
-+#ifdef CONFIG_MPTCP
-+ if (mptcp)
-+ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+#endif
- buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
- GFP_ATOMIC);
- if (buff == NULL)
-@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- tcp_v6_md5_hash_hdr((__u8 *)topt, key,
- &ipv6_hdr(skb)->saddr,
- &ipv6_hdr(skb)->daddr, t1);
-+ topt += 4;
-+ }
-+#endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ /* Construction of 32-bit data_ack */
-+ *topt++ = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ *topt++ = htonl(data_ack);
- }
- #endif
-
-@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- kfree_skb(buff);
- }
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- u32 seq = 0, ack_seq = 0;
-@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- (th->doff << 2);
-
- oif = sk ? sk->sk_bound_dev_if : 0;
-- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
-+ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
-
- #ifdef CONFIG_TCP_MD5SIG
- release_sk1:
-@@ -902,45 +983,52 @@ release_sk1:
- #endif
- }
-
--static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key, u8 tclass,
-- u32 label)
-+ u32 label, int mptcp)
- {
-- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
-- label);
-+ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
-+ key, 0, tclass, label, mptcp);
- }
-
- static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
- tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
-- tw->tw_tclass, (tw->tw_flowlabel << 12));
-+ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt,
-+ tcp_rsk(req)->rcv_nxt, 0,
- req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
- tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
-- 0, 0);
-+ 0, 0, 0);
- }
-
-
--static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct request_sock *req, **prev;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock again the meta-sk. It has been locked
-+ * before mptcp_v6_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
- }
- inet_twsk_put(inet_twsk(nsk));
-@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- return sk;
- }
-
--/* FIXME: this is substantially similar to the ipv4 code.
-- * Can some kind of merge be done? -- erics
-- */
--static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct ipv6_pinfo *np = inet6_sk(sk);
-- struct tcp_sock *tp = tcp_sk(sk);
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- struct dst_entry *dst = NULL;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- bool want_cookie = false, fastopen;
-- struct flowi6 fl6;
-- int err;
--
- if (skb->protocol == htons(ETH_P_IP))
- return tcp_v4_conn_request(sk, skb);
-
- if (!ipv6_unicast_destination(skb))
- goto drop;
-
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-- if (req == NULL)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
-+ return tcp_conn_request(&tcp6_request_sock_ops,
-+ &tcp_request_sock_ipv6_ops, sk, skb);
-
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
--
-- ireq = inet_rsk(req);
-- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- ireq->ir_iif = sk->sk_bound_dev_if;
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- /* So that link locals have meaning */
-- if (!sk->sk_bound_dev_if &&
-- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-- ireq->ir_iif = inet6_iif(skb);
--
-- if (!isn) {
-- if (ipv6_opt_accepted(sk, skb) ||
-- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
-- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
-- np->repflow) {
-- atomic_inc(&skb->users);
-- ireq->pktopts = skb;
-- }
--
-- if (want_cookie) {
-- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- goto have_isn;
-- }
--
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
-- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v6_init_sequence(skb);
-- }
--have_isn:
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_release;
--
-- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v6_send_synack(sk, dst, &fl6, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->listener = NULL;
-- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0; /* don't send reset */
- }
-
--static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req,
-- struct dst_entry *dst)
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst)
- {
- struct inet_request_sock *ireq;
- struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
-@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-
- newsk->sk_v6_rcv_saddr = newnp->saddr;
-
-- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(newsk))
-+ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
- newsk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -1329,7 +1292,7 @@ out:
- * This is because we cannot sleep with the original spinlock
- * held.
- */
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- {
- struct ipv6_pinfo *np = inet6_sk(sk);
- struct tcp_sock *tp;
-@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v6_do_rcv(sk, skb);
-+
- if (sk_filter(sk, skb))
- goto discard;
-
-@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- {
- const struct tcphdr *th;
- const struct ipv6hdr *hdr;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff*4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1529,11 +1520,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1541,16 +1542,17 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v6_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
- return ret ? -1 : 0;
-@@ -1607,6 +1609,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
- }
- }
-
--static struct timewait_sock_ops tcp6_timewait_sock_ops = {
-+struct timewait_sock_ops tcp6_timewait_sock_ops = {
- .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
- .twsk_unique = tcp_twsk_unique,
- .twsk_destructor = tcp_twsk_destructor,
-@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
-@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
- return 0;
- }
-
--static void tcp_v6_destroy_sock(struct sock *sk)
-+void tcp_v6_destroy_sock(struct sock *sk)
- {
- tcp_v4_destroy_sock(sk);
- inet6_destroy_sock(sk);
-@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
- static void tcp_v6_clear_sk(struct sock *sk, int size)
- {
- struct inet_sock *inet = inet_sk(sk);
-+#ifdef CONFIG_MPTCP
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ /* size_tk_table goes from the end of tk_table to the end of sk */
-+ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
-+ sizeof(tp->tk_table);
-+#endif
-
- /* we do not want to clear pinet6 field, because of RCU lookups */
- sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
-
- size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
-+
-+#ifdef CONFIG_MPTCP
-+ /* We zero out only from pinet6 to tk_table */
-+ size -= size_tk_table + sizeof(tp->tk_table);
-+#endif
- memset(&inet->pinet6 + 1, 0, size);
-+
-+#ifdef CONFIG_MPTCP
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
-+#endif
-+
- }
-
- struct proto tcpv6_prot = {
-diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
-new file mode 100644
-index 000000000000..cdfc03adabf8
---- /dev/null
-+++ b/net/mptcp/Kconfig
-@@ -0,0 +1,115 @@
-+#
-+# MPTCP configuration
-+#
-+config MPTCP
-+ bool "MPTCP protocol"
-+ depends on (IPV6=y || IPV6=n)
-+ ---help---
-+ This replaces the normal TCP stack with a Multipath TCP stack,
-+ able to use several paths at once.
-+
-+menuconfig MPTCP_PM_ADVANCED
-+ bool "MPTCP: advanced path-manager control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different path-managers. You should choose 'Y' here,
-+ because otherwise you will not actively create new MPTCP-subflows.
-+
-+if MPTCP_PM_ADVANCED
-+
-+config MPTCP_FULLMESH
-+ tristate "MPTCP Full-Mesh Path-Manager"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create a full-mesh among all IP-addresses.
-+
-+config MPTCP_NDIFFPORTS
-+ tristate "MPTCP ndiff-ports"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create multiple subflows between the same
-+ pair of IP-addresses, modifying the source-port. You can set the number
-+ of subflows via the mptcp_ndiffports-sysctl.
-+
-+config MPTCP_BINDER
-+ tristate "MPTCP Binder"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This path-management module works like ndiffports, and adds the sysctl
-+ option to set the gateway (and/or path to) per each additional subflow
-+ via Loose Source Routing (IPv4 only).
-+
-+choice
-+ prompt "Default MPTCP Path-Manager"
-+ default DEFAULT
-+ help
-+ Select the Path-Manager of your choice
-+
-+ config DEFAULT_FULLMESH
-+ bool "Full mesh" if MPTCP_FULLMESH=y
-+
-+ config DEFAULT_NDIFFPORTS
-+ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
-+
-+ config DEFAULT_BINDER
-+ bool "binder" if MPTCP_BINDER=y
-+
-+ config DEFAULT_DUMMY
-+ bool "Default"
-+
-+endchoice
-+
-+endif
-+
-+config DEFAULT_MPTCP_PM
-+ string
-+ default "default" if DEFAULT_DUMMY
-+ default "fullmesh" if DEFAULT_FULLMESH
-+ default "ndiffports" if DEFAULT_NDIFFPORTS
-+ default "binder" if DEFAULT_BINDER
-+ default "default"
-+
-+menuconfig MPTCP_SCHED_ADVANCED
-+ bool "MPTCP: advanced scheduler control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different schedulers. You should choose 'Y' here,
-+ if you want to choose a different scheduler than the default one.
-+
-+if MPTCP_SCHED_ADVANCED
-+
-+config MPTCP_ROUNDROBIN
-+ tristate "MPTCP Round-Robin"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This is a very simple round-robin scheduler. Probably has bad performance
-+ but might be interesting for researchers.
-+
-+choice
-+ prompt "Default MPTCP Scheduler"
-+ default DEFAULT
-+ help
-+ Select the Scheduler of your choice
-+
-+ config DEFAULT_SCHEDULER
-+ bool "Default"
-+ ---help---
-+ This is the default scheduler, sending first on the subflow
-+ with the lowest RTT.
-+
-+ config DEFAULT_ROUNDROBIN
-+ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
-+ ---help---
-+	  This is the round-robin scheduler, sending in a round-robin
-+	  fashion.
-+
-+endchoice
-+endif
-+
-+config DEFAULT_MPTCP_SCHED
-+ string
-+ depends on (MPTCP=y)
-+ default "default" if DEFAULT_SCHEDULER
-+ default "roundrobin" if DEFAULT_ROUNDROBIN
-+ default "default"
-+
-diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
-new file mode 100644
-index 000000000000..35561a7012e3
---- /dev/null
-+++ b/net/mptcp/Makefile
-@@ -0,0 +1,20 @@
-+#
-+## Makefile for MultiPath TCP support code.
-+#
-+#
-+
-+obj-$(CONFIG_MPTCP) += mptcp.o
-+
-+mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
-+ mptcp_output.o mptcp_input.o mptcp_sched.o
-+
-+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
-+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
-+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
-+obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
-+obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
-+obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
-+obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
-+
-+mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
-+
-diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
-new file mode 100644
-index 000000000000..95d8da560715
---- /dev/null
-+++ b/net/mptcp/mptcp_binder.c
-@@ -0,0 +1,487 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#include <linux/route.h>
-+#include <linux/inet.h>
-+#include <linux/mroute.h>
-+#include <linux/spinlock_types.h>
-+#include <net/inet_ecn.h>
-+#include <net/route.h>
-+#include <net/xfrm.h>
-+#include <net/compat.h>
-+#include <linux/slab.h>
-+
-+#define MPTCP_GW_MAX_LISTS 10
-+#define MPTCP_GW_LIST_MAX_LEN 6
-+#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
-+ MPTCP_GW_MAX_LISTS)
-+
-+struct mptcp_gw_list {
-+ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
-+ u8 len[MPTCP_GW_MAX_LISTS];
-+};
-+
-+struct binder_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+
-+ /* Prevent multiple sub-sockets concurrently iterating over sockets */
-+ spinlock_t *flow_lock;
-+};
-+
-+static struct mptcp_gw_list *mptcp_gws;
-+static rwlock_t mptcp_gws_lock;
-+
-+static int mptcp_binder_ndiffports __read_mostly = 1;
-+
-+static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
-+
-+static int mptcp_get_avail_list_ipv4(struct sock *sk)
-+{
-+ int i, j, list_taken, opt_ret, opt_len;
-+ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
-+
-+ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
-+ if (mptcp_gws->len[i] == 0)
-+ goto error;
-+
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
-+ list_taken = 0;
-+
-+ /* Loop through all sub-sockets in this connection */
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
-+
-+ /* Reset length and options buffer, then retrieve
-+ * from socket
-+ */
-+ opt_len = MAX_IPOPTLEN;
-+ memset(opt, 0, MAX_IPOPTLEN);
-+ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
-+ IP_OPTIONS, opt, &opt_len);
-+ if (opt_ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, opt_ret);
-+ goto error;
-+ }
-+
-+ /* If socket has no options, it has no stake in this list */
-+ if (opt_len <= 0)
-+ continue;
-+
-+ /* Iterate options buffer */
-+ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
-+ if (*opt_ptr == IPOPT_LSRR) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
-+ goto sock_lsrr;
-+ }
-+ }
-+ continue;
-+
-+sock_lsrr:
-+ /* Pointer to the 2nd to last address */
-+ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
-+
-+ /* Addresses start 3 bytes after type offset */
-+ opt_ptr += 3;
-+ j = 0;
-+
-+ /* Different length lists cannot be the same */
-+ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
-+ continue;
-+
-+ /* Iterate if we are still inside options list
-+ * and sysctl list
-+ */
-+ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
-+ /* If there is a different address, this list must
-+ * not be set on this socket
-+ */
-+ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
-+ break;
-+
-+ /* Jump 4 bytes to next address */
-+ opt_ptr += 4;
-+ j++;
-+ }
-+
-+ /* Reached the end without a differing address, lists
-+ * are therefore identical.
-+ */
-+ if (j == mptcp_gws->len[i]) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
-+ list_taken = 1;
-+ break;
-+ }
-+ }
-+
-+ /* Free list found if not taken by a socket */
-+ if (!list_taken) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
-+ break;
-+ }
-+ }
-+
-+ if (i >= MPTCP_GW_MAX_LISTS)
-+ goto error;
-+
-+ return i;
-+error:
-+ return -1;
-+}
-+
-+/* The list of addresses is parsed each time a new connection is opened,
-+ * to make sure it's up to date. In case of error, all the lists are
-+ * marked as unavailable and the subflow's fingerprint is set to 0.
-+ */
-+static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
-+{
-+ int i, j, ret;
-+ unsigned char opt[MAX_IPOPTLEN] = {0};
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
-+
-+ /* Read lock: multiple sockets can read LSRR addresses at the same
-+ * time, but writes are done in mutual exclusion.
-+ * Spin lock: must search for free list for one socket at a time, or
-+ * multiple sockets could take the same list.
-+ */
-+ read_lock(&mptcp_gws_lock);
-+ spin_lock(fmp->flow_lock);
-+
-+ i = mptcp_get_avail_list_ipv4(sk);
-+
-+ /* Execution enters here only if a free path is found.
-+ */
-+ if (i >= 0) {
-+ opt[0] = IPOPT_NOP;
-+ opt[1] = IPOPT_LSRR;
-+ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
-+ (mptcp_gws->len[i] + 1) + 3;
-+ opt[3] = IPOPT_MINOFF;
-+ for (j = 0; j < mptcp_gws->len[i]; ++j)
-+ memcpy(opt + 4 +
-+ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
-+ &mptcp_gws->list[i][j].s_addr,
-+ sizeof(mptcp_gws->list[i][0].s_addr));
-+ /* Final destination must be part of IP_OPTIONS parameter. */
-+ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
-+ sizeof(addr.s_addr));
-+
-+ /* setsockopt must be inside the lock, otherwise another
-+ * subflow could fail to see that we have taken a list.
-+ */
-+ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
-+ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
-+ * (mptcp_gws->len[i] + 1));
-+
-+ if (ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, ret);
-+ }
-+ }
-+
-+ spin_unlock(fmp->flow_lock);
-+ read_unlock(&mptcp_gws_lock);
-+
-+ return;
-+}
-+
-+/* Parses gateways string for a list of paths to different
-+ * gateways, and stores them for use with the Loose Source Routing (LSRR)
-+ * socket option. Each list must have "," separated addresses, and the lists
-+ * themselves must be separated by "-". Returns -1 in case one or more of the
-+ * addresses is not a valid ipv4/6 address.
-+ */
-+static int mptcp_parse_gateway_ipv4(char *gateways)
-+{
-+ int i, j, k, ret;
-+ char *tmp_string = NULL;
-+ struct in_addr tmp_addr;
-+
-+ tmp_string = kzalloc(16, GFP_KERNEL);
-+ if (tmp_string == NULL)
-+ return -ENOMEM;
-+
-+ write_lock(&mptcp_gws_lock);
-+
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+
-+ /* A TMP string is used since inet_pton needs a null terminated string
-+ * but we do not want to modify the sysctl for obvious reasons.
-+ * i will iterate over the SYSCTL string, j will iterate over the
-+ * temporary string where each IP is copied into, k will iterate over
-+ * the IPs in each list.
-+ */
-+ for (i = j = k = 0;
-+ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
-+ ++i) {
-+ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
-+ /* If the temp IP is empty and the current list is
-+ * empty, we are done.
-+ */
-+ if (j == 0 && mptcp_gws->len[k] == 0)
-+ break;
-+
-+ /* Terminate the temp IP string, then if it is
-+ * non-empty parse the IP and copy it.
-+ */
-+ tmp_string[j] = '\0';
-+ if (j > 0) {
-+ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
-+
-+ ret = in4_pton(tmp_string, strlen(tmp_string),
-+ (u8 *)&tmp_addr.s_addr, '\0',
-+ NULL);
-+
-+ if (ret) {
-+ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
-+ ret,
-+ &tmp_addr.s_addr);
-+ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
-+ &tmp_addr.s_addr,
-+ sizeof(tmp_addr.s_addr));
-+ mptcp_gws->len[k]++;
-+ j = 0;
-+ tmp_string[j] = '\0';
-+ /* Since we can't impose a limit to
-+ * what the user can input, make sure
-+ * there are not too many IPs in the
-+ * SYSCTL string.
-+ */
-+ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
-+ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
-+ k,
-+ MPTCP_GW_LIST_MAX_LEN);
-+ goto error;
-+ }
-+ } else {
-+ goto error;
-+ }
-+ }
-+
-+ if (gateways[i] == '-' || gateways[i] == '\0')
-+ ++k;
-+ } else {
-+ tmp_string[j] = gateways[i];
-+ ++j;
-+ }
-+ }
-+
-+ /* Number of flows is number of gateway lists plus master flow */
-+ mptcp_binder_ndiffports = k+1;
-+
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+
-+ return 0;
-+
-+error:
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+ return -1;
-+}
-+
-+/**
-+ * Create all new subflows, by doing calls to mptcp_initX_subsockets
-+ *
-+ * This function uses a goto next_subflow, to allow releasing the lock between
-+ * new subflows and giving other processes a chance to do some work on the
-+ * socket and potentially finishing the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct binder_priv *pm_priv = container_of(work,
-+ struct binder_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (mptcp_binder_ndiffports > iter &&
-+ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void binder_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+ static DEFINE_SPINLOCK(flow_lock);
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(meta_sk)) {
-+ mptcp_fallback_default(mpcb);
-+ return;
-+ }
-+#endif
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ fmp->flow_lock = &flow_lock;
-+}
-+
-+static void binder_create_subflows(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+/* Callback functions, executed when syctl mptcp.mptcp_gateways is updated.
-+ * Inspired from proc_tcp_congestion_control().
-+ */
-+static int proc_mptcp_gateways(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ int ret;
-+ ctl_table tbl = {
-+ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
-+ };
-+
-+ if (write) {
-+ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
-+ if (tbl.data == NULL)
-+ return -1;
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (ret == 0) {
-+ ret = mptcp_parse_gateway_ipv4(tbl.data);
-+ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
-+ }
-+ kfree(tbl.data);
-+ } else {
-+ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
-+ }
-+
-+
-+ return ret;
-+}
-+
-+static struct mptcp_pm_ops binder __read_mostly = {
-+ .new_session = binder_new_session,
-+ .fully_established = binder_create_subflows,
-+ .get_local_id = binder_get_local_id,
-+ .init_subsocket_v4 = mptcp_v4_add_lsrr,
-+ .name = "binder",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct ctl_table binder_table[] = {
-+ {
-+ .procname = "mptcp_binder_gateways",
-+ .data = &sysctl_mptcp_binder_gateways,
-+ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
-+ .mode = 0644,
-+ .proc_handler = &proc_mptcp_gateways
-+ },
-+ { }
-+};
-+
-+struct ctl_table_header *mptcp_sysctl_binder;
-+
-+/* General initialization of MPTCP_PM */
-+static int __init binder_register(void)
-+{
-+ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
-+ if (!mptcp_gws)
-+ return -ENOMEM;
-+
-+ rwlock_init(&mptcp_gws_lock);
-+
-+ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
-+
-+ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
-+ binder_table);
-+ if (!mptcp_sysctl_binder)
-+ goto sysctl_fail;
-+
-+ if (mptcp_register_path_manager(&binder))
-+ goto pm_failed;
-+
-+ return 0;
-+
-+pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+sysctl_fail:
-+ kfree(mptcp_gws);
-+
-+ return -1;
-+}
-+
-+static void binder_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&binder);
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+ kfree(mptcp_gws);
-+}
-+
-+module_init(binder_register);
-+module_exit(binder_unregister);
-+
-+MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("BINDER MPTCP");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
-new file mode 100644
-index 000000000000..5d761164eb85
---- /dev/null
-+++ b/net/mptcp/mptcp_coupled.c
-@@ -0,0 +1,270 @@
-+/*
-+ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+/* Scaling is done in the numerator with alpha_scale_num and in the denominator
-+ * with alpha_scale_den.
-+ *
-+ * To downscale, we just need to use alpha_scale.
-+ *
-+ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
-+ */
-+static int alpha_scale_den = 10;
-+static int alpha_scale_num = 32;
-+static int alpha_scale = 12;
-+
-+struct mptcp_ccc {
-+ u64 alpha;
-+ bool forced_update;
-+};
-+
-+static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
-+}
-+
-+static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
-+}
-+
-+static inline u64 mptcp_ccc_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static inline bool mptcp_get_forced(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
-+}
-+
-+static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
-+}
-+
-+static void mptcp_ccc_recalc_alpha(const struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ const struct sock *sub_sk;
-+ int best_cwnd = 0, best_rtt = 0, can_send = 0;
-+ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
-+
-+ if (!mpcb)
-+ return;
-+
-+ /* Only one subflow left - fall back to normal reno-behavior
-+ * (set alpha to 1)
-+ */
-+ if (mpcb->cnt_established <= 1)
-+ goto exit;
-+
-+ /* Do regular alpha-calculation for multiple subflows */
-+
-+ /* Find the max numerator of the alpha-calculation */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ u64 tmp;
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ can_send++;
-+
-+ /* We need to look for the path, that provides the max-value.
-+ * Integer-overflow is not possible here, because
-+ * tmp will be in u64.
-+ */
-+ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
-+
-+ if (tmp >= max_numerator) {
-+ max_numerator = tmp;
-+ best_cwnd = sub_tp->snd_cwnd;
-+ best_rtt = sub_tp->srtt_us;
-+ }
-+ }
-+
-+ /* No subflow is able to send - we don't care anymore */
-+ if (unlikely(!can_send))
-+ goto exit;
-+
-+ /* Calculate the denominator */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ sum_denominator += div_u64(
-+ mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_den) * best_rtt,
-+ sub_tp->srtt_us);
-+ }
-+ sum_denominator *= sum_denominator;
-+ if (unlikely(!sum_denominator)) {
-+ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
-+ __func__, mpcb->cnt_established);
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ pr_err("%s: pi:%d, state:%d\n, rtt:%u, cwnd: %u",
-+ __func__, sub_tp->mptcp->path_index,
-+ sub_sk->sk_state, sub_tp->srtt_us,
-+ sub_tp->snd_cwnd);
-+ }
-+ }
-+
-+ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
-+
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+exit:
-+ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
-+}
-+
-+static void mptcp_ccc_init(struct sock *sk)
-+{
-+ if (mptcp(tcp_sk(sk))) {
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
-+ }
-+ /* If we do not mptcp, behave like reno: return */
-+}
-+
-+static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_LOSS)
-+ mptcp_ccc_recalc_alpha(sk);
-+}
-+
-+static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ mptcp_set_forced(mptcp_meta_sk(sk), 1);
-+}
-+
-+static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ int snd_cwnd;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ /* In "safe" area, increase. */
-+ tcp_slow_start(tp, acked);
-+ mptcp_ccc_recalc_alpha(sk);
-+ return;
-+ }
-+
-+ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
-+ mptcp_ccc_recalc_alpha(sk);
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ }
-+
-+ if (mpcb->cnt_established > 1) {
-+ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
-+
-+ /* This may happen, if at the initialization, the mpcb
-+ * was not yet attached to the sock, and thus
-+ * initializing alpha failed.
-+ */
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+ snd_cwnd = (int) div_u64 ((u64) mptcp_ccc_scale(1, alpha_scale),
-+ alpha);
-+
-+ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
-+ * Thus, we select here the max value.
-+ */
-+ if (snd_cwnd < tp->snd_cwnd)
-+ snd_cwnd = tp->snd_cwnd;
-+ } else {
-+ snd_cwnd = tp->snd_cwnd;
-+ }
-+
-+ if (tp->snd_cwnd_cnt >= snd_cwnd) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
-+ tp->snd_cwnd++;
-+ mptcp_ccc_recalc_alpha(sk);
-+ }
-+
-+ tp->snd_cwnd_cnt = 0;
-+ } else {
-+ tp->snd_cwnd_cnt++;
-+ }
-+}
-+
-+static struct tcp_congestion_ops mptcp_ccc = {
-+ .init = mptcp_ccc_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_ccc_cong_avoid,
-+ .cwnd_event = mptcp_ccc_cwnd_event,
-+ .set_state = mptcp_ccc_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "lia",
-+};
-+
-+static int __init mptcp_ccc_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_ccc);
-+}
-+
-+static void __exit mptcp_ccc_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_ccc);
-+}
-+
-+module_init(mptcp_ccc_register);
-+module_exit(mptcp_ccc_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
-new file mode 100644
-index 000000000000..28dfa0479f5e
---- /dev/null
-+++ b/net/mptcp/mptcp_ctrl.c
-@@ -0,0 +1,2401 @@
-+/*
-+ * MPTCP implementation - MPTCP-control
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <net/inet_common.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/ip6_route.h>
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/sock.h>
-+#include <net/tcp.h>
-+#include <net/tcp_states.h>
-+#include <net/transp_v6.h>
-+#include <net/xfrm.h>
-+
-+#include <linux/cryptohash.h>
-+#include <linux/kconfig.h>
-+#include <linux/module.h>
-+#include <linux/netpoll.h>
-+#include <linux/list.h>
-+#include <linux/jhash.h>
-+#include <linux/tcp.h>
-+#include <linux/net.h>
-+#include <linux/in.h>
-+#include <linux/random.h>
-+#include <linux/inetdevice.h>
-+#include <linux/workqueue.h>
-+#include <linux/atomic.h>
-+#include <linux/sysctl.h>
-+
-+static struct kmem_cache *mptcp_sock_cache __read_mostly;
-+static struct kmem_cache *mptcp_cb_cache __read_mostly;
-+static struct kmem_cache *mptcp_tw_cache __read_mostly;
-+
-+int sysctl_mptcp_enabled __read_mostly = 1;
-+int sysctl_mptcp_checksum __read_mostly = 1;
-+int sysctl_mptcp_debug __read_mostly;
-+EXPORT_SYMBOL(sysctl_mptcp_debug);
-+int sysctl_mptcp_syn_retries __read_mostly = 3;
-+
-+bool mptcp_init_failed __read_mostly;
-+
-+struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
-+EXPORT_SYMBOL(mptcp_static_key);
-+
-+static int proc_mptcp_path_manager(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_PM_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_path_manager(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_path_manager(val);
-+ return ret;
-+}
-+
-+static int proc_mptcp_scheduler(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_SCHED_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_scheduler(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_scheduler(val);
-+ return ret;
-+}
-+
-+static struct ctl_table mptcp_table[] = {
-+ {
-+ .procname = "mptcp_enabled",
-+ .data = &sysctl_mptcp_enabled,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_checksum",
-+ .data = &sysctl_mptcp_checksum,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_debug",
-+ .data = &sysctl_mptcp_debug,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_syn_retries",
-+ .data = &sysctl_mptcp_syn_retries,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_path_manager",
-+ .mode = 0644,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ .proc_handler = proc_mptcp_path_manager,
-+ },
-+ {
-+ .procname = "mptcp_scheduler",
-+ .mode = 0644,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ .proc_handler = proc_mptcp_scheduler,
-+ },
-+ { }
-+};
-+
-+static inline u32 mptcp_hash_tk(u32 token)
-+{
-+ return token % MPTCP_HASH_SIZE;
-+}
-+
-+struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+EXPORT_SYMBOL(tk_hashtable);
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* The following hash table is used to avoid collision of token */
-+static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+static bool mptcp_reqsk_find_tk(const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct mptcp_request_sock *mtreqsk;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
-+ &mptcp_reqsk_tk_htb[hash], hash_entry) {
-+ if (token == mtreqsk->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
-+ &mptcp_reqsk_tk_htb[hash]);
-+}
-+
-+static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+void mptcp_reqsk_destructor(struct request_sock *req)
-+{
-+ if (!mptcp_rsk(req)->is_sub) {
-+ if (in_softirq()) {
-+ mptcp_reqsk_remove_tk(req);
-+ } else {
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+ }
-+ } else {
-+ mptcp_hash_request_remove(req);
-+ }
-+}
-+
-+static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
-+ meta_tp->inside_tk_table = 1;
-+}
-+
-+static bool mptcp_find_token(u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
-+ if (token == meta_tp->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_set_key_reqsk(struct request_sock *req,
-+ const struct sk_buff *skb)
-+{
-+ const struct inet_request_sock *ireq = inet_rsk(req);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#endif
-+ }
-+
-+ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
-+}
-+
-+/* New MPTCP-connection request, prepare a new token for the meta-socket that
-+ * will be created in mptcp_check_req_master(), and store the received token.
-+ */
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ inet_rsk(req)->saw_mpc = 1;
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_reqsk(req, skb);
-+ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
-+ mptcp_find_token(mtreq->mptcp_loc_token));
-+
-+ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ mtreq->mptcp_rem_key = mopt->mptcp_key;
-+}
-+
-+static void mptcp_set_key_sk(const struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_sock *isk = inet_sk(sk);
-+
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
-+ isk->inet_daddr,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#endif
-+
-+ mptcp_key_sha1(tp->mptcp_loc_key,
-+ &tp->mptcp_loc_token, NULL);
-+}
-+
-+void mptcp_connect_init(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_sk(sk);
-+ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
-+ mptcp_find_token(tp->mptcp_loc_token));
-+
-+ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+/**
-+ * This function increments the refcount of the mpcb struct.
-+ * It is the responsibility of the caller to decrement when releasing
-+ * the structure.
-+ */
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
-+ tk_table) {
-+ meta_sk = (struct sock *)meta_tp;
-+ if (token == meta_tp->mptcp_loc_token &&
-+ net_eq(net, sock_net(meta_sk))) {
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ goto out;
-+ if (unlikely(token != meta_tp->mptcp_loc_token ||
-+ !net_eq(net, sock_net(meta_sk)))) {
-+ sock_gen_put(meta_sk);
-+ goto begin;
-+ }
-+ goto found;
-+ }
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+out:
-+ meta_sk = NULL;
-+found:
-+ rcu_read_unlock();
-+ return meta_sk;
-+}
-+
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
-+{
-+ /* remove from the token hashtable */
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+void mptcp_hash_remove(struct tcp_sock *meta_tp)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
-+ u32 min_time = 0, last_active = 0;
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u32 elapsed;
-+
-+ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
-+ continue;
-+
-+ elapsed = keepalive_time_elapsed(tp);
-+
-+ /* We take the one with the lowest RTT within a reasonable
-+ * (meta-RTO)-timeframe
-+ */
-+ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
-+ if (!min_time || tp->srtt_us < min_time) {
-+ min_time = tp->srtt_us;
-+ rttsk = sk;
-+ }
-+ continue;
-+ }
-+
-+ /* Otherwise, we just take the most recent active */
-+ if (!rttsk && (!last_active || elapsed < last_active)) {
-+ last_active = elapsed;
-+ lastsk = sk;
-+ }
-+ }
-+
-+ if (rttsk)
-+ return rttsk;
-+
-+ return lastsk;
-+}
-+EXPORT_SYMBOL(mptcp_select_ack_sock);
-+
-+static void mptcp_sock_def_error_report(struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (!sock_flag(sk, SOCK_DEAD))
-+ mptcp_sub_close(sk, 0);
-+
-+ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping) {
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ meta_sk->sk_err = sk->sk_err;
-+ meta_sk->sk_err_soft = sk->sk_err_soft;
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_error_report(meta_sk);
-+
-+ tcp_done(meta_sk);
-+ }
-+
-+ sk->sk_err = 0;
-+ return;
-+}
-+
-+static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
-+{
-+ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
-+ mptcp_cleanup_path_manager(mpcb);
-+ mptcp_cleanup_scheduler(mpcb);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ }
-+}
-+
-+static void mptcp_sock_destruct(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ inet_sock_destruct(sk);
-+
-+ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
-+ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
-+
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ tp->mptcp = NULL;
-+
-+ /* Taken when mpcb pointer was set */
-+ sock_put(mptcp_meta_sk(sk));
-+ mptcp_mpcb_put(tp->mpcb);
-+ } else {
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct mptcp_tw *mptw;
-+
-+ /* The mpcb is disappearing - we can make the final
-+ * update to the rcv_nxt of the time-wait-sock and remove
-+ * its reference to the mpcb.
-+ */
-+ spin_lock_bh(&mpcb->tw_lock);
-+ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
-+ list_del_rcu(&mptw->list);
-+ mptw->in_list = 0;
-+ mptcp_mpcb_put(mpcb);
-+ rcu_assign_pointer(mptw->mpcb, NULL);
-+ }
-+ spin_unlock_bh(&mpcb->tw_lock);
-+
-+ mptcp_mpcb_put(mpcb);
-+
-+ mptcp_debug("%s destroying meta-sk\n", __func__);
-+ }
-+
-+ WARN_ON(!static_key_false(&mptcp_static_key));
-+ /* Must be the last call, because is_meta_sk() above still needs the
-+ * static key
-+ */
-+ static_key_slow_dec(&mptcp_static_key);
-+}
-+
-+void mptcp_destroy_sock(struct sock *sk)
-+{
-+ if (is_meta_sk(sk)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
-+ mptcp_purge_ofo_queue(tcp_sk(sk));
-+
-+ /* We have to close all remaining subflows. Normally, they
-+ * should all be about to get closed. But, if the kernel is
-+ * forcing a closure (e.g., tcp_write_err), the subflows might
-+ * not have been closed properly (as we are waiting for the
-+ * DATA_ACK of the DATA_FIN).
-+ */
-+ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
-+ /* tcp_close has already been called - waiting for graceful
-+ * closure, or if we are retransmitting fast-close on
-+ * the subflow. The reset (or timeout) will kill the
-+ * subflow.
-+ */
-+ if (tcp_sk(sk_it)->closing ||
-+ tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ /* Allow the delayed work first to prevent time-wait state */
-+ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
-+ continue;
-+
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+
-+ mptcp_delete_synack_timer(sk);
-+ } else {
-+ mptcp_del_sock(sk);
-+ }
-+}
-+
-+static void mptcp_set_state(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* Meta is not yet established - wake up the application */
-+ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
-+ sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_set_state(meta_sk, TCP_ESTABLISHED);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
-+ }
-+ }
-+
-+ if (sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_sk(sk)->mptcp->establish_increased = 1;
-+ tcp_sk(sk)->mpcb->cnt_established++;
-+ }
-+}
-+
-+void mptcp_init_congestion_control(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
-+ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
-+
-+ /* The application didn't set the congestion control to use;
-+ * fall back to the default one.
-+ */
-+ if (ca == &tcp_init_congestion_ops)
-+ goto use_default;
-+
-+ /* Use the same congestion control as set by the user. If the
-+ * module is not available, fall back to the default one.
-+ */
-+ if (!try_module_get(ca->owner)) {
-+ pr_warn("%s: fallback to the system default CC\n", __func__);
-+ goto use_default;
-+ }
-+
-+ icsk->icsk_ca_ops = ca;
-+ if (icsk->icsk_ca_ops->init)
-+ icsk->icsk_ca_ops->init(sk);
-+
-+ return;
-+
-+use_default:
-+ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
-+ tcp_init_congestion_control(sk);
-+}
-+
-+u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
-+u32 mptcp_seed = 0;
-+
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
-+ u8 input[64];
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Initialize input with appropriate padding */
-+ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
-+ * is explicitly set too
-+ */
-+ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
-+ input[8] = 0x80; /* Padding: First bit after message = 1 */
-+ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
-+
-+ sha_init(mptcp_hashed_key);
-+ sha_transform(mptcp_hashed_key, input, workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
-+
-+ if (token)
-+ *token = mptcp_hashed_key[0];
-+ if (idsn)
-+ *idsn = *((u64 *)&mptcp_hashed_key[3]);
-+}
-+
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u8 input[128]; /* 2 512-bit blocks */
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Generate key xored with ipad */
-+ memset(input, 0x36, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], rand_1, 4);
-+ memcpy(&input[68], rand_2, 4);
-+ input[72] = 0x80; /* Padding: First bit after message = 1 */
-+ memset(&input[73], 0, 53);
-+
-+ /* Padding: Length of the message = 512 + 64 bits */
-+ input[126] = 0x02;
-+ input[127] = 0x40;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+
-+ /* Prepare second part of hmac */
-+ memset(input, 0x5C, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], hash_out, 20);
-+ input[84] = 0x80;
-+ memset(&input[85], 0, 41);
-+
-+ /* Padding: Length of the message = 512 + 160 bits */
-+ input[126] = 0x02;
-+ input[127] = 0xA0;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+}
-+
-+static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
-+{
-+ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
-+ * ======
-+ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
-+ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
-+ * TCP_NODELAY, TCP_CORK
-+ *
-+ * Socket-options handled in this function here
-+ * ======
-+ * TCP_DEFER_ACCEPT
-+ * SO_KEEPALIVE
-+ *
-+ * Socket-options on the todo-list
-+ * ======
-+ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
-+ * across other devices. - what about the api-draft?
-+ * SO_DEBUG
-+ * SO_REUSEADDR - probably we don't care about this
-+ * SO_DONTROUTE, SO_BROADCAST
-+ * SO_OOBINLINE
-+ * SO_LINGER
-+ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
-+ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
-+ * SO_RXQ_OVFL
-+ * TCP_COOKIE_TRANSACTIONS
-+ * TCP_MAXSEG
-+ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
-+ * in mptcp_retransmit_timer. AND we need to check what is
-+ * about the subsockets.
-+ * TCP_LINGER2
-+ * TCP_WINDOW_CLAMP
-+ * TCP_USER_TIMEOUT
-+ * TCP_MD5SIG
-+ *
-+ * Socket-options of no concern for the meta-socket (but for the subsocket)
-+ * ======
-+ * SO_PRIORITY
-+ * SO_MARK
-+ * TCP_CONGESTION
-+ * TCP_SYNCNT
-+ * TCP_QUICKACK
-+ */
-+
-+ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
-+ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ keepalive_time_when(tcp_sk(meta_sk)));
-+ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(master_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(master_sk)->recverr = 0;
-+}
-+
-+static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
-+{
-+ /* IP_TOS also goes to the subflow. */
-+ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
-+ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
-+ sub_sk->sk_priority = meta_sk->sk_priority;
-+ sk_dst_reset(sub_sk);
-+ }
-+
-+ /* Inherit SO_REUSEADDR */
-+ sub_sk->sk_reuse = meta_sk->sk_reuse;
-+
-+ /* Inherit snd/rcv-buffer locks */
-+ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
-+
-+ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
-+ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
-+ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(sub_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(sub_sk)->recverr = 0;
-+}
-+
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ /* skb->sk may be NULL if we receive a packet immediately after the
-+ * SYN/ACK + MP_CAPABLE.
-+ */
-+ struct sock *sk = skb->sk ? skb->sk : meta_sk;
-+ int ret = 0;
-+
-+ skb->sk = NULL;
-+
-+ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ if (sk->sk_family == AF_INET)
-+ ret = tcp_v4_do_rcv(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ ret = tcp_v6_do_rcv(sk, skb);
-+#endif
-+
-+ sock_put(sk);
-+ return ret;
-+}
-+
-+struct lock_class_key meta_key;
-+struct lock_class_key meta_slock_key;
-+
-+static void mptcp_synack_timer_handler(unsigned long data)
-+{
-+ struct sock *meta_sk = (struct sock *) data;
-+ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
-+
-+ /* Only process if socket is not in use. */
-+ bh_lock_sock(meta_sk);
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later. */
-+ mptcp_reset_synack_timer(meta_sk, HZ/20);
-+ goto out;
-+ }
-+
-+ /* May happen if the queue got destructed in mptcp_close */
-+ if (!lopt)
-+ goto out;
-+
-+ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
-+ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
-+
-+ if (lopt->qlen)
-+ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+}
-+
-+static const struct tcp_sock_ops mptcp_meta_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = mptcp_send_fin,
-+ .write_xmit = mptcp_write_xmit,
-+ .send_active_reset = mptcp_send_active_reset,
-+ .write_wakeup = mptcp_write_wakeup,
-+ .prune_ofo_queue = mptcp_prune_ofo_queue,
-+ .retransmit_timer = mptcp_retransmit_timer,
-+ .time_wait = mptcp_time_wait,
-+ .cleanup_rbuf = mptcp_cleanup_rbuf,
-+};
-+
-+static const struct tcp_sock_ops mptcp_sub_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
-+static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct mptcp_cb *mpcb;
-+ struct sock *master_sk;
-+ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
-+ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
-+ u64 idsn;
-+
-+ dst_release(meta_sk->sk_rx_dst);
-+ meta_sk->sk_rx_dst = NULL;
-+ /* This flag is set to announce sock_lock_init to
-+ * reclassify the lock-class of the master socket.
-+ */
-+ meta_tp->is_master_sk = 1;
-+ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
-+ meta_tp->is_master_sk = 0;
-+ if (!master_sk)
-+ return -ENOBUFS;
-+
-+ master_tp = tcp_sk(master_sk);
-+ master_icsk = inet_csk(master_sk);
-+
-+ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
-+ if (!mpcb) {
-+ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
-+ * All the rest is set to 0 thanks to __GFP_ZERO above.
-+ */
-+ atomic_set(&master_sk->sk_wmem_alloc, 1);
-+ sk_free(master_sk);
-+ return -ENOBUFS;
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->ipv6_mc_list = NULL;
-+ newnp->ipv6_ac_list = NULL;
-+ newnp->ipv6_fl_list = NULL;
-+ newnp->opt = NULL;
-+ newnp->pktoptions = NULL;
-+ (void)xchg(&newnp->rxpmtu, NULL);
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->hop_limit = -1;
-+ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
-+ newnp->mc_loop = 1;
-+ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
-+ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
-+ }
-+#endif
-+
-+ meta_tp->mptcp = NULL;
-+
-+ /* Store the keys and generate the peer's token */
-+ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
-+ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
-+
-+ /* Generate Initial data-sequence-numbers */
-+ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->snd_high_order[0] = idsn >> 32;
-+ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
-+
-+ meta_tp->write_seq = (u32)idsn;
-+ meta_tp->snd_sml = meta_tp->write_seq;
-+ meta_tp->snd_una = meta_tp->write_seq;
-+ meta_tp->snd_nxt = meta_tp->write_seq;
-+ meta_tp->pushed_seq = meta_tp->write_seq;
-+ meta_tp->snd_up = meta_tp->write_seq;
-+
-+ mpcb->mptcp_rem_key = remote_key;
-+ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->rcv_high_order[0] = idsn >> 32;
-+ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
-+ meta_tp->copied_seq = (u32) idsn;
-+ meta_tp->rcv_nxt = (u32) idsn;
-+ meta_tp->rcv_wup = (u32) idsn;
-+
-+ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
-+ meta_tp->snd_wnd = window;
-+ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
-+
-+ meta_tp->packets_out = 0;
-+ meta_icsk->icsk_probes_out = 0;
-+
-+ /* Set mptcp-pointers */
-+ master_tp->mpcb = mpcb;
-+ master_tp->meta_sk = meta_sk;
-+ meta_tp->mpcb = mpcb;
-+ meta_tp->meta_sk = meta_sk;
-+ mpcb->meta_sk = meta_sk;
-+ mpcb->master_sk = master_sk;
-+
-+ meta_tp->was_meta_sk = 0;
-+
-+ /* Initialize the queues */
-+ skb_queue_head_init(&mpcb->reinject_queue);
-+ skb_queue_head_init(&master_tp->out_of_order_queue);
-+ tcp_prequeue_init(master_tp);
-+ INIT_LIST_HEAD(&master_tp->tsq_node);
-+
-+ master_tp->tsq_flags = 0;
-+
-+ mutex_init(&mpcb->mpcb_mutex);
-+
-+ /* Init the accept_queue structure, we support a queue of 32 pending
-+ * connections, it does not need to be huge, since we only store here
-+ * pending subflow creations.
-+ */
-+ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
-+ inet_put_port(master_sk);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ sk_free(master_sk);
-+ return -ENOMEM;
-+ }
-+
-+ /* Redefine function-pointers as the meta-sk is now fully ready */
-+ static_key_slow_inc(&mptcp_static_key);
-+ meta_tp->mpc = 1;
-+ meta_tp->ops = &mptcp_meta_specific;
-+
-+ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
-+ meta_sk->sk_destruct = mptcp_sock_destruct;
-+
-+ /* Meta-level retransmit timer */
-+ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
-+
-+ tcp_init_xmit_timers(master_sk);
-+ /* Has been set for sending out the SYN */
-+ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
-+
-+ if (!meta_tp->inside_tk_table) {
-+ /* Adding the meta_tp in the token hashtable - coming from server-side */
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+
-+ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
-+
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ }
-+ master_tp->inside_tk_table = 0;
-+
-+ /* Init time-wait stuff */
-+ INIT_LIST_HEAD(&mpcb->tw_list);
-+ spin_lock_init(&mpcb->tw_lock);
-+
-+ INIT_HLIST_HEAD(&mpcb->callback_list);
-+
-+ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
-+
-+ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
-+ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
-+ mpcb->orig_window_clamp = meta_tp->window_clamp;
-+
-+ /* The meta is directly linked - set refcnt to 1 */
-+ atomic_set(&mpcb->mpcb_refcnt, 1);
-+
-+ mptcp_init_path_manager(mpcb);
-+ mptcp_init_scheduler(mpcb);
-+
-+ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
-+ (unsigned long)meta_sk);
-+
-+ mptcp_debug("%s: created mpcb with token %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ return 0;
-+}
-+
-+void mptcp_fallback_meta_sk(struct sock *meta_sk)
-+{
-+ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
-+ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
-+}
-+
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
-+ if (!tp->mptcp)
-+ return -ENOMEM;
-+
-+ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
-+ /* No more space for more subflows? */
-+ if (!tp->mptcp->path_index) {
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ return -EPERM;
-+ }
-+
-+ INIT_HLIST_NODE(&tp->mptcp->cb_list);
-+
-+ tp->mptcp->tp = tp;
-+ tp->mpcb = mpcb;
-+ tp->meta_sk = meta_sk;
-+
-+ static_key_slow_inc(&mptcp_static_key);
-+ tp->mpc = 1;
-+ tp->ops = &mptcp_sub_specific;
-+
-+ tp->mptcp->loc_id = loc_id;
-+ tp->mptcp->rem_id = rem_id;
-+ if (mpcb->sched_ops->init)
-+ mpcb->sched_ops->init(sk);
-+
-+ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
-+ * included in mptcp_del_sock(), because the mpcb must remain alive
-+ * until the last subsocket is completely destroyed.
-+ */
-+ sock_hold(meta_sk);
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tp->mptcp->next = mpcb->connection_list;
-+ mpcb->connection_list = tp;
-+ tp->mptcp->attached = 1;
-+
-+ mpcb->cnt_subflows++;
-+ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
-+ &meta_sk->sk_rmem_alloc);
-+
-+ mptcp_sub_inherit_sockopts(meta_sk, sk);
-+ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
-+
-+ /* As we successfully allocated the mptcp_tcp_sock, we have to
-+ * change the function-pointers here (for sk_destruct to work correctly)
-+ */
-+ sk->sk_error_report = mptcp_sock_def_error_report;
-+ sk->sk_data_ready = mptcp_data_ready;
-+ sk->sk_write_space = mptcp_write_space;
-+ sk->sk_state_change = mptcp_set_state;
-+ sk->sk_destruct = mptcp_sock_destruct;
-+
-+ if (sk->sk_family == AF_INET)
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index,
-+ &((struct inet_sock *)tp)->inet_saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &((struct inet_sock *)tp)->inet_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &sk->sk_v6_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#endif
-+
-+ return 0;
-+}
-+
-+void mptcp_del_sock(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
-+ struct mptcp_cb *mpcb;
-+
-+ if (!tp->mptcp || !tp->mptcp->attached)
-+ return;
-+
-+ mpcb = tp->mpcb;
-+ tp_prev = mpcb->connection_list;
-+
-+ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
-+ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ sk->sk_state, is_meta_sk(sk));
-+
-+ if (tp_prev == tp) {
-+ mpcb->connection_list = tp->mptcp->next;
-+ } else {
-+ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
-+ if (tp_prev->mptcp->next == tp) {
-+ tp_prev->mptcp->next = tp->mptcp->next;
-+ break;
-+ }
-+ }
-+ }
-+ mpcb->cnt_subflows--;
-+ if (tp->mptcp->establish_increased)
-+ mpcb->cnt_established--;
-+
-+ tp->mptcp->next = NULL;
-+ tp->mptcp->attached = 0;
-+ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
-+
-+ if (!skb_queue_empty(&sk->sk_write_queue))
-+ mptcp_reinject_data(sk, 0);
-+
-+ if (is_master_tp(tp))
-+ mpcb->master_sk = NULL;
-+ else if (tp->mptcp->pre_established)
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+
-+ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
-+}
-+
-+/* Updates the metasocket ULID/port data, based on the given sock.
-+ * The argument sock must be the sock accessible to the application.
-+ * In this function, we update the meta socket info, based on the changes
-+ * in the application socket (bind, address allocation, ...)
-+ */
-+void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
-+{
-+ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
-+ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
-+
-+ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
-+}
-+
-+/* Clean up the receive buffer for full frames taken by the user,
-+ * then send an ACK if necessary. COPIED is the number of bytes
-+ * tcp_recvmsg has given to the user so far, it speeds up the
-+ * calculation of whether or not we must ACK for the sake of
-+ * a window update.
-+ */
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk;
-+ __u32 rcv_window_now = 0;
-+
-+ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
-+ rcv_window_now = tcp_receive_window(meta_tp);
-+
-+ if (2 * rcv_window_now > meta_tp->window_clamp)
-+ rcv_window_now = 0;
-+ }
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (!mptcp_sk_can_send_ack(sk))
-+ continue;
-+
-+ if (!inet_csk_ack_scheduled(sk))
-+ goto second_part;
-+ /* Delayed ACKs frequently hit locked sockets during bulk
-+ * receive.
-+ */
-+ if (icsk->icsk_ack.blocked ||
-+ /* Once-per-two-segments ACK was not sent by tcp_input.c */
-+ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
-+ /* If this read emptied read buffer, we send ACK, if
-+ * connection is not bidirectional, user drained
-+ * receive buffer and there was a small segment
-+ * in queue.
-+ */
-+ (copied > 0 &&
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
-+ !icsk->icsk_ack.pingpong)) &&
-+ !atomic_read(&meta_sk->sk_rmem_alloc))) {
-+ tcp_send_ack(sk);
-+ continue;
-+ }
-+
-+second_part:
-+ /* This here is the second part of tcp_cleanup_rbuf */
-+ if (rcv_window_now) {
-+ __u32 new_window = tp->ops->__select_window(sk);
-+
-+ /* Send ACK now, if this read freed lots of space
-+ * in our buffer. Certainly, new_window is new window.
-+ * We can advertise it now, if it is not less than
-+ * current one.
-+ * "Lots" means "at least twice" here.
-+ */
-+ if (new_window && new_window >= 2 * rcv_window_now)
-+ tcp_send_ack(sk);
-+ }
-+ }
-+}
-+
-+static int mptcp_sub_send_fin(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(sk);
-+ int mss_now;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = tcp_current_mss(sk);
-+
-+ if (tcp_send_head(sk) != NULL) {
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ tp->write_seq++;
-+ } else {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (!skb)
-+ return 1;
-+
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
-+ tcp_init_nondata_skb(skb, tp->write_seq,
-+ TCPHDR_ACK | TCPHDR_FIN);
-+ tcp_queue_skb(sk, skb);
-+ }
-+ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
-+
-+ return 0;
-+}
-+
-+void mptcp_sub_close_wq(struct work_struct *work)
-+{
-+ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
-+ struct sock *sk = (struct sock *)tp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ mutex_lock(&tp->mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ if (sock_flag(sk, SOCK_DEAD))
-+ goto exit;
-+
-+ /* We come from tcp_disconnect. We are sure that meta_sk is set */
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ goto exit;
-+ }
-+
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&tp->mpcb->mpcb_mutex);
-+ sock_put(sk);
-+}
-+
-+void mptcp_sub_close(struct sock *sk, unsigned long delay)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
-+
-+ /* We are already closing - e.g., call from sock_def_error_report upon
-+ * tcp_disconnect in tcp_close.
-+ */
-+ if (tp->closing)
-+ return;
-+
-+ /* Work already scheduled? */
-+ if (work_pending(&work->work)) {
-+ /* Work present - who will be first? */
-+ if (jiffies + delay > work->timer.expires)
-+ return;
-+
-+ /* Try canceling - if it fails, work will be executed soon */
-+ if (!cancel_delayed_work(work))
-+ return;
-+ sock_put(sk);
-+ }
-+
-+ if (!delay) {
-+ unsigned char old_state = sk->sk_state;
-+
-+ /* If we are in user-context we can directly do the closing
-+ * procedure. No need to schedule a work-queue.
-+ */
-+ if (!in_softirq()) {
-+ if (sock_flag(sk, SOCK_DEAD))
-+ return;
-+
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ return;
-+ }
-+
-+ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
-+ sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+ return;
-+ }
-+
-+ /* We directly send the FIN, because it may take a long time
-+ * until the work-queue gets scheduled...
-+ *
-+ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
-+ * the old state so that tcp_close will finally send the fin
-+ * in user-context.
-+ */
-+ if (!sk->sk_err && old_state != TCP_CLOSE &&
-+ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
-+ if (old_state == TCP_ESTABLISHED)
-+ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
-+ sk->sk_state = old_state;
-+ }
-+ }
-+
-+ sock_hold(sk);
-+ queue_delayed_work(mptcp_wq, work, delay);
-+}
-+
-+void mptcp_sub_force_close(struct sock *sk)
-+{
-+ /* The below tcp_done may have freed the socket, if it is already dead.
-+ * Thus, we are not allowed to access it afterwards. That's why
-+ * we have to store the dead-state in this local variable.
-+ */
-+ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
-+
-+ tcp_sk(sk)->mp_killed = 1;
-+
-+ if (sk->sk_state != TCP_CLOSE)
-+ tcp_done(sk);
-+
-+ if (!sock_is_dead)
-+ mptcp_sub_close(sk, 0);
-+}
-+EXPORT_SYMBOL(mptcp_sub_force_close);
-+
-+/* Update the mpcb send window, based on the contributions
-+ * of each subflow
-+ */
-+void mptcp_update_sndbuf(const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk, *sk;
-+ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ new_sndbuf += sk->sk_sndbuf;
-+
-+ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
-+ new_sndbuf = sysctl_tcp_wmem[2];
-+ break;
-+ }
-+ }
-+ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
-+
-+ /* The subflow's call to sk_write_space in tcp_new_space ends up in
-+ * mptcp_write_space.
-+ * It has nothing to do with waking up the application.
-+ * So, we do it here.
-+ */
-+ if (old_sndbuf != meta_sk->sk_sndbuf)
-+ meta_sk->sk_write_space(meta_sk);
-+}
-+
-+void mptcp_close(struct sock *meta_sk, long timeout)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk_it, *tmpsk;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ int data_was_unread = 0;
-+ int state;
-+
-+ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock(meta_sk);
-+
-+ if (meta_tp->inside_tk_table) {
-+ /* Detach the mpcb from the token hashtable */
-+ mptcp_hash_remove_bh(meta_tp);
-+ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
-+ }
-+
-+ meta_sk->sk_shutdown = SHUTDOWN_MASK;
-+ /* We need to flush the recv. buffs. We do this only on the
-+ * descriptor close, not protocol-sourced closes, because the
-+ * reader process may not have drained the data yet!
-+ */
-+ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
-+ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
-+ tcp_hdr(skb)->fin;
-+ data_was_unread += len;
-+ __kfree_skb(skb);
-+ }
-+
-+ sk_mem_reclaim(meta_sk);
-+
-+ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
-+ if (meta_sk->sk_state == TCP_CLOSE) {
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+ goto adjudge_to_death;
-+ }
-+
-+ if (data_was_unread) {
-+ /* Unread data was tossed, zap the connection. */
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
-+ meta_sk->sk_allocation);
-+ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
-+ /* Check zero linger _after_ checking for unread data. */
-+ meta_sk->sk_prot->disconnect(meta_sk, 0);
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ } else if (tcp_close_state(meta_sk)) {
-+ mptcp_send_fin(meta_sk);
-+ } else if (meta_tp->snd_una == meta_tp->write_seq) {
-+ /* The DATA_FIN has been sent and acknowledged
-+ * (e.g., by sk_shutdown). Close all the other subflows
-+ */
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ unsigned long delay = 0;
-+ /* If we are the passive closer, don't trigger
-+ * subflow-fin until the subflow has been finned
-+ * by the peer. - thus we add a delay
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+
-+ sk_stream_wait_close(meta_sk, timeout);
-+
-+adjudge_to_death:
-+ state = meta_sk->sk_state;
-+ sock_hold(meta_sk);
-+ sock_orphan(meta_sk);
-+
-+ /* socket will be freed after mptcp_close - we have to prevent
-+ * access from the subflows.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ /* Similar to sock_orphan, but we don't set it DEAD, because
-+ * the callbacks are still set and must be called.
-+ */
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_set_socket(sk_it, NULL);
-+ sk_it->sk_wq = NULL;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+
-+ /* It is the last release_sock in its life. It will remove backlog. */
-+ release_sock(meta_sk);
-+
-+ /* Now socket is owned by kernel and we acquire BH lock
-+ * to finish close. No need to check for user refs.
-+ */
-+ local_bh_disable();
-+ bh_lock_sock(meta_sk);
-+ WARN_ON(sock_owned_by_user(meta_sk));
-+
-+ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
-+
-+ /* Have we already been destroyed by a softirq or backlog? */
-+ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
-+ goto out;
-+
-+ /* This is a (useful) BSD violating of the RFC. There is a
-+ * problem with TCP as specified in that the other end could
-+ * keep a socket open forever with no application left this end.
-+ * We use a 3 minute timeout (about the same as BSD) then kill
-+ * our end. If they send after that then tough - BUT: long enough
-+ * that we won't make the old 4*rto = almost no time - whoops
-+ * reset mistake.
-+ *
-+ * Nope, it was not mistake. It is really desired behaviour
-+ * f.e. on http servers, when such sockets are useless, but
-+ * consume significant resources. Let's do it with special
-+ * linger2 option. --ANK
-+ */
-+
-+ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
-+ if (meta_tp->linger2 < 0) {
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONLINGER);
-+ } else {
-+ const int tmo = tcp_fin_time(meta_sk);
-+
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ tmo - TCP_TIMEWAIT_LEN);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
-+ tmo);
-+ goto out;
-+ }
-+ }
-+ }
-+ if (meta_sk->sk_state != TCP_CLOSE) {
-+ sk_mem_reclaim(meta_sk);
-+ if (tcp_too_many_orphans(meta_sk, 0)) {
-+ if (net_ratelimit())
-+ pr_info("MPTCP: too many orphaned sockets\n");
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONMEMORY);
-+ }
-+ }
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ inet_csk_destroy_sock(meta_sk);
-+ /* Otherwise, socket is reprieved until protocol close. */
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ local_bh_enable();
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk); /* Taken by sock_hold */
-+}
-+
-+void mptcp_disconnect(struct sock *sk)
-+{
-+ struct sock *subsk, *tmpsk;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ mptcp_delete_synack_timer(sk);
-+
-+ __skb_queue_purge(&tp->mpcb->reinject_queue);
-+
-+ if (tp->inside_tk_table) {
-+ mptcp_hash_remove_bh(tp);
-+ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
-+ }
-+
-+ local_bh_disable();
-+ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
-+ /* The socket will get removed from the subsocket-list
-+ * and made non-mptcp by setting mpc to 0.
-+ *
-+ * This is necessary, because tcp_disconnect assumes
-+ * that the connection is completely dead afterwards.
-+ * Thus we need to do an mptcp_del_sock. Due to this call
-+ * we have to make it non-mptcp.
-+ *
-+ * We have to lock the socket, because we set mpc to 0.
-+ * An incoming packet would take the subsocket's lock
-+ * and go on into the receive-path.
-+ * This would be a race.
-+ */
-+
-+ bh_lock_sock(subsk);
-+ mptcp_del_sock(subsk);
-+ tcp_sk(subsk)->mpc = 0;
-+ tcp_sk(subsk)->ops = &tcp_specific;
-+ mptcp_sub_force_close(subsk);
-+ bh_unlock_sock(subsk);
-+ }
-+ local_bh_enable();
-+
-+ tp->was_meta_sk = 1;
-+ tp->mpc = 0;
-+ tp->ops = &tcp_specific;
-+}
-+
-+/* Returns 1 if we should enable MPTCP for that socket. */
-+int mptcp_doit(struct sock *sk)
-+{
-+ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return 0;
-+
-+ /* Socket may already be established (e.g., called from tcp_recvmsg) */
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
-+ return 1;
-+
-+ /* Don't do mptcp over loopback */
-+ if (sk->sk_family == AF_INET &&
-+ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
-+ return 0;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (sk->sk_family == AF_INET6 &&
-+ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
-+ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
-+ return 0;
-+#endif
-+ if (mptcp_v6_is_v4_mapped(sk) &&
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
-+ return 0;
-+
-+#ifdef CONFIG_TCP_MD5SIG
-+ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
-+ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
-+ return 0;
-+#endif
-+
-+ return 1;
-+}
-+
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct tcp_sock *master_tp;
-+ struct sock *master_sk;
-+
-+ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
-+ goto err_alloc_mpcb;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+ master_tp = tcp_sk(master_sk);
-+
-+ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
-+ goto err_add_sock;
-+
-+ if (__inet_inherit_port(meta_sk, master_sk) < 0)
-+ goto err_add_sock;
-+
-+ meta_sk->sk_prot->unhash(meta_sk);
-+
-+ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
-+ __inet_hash_nolisten(master_sk, NULL);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ __inet6_hash(master_sk, NULL);
-+#endif
-+
-+ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
-+
-+ return 0;
-+
-+err_add_sock:
-+ mptcp_fallback_meta_sk(meta_sk);
-+
-+ inet_csk_prepare_forced_close(master_sk);
-+ tcp_done(master_sk);
-+ inet_csk_prepare_forced_close(meta_sk);
-+ tcp_done(meta_sk);
-+
-+err_alloc_mpcb:
-+ return -ENOBUFS;
-+}
-+
-+static int __mptcp_check_req_master(struct sock *child,
-+ struct request_sock *req)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct sock *meta_sk = child;
-+ struct mptcp_cb *mpcb;
-+ struct mptcp_request_sock *mtreq;
-+
-+ /* Never contained an MP_CAPABLE */
-+ if (!inet_rsk(req)->mptcp_rqsk)
-+ return 1;
-+
-+ if (!inet_rsk(req)->saw_mpc) {
-+ /* Fallback to regular TCP, because we saw one SYN without
-+ * MP_CAPABLE. In tcp_check_req we continue the regular path.
-+ * But, the socket has been added to the reqsk_tk_htb, so we
-+ * must still remove it.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+ return 1;
-+ }
-+
-+ /* Just set these values to pass them to mptcp_alloc_mpcb */
-+ mtreq = mptcp_rsk(req);
-+ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
-+ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
-+
-+ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
-+ child_tp->snd_wnd))
-+ return -ENOBUFS;
-+
-+ child = tcp_sk(child)->mpcb->master_sk;
-+ child_tp = tcp_sk(child);
-+ mpcb = child_tp->mpcb;
-+
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+
-+ mpcb->dss_csum = mtreq->dss_csum;
-+ mpcb->server_side = 1;
-+
-+ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
-+ mptcp_update_metasocket(child, meta_sk);
-+
-+ /* Needs to be done here additionally, because when accepting a
-+ * new connection we go through __reqsk_free and not reqsk_free.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+
-+ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
-+ sock_put(meta_sk);
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
-+{
-+ struct sock *meta_sk = child, *master_sk;
-+ struct sk_buff *skb;
-+ u32 new_mapping;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+
-+ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
-+ * pre-MPTCP data in the receive queue.
-+ */
-+ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
-+ tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* Map subflow sequence number to data sequence numbers. We need to map
-+ * these data to [IDSN - len - 1, IDSN).
-+ */
-+ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* There should be only one skb: the SYN + data. */
-+ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* With fastopen we change the semantics of the relative subflow
-+ * sequence numbers to deal with middleboxes that could add/remove
-+ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
-+ * instead of the regular TCP ISN.
-+ */
-+ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
-+
-+ /* We need to update copied_seq of the master_sk to account for the
-+ * already moved data to the meta receive queue.
-+ */
-+ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
-+
-+ /* Handled by the master_sk */
-+ tcp_sk(meta_sk)->fastopen_rsk = NULL;
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ struct sock *meta_sk = child;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ inet_csk_reqsk_queue_removed(sk, req);
-+ inet_csk_reqsk_queue_add(sk, req, meta_sk);
-+
-+ return 0;
-+}
-+
-+struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ u8 hash_mac_check[20];
-+
-+ child_tp->inside_tk_table = 0;
-+
-+ if (!mopt->join_ack)
-+ goto teardown;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mtreq->mptcp_rem_nonce,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+
-+ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
-+ goto teardown;
-+
-+ /* Point it to the same struct socket and wq as the meta_sk */
-+ sk_set_socket(child, meta_sk->sk_socket);
-+ child->sk_wq = meta_sk->sk_wq;
-+
-+ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
-+ /* Has been inherited, but now child_tp->mptcp is NULL */
-+ child_tp->mpc = 0;
-+ child_tp->ops = &tcp_specific;
-+
-+ /* TODO when we support acking the third ack for new subflows,
-+ * we should silently discard this third ack, by returning NULL.
-+ *
-+ * Maybe, at the retransmission we will have enough memory to
-+ * fully add the socket to the meta-sk.
-+ */
-+ goto teardown;
-+ }
-+
-+ /* The child is a clone of the meta socket, we must now reset
-+ * some of the fields
-+ */
-+ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
-+
-+ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
-+ * use the original values instead of the bloated up ones from the
-+ * clone.
-+ */
-+ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
-+ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
-+
-+ child_tp->mptcp->slave_sk = 1;
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
-+
-+ child_tp->tsq_flags = 0;
-+
-+ /* Subflows do not use the accept queue, as they
-+ * are attached immediately to the mpcb.
-+ */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ return child;
-+
-+teardown:
-+ /* Drop this request - sock creation failed. */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ inet_csk_prepare_forced_close(child);
-+ tcp_done(child);
-+ return meta_sk;
-+}
-+
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_tw *mptw;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ /* A subsocket in tw can only receive data. So, if we are in
-+ * infinite-receive, then we should not reply with a data-ack or act
-+ * upon general MPTCP-signaling. We prevent this by simply not creating
-+ * the mptcp_tw_sock.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tw->mptcp_tw = NULL;
-+ return 0;
-+ }
-+
-+ /* Alloc MPTCP-tw-sock */
-+ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
-+ if (!mptw)
-+ return -ENOBUFS;
-+
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tw->mptcp_tw = mptw;
-+ mptw->loc_key = mpcb->mptcp_loc_key;
-+ mptw->meta_tw = mpcb->in_time_wait;
-+ if (mptw->meta_tw) {
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
-+ if (mpcb->mptw_state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_assign_pointer(mptw->mpcb, mpcb);
-+
-+ spin_lock(&mpcb->tw_lock);
-+ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
-+ mptw->in_list = 1;
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ return 0;
-+}
-+
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_cb *mpcb;
-+
-+ rcu_read_lock();
-+ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
-+
-+ /* If we are still holding a ref to the mpcb, we have to remove ourselves
-+ * from the list and drop the ref properly.
-+ */
-+ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
-+ spin_lock(&mpcb->tw_lock);
-+ if (tw->mptcp_tw->in_list) {
-+ list_del_rcu(&tw->mptcp_tw->list);
-+ tw->mptcp_tw->in_list = 0;
-+ }
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ /* Twice, because we increased it above */
-+ mptcp_mpcb_put(mpcb);
-+ mptcp_mpcb_put(mpcb);
-+ }
-+
-+ rcu_read_unlock();
-+
-+ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
-+}
-+
-+/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
-+ * data-fin.
-+ */
-+void mptcp_time_wait(struct sock *sk, int state, int timeo)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_tw *mptw;
-+
-+ /* Used for sockets that go into tw after the meta
-+ * (see mptcp_init_tw_sock())
-+ */
-+ tp->mpcb->in_time_wait = 1;
-+ tp->mpcb->mptw_state = state;
-+
-+ /* Update the time-wait-sock's information */
-+ rcu_read_lock_bh();
-+ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
-+ mptw->meta_tw = 1;
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
-+
-+ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
-+ * pretend as if the DATA_FIN has already reached us; that way
-+ * the checks in tcp_timewait_state_process will pass when the
-+ * DATA_FIN comes in.
-+ */
-+ if (state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_read_unlock_bh();
-+
-+ tcp_done(sk);
-+}
-+
-+void mptcp_tsq_flags(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* It will be handled as a regular deferred-call */
-+ if (is_meta_sk(sk))
-+ return;
-+
-+ if (hlist_unhashed(&tp->mptcp->cb_list)) {
-+ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
-+ /* We need to hold it here, as the sock_hold is not assured
-+ * by the release_sock as it is done in regular TCP.
-+ *
-+ * The subsocket may get inet_csk_destroy'd while it is inside
-+ * the callback_list.
-+ */
-+ sock_hold(sk);
-+ }
-+
-+ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
-+ sock_hold(meta_sk);
-+}
-+
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_tcp_sock *mptcp;
-+ struct hlist_node *tmp;
-+
-+ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
-+
-+ __sock_put(meta_sk);
-+ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
-+ struct tcp_sock *tp = mptcp->tp;
-+ struct sock *sk = (struct sock *)tp;
-+
-+ hlist_del_init(&mptcp->cb_list);
-+ sk->sk_prot->release_cb(sk);
-+ /* Final sock_put (cf. mptcp_tsq_flags) */
-+ sock_put(sk);
-+ }
-+}
-+
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_options_received mopt;
-+ u8 mptcp_hash_mac[20];
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mtreq->mptcp_mpcb = mpcb;
-+ mtreq->is_sub = 1;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+
-+ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
-+ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
-+
-+ mtreq->rem_id = mopt.rem_id;
-+ mtreq->rcv_low_prio = mopt.low_prio;
-+ inet_rsk(req)->saw_mpc = 1;
-+}
-+
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ struct mptcp_request_sock *mreq = mptcp_rsk(req);
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mreq->is_sub = 0;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+ mreq->dss_csum = mopt.dss_csum;
-+ mreq->hash_entry.pprev = NULL;
-+
-+ mptcp_reqsk_new_mptcp(req, &mopt, skb);
-+}
-+
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false;
-+
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb,
-+ mptcp_request_sock_ops.slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ if (mopt.is_mp_join)
-+ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
-+ if (mopt.drop_me)
-+ goto drop;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
-+ mopt.saw_mpc = 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (skb_rtable(skb)->rt_flags &
-+ (RTCF_BROADCAST | RTCF_MULTICAST))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_request_sock_ipv4_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v4_conn_request(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (!ipv6_unicast_destination(skb))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_request_sock_ipv6_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v6_conn_request(sk, skb);
-+#endif
-+ }
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+
-+struct workqueue_struct *mptcp_wq;
-+EXPORT_SYMBOL(mptcp_wq);
-+
-+/* Output /proc/net/mptcp */
-+static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
-+{
-+ struct tcp_sock *meta_tp;
-+ const struct net *net = seq->private;
-+ int i, n = 0;
-+
-+ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
-+ seq_putc(seq, '\n');
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ struct hlist_nulls_node *node;
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node,
-+ &tk_hashtable[i], tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp;
-+ struct inet_sock *isk = inet_sk(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
-+ continue;
-+
-+ if (capable(CAP_NET_ADMIN)) {
-+ seq_printf(seq, "%4d: %04X %04X ", n++,
-+ mpcb->mptcp_loc_token,
-+ mpcb->mptcp_rem_token);
-+ } else {
-+ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
-+ }
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
-+ isk->inet_rcv_saddr,
-+ ntohs(isk->inet_sport),
-+ isk->inet_daddr,
-+ ntohs(isk->inet_dport));
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
-+ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
-+ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
-+ src->s6_addr32[0], src->s6_addr32[1],
-+ src->s6_addr32[2], src->s6_addr32[3],
-+ ntohs(isk->inet_sport),
-+ dst->s6_addr32[0], dst->s6_addr32[1],
-+ dst->s6_addr32[2], dst->s6_addr32[3],
-+ ntohs(isk->inet_dport));
-+#endif
-+ }
-+ seq_printf(seq, " %02X %02X %08X:%08X %lu",
-+ meta_sk->sk_state, mpcb->cnt_subflows,
-+ meta_tp->write_seq - meta_tp->snd_una,
-+ max_t(int, meta_tp->rcv_nxt -
-+ meta_tp->copied_seq, 0),
-+ sock_i_ino(meta_sk));
-+ seq_putc(seq, '\n');
-+ }
-+
-+ rcu_read_unlock_bh();
-+ }
-+
-+ return 0;
-+}
-+
-+static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_pm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_pm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_pm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_pm_init_net(struct net *net)
-+{
-+ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
-+ return -ENOMEM;
-+
-+ return 0;
-+}
-+
-+static void mptcp_pm_exit_net(struct net *net)
-+{
-+ remove_proc_entry("mptcp", net->proc_net);
-+}
-+
-+static struct pernet_operations mptcp_pm_proc_ops = {
-+ .init = mptcp_pm_init_net,
-+ .exit = mptcp_pm_exit_net,
-+};
-+
-+/* General initialization of mptcp */
-+void __init mptcp_init(void)
-+{
-+ int i;
-+ struct ctl_table_header *mptcp_sysctl;
-+
-+ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
-+ sizeof(struct mptcp_tcp_sock),
-+ 0, SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_sock_cache)
-+ goto mptcp_sock_cache_failed;
-+
-+ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_cb_cache)
-+ goto mptcp_cb_cache_failed;
-+
-+ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_tw_cache)
-+ goto mptcp_tw_cache_failed;
-+
-+ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
-+
-+ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
-+ if (!mptcp_wq)
-+ goto alloc_workqueue_failed;
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
-+ i + MPTCP_REQSK_NULLS_BASE);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
-+ }
-+
-+ spin_lock_init(&mptcp_reqsk_hlock);
-+ spin_lock_init(&mptcp_tk_hashlock);
-+
-+ if (register_pernet_subsys(&mptcp_pm_proc_ops))
-+ goto pernet_failed;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (mptcp_pm_v6_init())
-+ goto mptcp_pm_v6_failed;
-+#endif
-+ if (mptcp_pm_v4_init())
-+ goto mptcp_pm_v4_failed;
-+
-+ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
-+ if (!mptcp_sysctl)
-+ goto register_sysctl_failed;
-+
-+ if (mptcp_register_path_manager(&mptcp_pm_default))
-+ goto register_pm_failed;
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_default))
-+ goto register_sched_failed;
-+
-+ pr_info("MPTCP: Stable release v0.89.0-rc\n");
-+
-+ mptcp_init_failed = false;
-+
-+ return;
-+
-+register_sched_failed:
-+ mptcp_unregister_path_manager(&mptcp_pm_default);
-+register_pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl);
-+register_sysctl_failed:
-+ mptcp_pm_v4_undo();
-+mptcp_pm_v4_failed:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_pm_v6_undo();
-+mptcp_pm_v6_failed:
-+#endif
-+ unregister_pernet_subsys(&mptcp_pm_proc_ops);
-+pernet_failed:
-+ destroy_workqueue(mptcp_wq);
-+alloc_workqueue_failed:
-+ kmem_cache_destroy(mptcp_tw_cache);
-+mptcp_tw_cache_failed:
-+ kmem_cache_destroy(mptcp_cb_cache);
-+mptcp_cb_cache_failed:
-+ kmem_cache_destroy(mptcp_sock_cache);
-+mptcp_sock_cache_failed:
-+ mptcp_init_failed = true;
-+}
-diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
-new file mode 100644
-index 000000000000..3a54413ce25b
---- /dev/null
-+++ b/net/mptcp/mptcp_fullmesh.c
-@@ -0,0 +1,1722 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#include <net/addrconf.h>
-+#endif
-+
-+enum {
-+ MPTCP_EVENT_ADD = 1,
-+ MPTCP_EVENT_DEL,
-+ MPTCP_EVENT_MOD,
-+};
-+
-+#define MPTCP_SUBFLOW_RETRY_DELAY 1000
-+
-+/* Max number of local or remote addresses we can store.
-+ * When changing, see the bitfield below in fullmesh_rem4/6.
-+ */
-+#define MPTCP_MAX_ADDR 8
-+
-+struct fullmesh_rem4 {
-+ u8 rem4_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct fullmesh_rem6 {
-+ u8 rem6_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_loc_addr {
-+ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
-+ u8 loc4_bits;
-+ u8 next_v4_index;
-+
-+ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
-+ u8 loc6_bits;
-+ u8 next_v6_index;
-+};
-+
-+struct mptcp_addr_event {
-+ struct list_head list;
-+ unsigned short family;
-+ u8 code:7,
-+ low_prio:1;
-+ union inet_addr addr;
-+};
-+
-+struct fullmesh_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+ /* Delayed worker, used when the routing-tables are not yet ready. */
-+ struct delayed_work subflow_retry_work;
-+
-+ /* Remote addresses */
-+ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
-+ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
-+
-+ struct mptcp_cb *mpcb;
-+
-+ u16 remove_addrs; /* Addresses to remove */
-+ u8 announced_addrs_v4; /* IPv4 addresses we have announced */
-+ u8 announced_addrs_v6; /* IPv6 addresses we have announced */
-+
-+ u8 add_addr; /* Are we sending an add_addr? */
-+
-+ u8 rem4_bits;
-+ u8 rem6_bits;
-+};
-+
-+struct mptcp_fm_ns {
-+ struct mptcp_loc_addr __rcu *local;
-+ spinlock_t local_lock; /* Protecting the above pointer */
-+ struct list_head events;
-+ struct delayed_work address_worker;
-+
-+ struct net *net;
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly;
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk);
-+
-+static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
-+{
-+ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
-+}
-+
-+static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
-+{
-+ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
-+}
-+
-+/* Find the first free index in the bitfield */
-+static int __mptcp_find_free_index(u8 bitfield, u8 base)
-+{
-+ int i;
-+
-+ /* There are anyway no free bits... */
-+ if (bitfield == 0xff)
-+ goto exit;
-+
-+ i = ffs(~(bitfield >> base)) - 1;
-+ if (i < 0)
-+ goto exit;
-+
-+ /* No free bits when starting at base, try from 0 on */
-+ if (i + base >= sizeof(bitfield) * 8)
-+ return __mptcp_find_free_index(bitfield, 0);
-+
-+ return i + base;
-+exit:
-+ return -1;
-+}
-+
-+static int mptcp_find_free_index(u8 bitfield)
-+{
-+ return __mptcp_find_free_index(bitfield, 0);
-+}
-+
-+static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
-+ const struct in_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem4 *rem4;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem4->rem4_id == id &&
-+ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP-packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * OUR BOX sees it.
-+ */
-+ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
-+ __func__, &rem4->addr.s_addr,
-+ &addr->s_addr, id);
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem4_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
-+ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
-+ return;
-+ }
-+
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is not known yet, store it */
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ rem4->bitfield = 0;
-+ rem4->retry_bitfield = 0;
-+ rem4->rem4_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem4_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem6 *rem6;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem6->rem6_id == id &&
-+ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP-packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * OUR BOX sees it.
-+ */
-+ if (rem6->rem6_id == id) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
-+ __func__, &rem6->addr, addr, id);
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem6_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
-+ __func__, MPTCP_MAX_ADDR, addr);
-+ return;
-+ }
-+
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is not known yet, store it */
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ rem6->bitfield = 0;
-+ rem6->retry_bitfield = 0;
-+ rem6->rem6_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem6_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].rem4_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem4_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (fmp->remaddr6[i].rem6_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem6_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
-+ const struct in_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
-+ fmp->remaddr4[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
-+ fmp->remaddr6[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
-+ else
-+ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
-+}
-+
-+static void retry_subflow_worker(struct work_struct *work)
-+{
-+ struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct fullmesh_priv *fmp = container_of(delayed_work,
-+ struct fullmesh_priv,
-+ subflow_retry_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
-+ /* Do we need to retry establishing a subflow? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
-+
-+ /* Do we need to retry establishing a subflow? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
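The retry worker above walks `retry_bitfield`, where `mptcp_find_free_index(~retry_bitfield)` resolves to the lowest set bit, i.e. the next address pair left to retry, and then moves that index from the retry set into the established set. A minimal user-space sketch of that bitfield bookkeeping (hypothetical helper names, 8-bit fields as in `fullmesh_rem4`):

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit, or -1 if none: this is what
 * mptcp_find_free_index(~retry_bitfield) effectively computes. */
int lowest_set_index(uint8_t bits)
{
	int i;

	for (i = 0; i < 8; i++)
		if (bits & (1u << i))
			return i;
	return -1;
}

/* rem->bitfield |= (1 << i): mark the pair as established/attempted. */
uint8_t with_bit_set(uint8_t bits, int i)
{
	return (uint8_t)(bits | (1u << i));
}

/* rem->retry_bitfield &= ~(1 << i): drop the pair from the retry set. */
uint8_t with_bit_cleared(uint8_t bits, int i)
{
	return (uint8_t)(bits & ~(1u << i));
}
```

The worker loops via `goto next_subflow` until `lowest_set_index()` finds no bit left, handling one pair per lock-hold.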
-+
-+/**
-+ * Create all new subflows by calling mptcp_initX_subsockets().
-+ *
-+ * This function uses a goto next_subflow to allow releasing the lock between
-+ * the creation of new subflows, giving other processes a chance to do some
-+ * work on the socket and potentially finish the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, retry = 0;
-+ int i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr4[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
-+ &rem4) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr6[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
-+ &rem6) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
-+ sock_hold(meta_sk);
-+ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
-+ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
-+ }
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
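The "combinations to handle" test above is `~(rem->bitfield) & mptcp_local->loc4_bits`: a bit survives iff the local address exists and no subflow towards this remote address uses it yet. As a standalone sketch:

```c
#include <stdint.h>

/* Remaining (local, remote) combinations for one remote address:
 * set iff the local slot is populated (loc_bits) and not yet used
 * towards this peer (~bitfield). */
uint8_t remaining_combinations(uint8_t bitfield, uint8_t loc_bits)
{
	return (uint8_t)(~bitfield & loc_bits);
}
```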
-+
-+static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct sock *sk = mptcp_select_ack_sock(meta_sk);
-+
-+ fmp->remove_addrs |= (1 << addr_id);
-+ mpcb->addr_signal = 1;
-+
-+ if (sk)
-+ tcp_send_ack(sk);
-+}
-+
-+static void update_addr_bitfields(struct sock *meta_sk,
-+ const struct mptcp_loc_addr *mptcp_local)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ int i;
-+
-+ /* The bits in announced_addrs_* always match with loc*_bits. So, a
-+ * simple & operation unsets the correct bits, because these go from
-+ * announced to non-announced
-+ */
-+ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
-+ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
-+ }
-+
-+ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
-+ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
-+ }
-+}
-+
-+static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
-+ sa_family_t family, const union inet_addr *addr)
-+{
-+ int i;
-+ u8 loc_bits;
-+ bool found = false;
-+
-+ if (family == AF_INET)
-+ loc_bits = mptcp_local->loc4_bits;
-+ else
-+ loc_bits = mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(loc_bits, i) {
-+ if (family == AF_INET &&
-+ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
-+ found = true;
-+ break;
-+ }
-+ if (family == AF_INET6 &&
-+ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
-+ &addr->in6)) {
-+ found = true;
-+ break;
-+ }
-+ }
-+
-+ if (!found)
-+ return -1;
-+
-+ return i;
-+}
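`mptcp_find_address()` scans only the slots whose bit is set in `loc4_bits`/`loc6_bits` and returns the slot index of the matching address, or -1. A user-space sketch of the IPv4 case (addresses reduced to plain `uint32_t` values for illustration):

```c
#include <stdint.h>

/* Return the index of addr among the populated slots, or -1.
 * Slots whose bit is clear in loc_bits are skipped, mirroring
 * mptcp_for_each_bit_set() in the original. */
int find_address_index(const uint32_t *slots, uint8_t loc_bits, uint32_t addr)
{
	int i;

	for (i = 0; i < 8; i++) {
		if (!(loc_bits & (1u << i)))
			continue;
		if (slots[i] == addr)
			return i;
	}
	return -1;
}
```

Note that a stale value in an unpopulated slot is never matched, which is why the DEL path can trust the returned id.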
-+
-+static void mptcp_address_worker(struct work_struct *work)
-+{
-+ const struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
-+ struct mptcp_fm_ns,
-+ address_worker);
-+ struct net *net = fm_ns->net;
-+ struct mptcp_addr_event *event = NULL;
-+ struct mptcp_loc_addr *mptcp_local, *old;
-+ int i, id = -1; /* id is used in the socket-code on a delete-event */
-+ bool success; /* Used to indicate if we succeeded handling the event */
-+
-+next_event:
-+ success = false;
-+ kfree(event);
-+
-+ /* First, let's dequeue an event from our event-list */
-+ rcu_read_lock_bh();
-+ spin_lock(&fm_ns->local_lock);
-+
-+ event = list_first_entry_or_null(&fm_ns->events,
-+ struct mptcp_addr_event, list);
-+ if (!event) {
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+ return;
-+ }
-+
-+ list_del(&event->list);
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+
-+ /* Not in the list - so we don't care */
-+ if (id < 0) {
-+ mptcp_debug("%s could not find id\n", __func__);
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET)
-+ mptcp_local->loc4_bits &= ~(1 << id);
-+ else
-+ mptcp_local->loc6_bits &= ~(1 << id);
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ } else {
-+ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+ int j = i;
-+
-+ if (j < 0) {
-+ /* Not in the list, so we have to find an empty slot */
-+ if (event->family == AF_INET)
-+ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
-+ mptcp_local->next_v4_index);
-+ if (event->family == AF_INET6)
-+ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
-+ mptcp_local->next_v6_index);
-+
-+ if (i < 0) {
-+ mptcp_debug("%s no more space\n", __func__);
-+ goto duno;
-+ }
-+
-+ /* It might have been a MOD-event. */
-+ event->code = MPTCP_EVENT_ADD;
-+ } else {
-+ /* Let's check if anything changes */
-+ if (event->family == AF_INET &&
-+ event->low_prio == mptcp_local->locaddr4[i].low_prio)
-+ goto duno;
-+
-+ if (event->family == AF_INET6 &&
-+ event->low_prio == mptcp_local->locaddr6[i].low_prio)
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET) {
-+ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
-+ mptcp_local->locaddr4[i].loc4_id = i + 1;
-+ mptcp_local->locaddr4[i].low_prio = event->low_prio;
-+ } else {
-+ mptcp_local->locaddr6[i].addr = event->addr.in6;
-+ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
-+ mptcp_local->locaddr6[i].low_prio = event->low_prio;
-+ }
-+
-+ if (j < 0) {
-+ if (event->family == AF_INET) {
-+ mptcp_local->loc4_bits |= (1 << i);
-+ mptcp_local->next_v4_index = i + 1;
-+ } else {
-+ mptcp_local->loc6_bits |= (1 << i);
-+ mptcp_local->next_v6_index = i + 1;
-+ }
-+ }
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ }
-+ success = true;
-+
-+duno:
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+
-+ if (!success)
-+ goto next_event;
-+
-+ /* Now we iterate over the MPTCP-sockets and apply the event. */
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ const struct hlist_nulls_node *node;
-+ struct tcp_sock *meta_tp;
-+
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
-+ tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ if (sock_net(meta_sk) != net)
-+ continue;
-+
-+ if (meta_v4) {
-+ /* skip IPv6 events if meta is IPv4 */
-+ if (event->family == AF_INET6)
-+ continue;
-+ }
-+ /* skip IPv4 events if IPV6_V6ONLY is set */
-+ else if (event->family == AF_INET &&
-+ inet6_sk(meta_sk)->ipv6only)
-+ continue;
-+
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ continue;
-+
-+ bh_lock_sock(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
-+ mpcb->infinite_mapping_snd ||
-+ mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping)
-+ goto next;
-+
-+ /* The path manager may have changed in the meantime */
-+ if (mpcb->pm_ops != &full_mesh)
-+ goto next;
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
-+ &meta_tp->tsq_flags))
-+ sock_hold(meta_sk);
-+
-+ goto next;
-+ }
-+
-+ if (event->code == MPTCP_EVENT_ADD) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ struct sock *sk, *tmpsk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ bool found = false;
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ /* In any case, we need to update our bitfields */
-+ if (id >= 0)
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ /* Look for the socket and remove it */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ if ((event->family == AF_INET6 &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))) ||
-+ (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))))
-+ continue;
-+
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
-+ continue;
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
-+ continue;
-+
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ /* We announce the removal of this id */
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ found = true;
-+ }
-+
-+ if (found)
-+ goto next;
-+
-+ /* The id may have been given by the event,
-+ * matching on a local address. And it may not
-+ * have matched on one of the above sockets,
-+ * because the client never created a subflow.
-+ * So, we have to finally remove it here.
-+ */
-+ if (id > 0)
-+ announce_remove_addr(id, meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_MOD) {
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+ }
-+ }
-+next:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+ }
-+ rcu_read_unlock_bh();
-+ }
-+ goto next_event;
-+}
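On an ADD event the worker picks an empty slot with `__mptcp_find_free_index(loc4_bits, next_v4_index)` and afterwards stores `i + 1` as the next start position. A sketch under the assumption that the helper searches for a clear bit starting at a rotating position, so freed slots (and their address IDs) are not immediately reused:

```c
#include <stdint.h>

/* Assumed semantics of __mptcp_find_free_index(): find a clear bit,
 * beginning the search at `start` and wrapping around; -1 when all
 * eight slots are taken ("no more space" in the worker above). */
int find_free_index_from(uint8_t bits, int start)
{
	int i;

	for (i = 0; i < 8; i++) {
		int idx = (start + i) % 8;

		if (!(bits & (1u << idx)))
			return idx;
	}
	return -1;
}
```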
-+
-+static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
-+ const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ list_for_each_entry(eventq, &fm_ns->events, list) {
-+ if (eventq->family != event->family)
-+ continue;
-+ if (event->family == AF_INET) {
-+ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
-+ return eventq;
-+ } else {
-+ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
-+ return eventq;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+/* We already hold the net-namespace MPTCP-lock */
-+static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ if (eventq) {
-+ switch (event->code) {
-+ case MPTCP_EVENT_DEL:
-+ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ break;
-+ case MPTCP_EVENT_ADD:
-+ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_ADD;
-+ return;
-+ case MPTCP_EVENT_MOD:
-+ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_MOD;
-+ return;
-+ }
-+ }
-+
-+ /* OK, we have to add the new address to the wait queue */
-+ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
-+ if (!eventq)
-+ return;
-+
-+ list_add_tail(&eventq->list, &fm_ns->events);
-+
-+ /* Create work-queue */
-+ if (!delayed_work_pending(&fm_ns->address_worker))
-+ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
-+ msecs_to_jiffies(500));
-+}
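`add_pm_event()` coalesces per-address events: when an event for the same address is already queued, an incoming DEL drops the queued entry outright, while ADD and MOD simply overwrite its code and priority. Expressed as a pure function (hypothetical enum values, not the kernel's):

```c
/* Sketch of the coalescing rules above. EV_DROP means the queued
 * event is unlinked and freed; otherwise the queued event takes the
 * returned code and the incoming event's low_prio. */
enum ev_code { EV_ADD = 0, EV_DEL = 1, EV_MOD = 2, EV_DROP = -1 };

int coalesce_queued_event(int incoming)
{
	switch (incoming) {
	case EV_DEL:
		return EV_DROP;	/* delete cancels whatever was pending */
	case EV_ADD:
		return EV_ADD;	/* pending event becomes an ADD */
	case EV_MOD:
		return EV_MOD;	/* pending event becomes a MOD */
	default:
		return EV_DROP;
	}
}
```

Only when no similar event is queued does a new entry get appended and the 500 ms delayed worker scheduled.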
-+
-+static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->ifa_dev->dev;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->ifa_scope > RT_SCOPE_LINK ||
-+ ipv4_is_loopback(ifa->ifa_local))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET;
-+ mpevent.addr.in.s_addr = ifa->ifa_local;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
-+ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv4-addr add/rem-events */
-+static int mptcp_pm_inetaddr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr)
-+{
-+ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa->ifa_dev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ addr4_event_handler(ifa, event, net);
-+
-+ return NOTIFY_DONE;
-+}
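`addr4_event_handler()` (and its IPv6 twin below) classify a notifier event into an MPTCP event code: DEL wins whenever the interface is effectively unusable, otherwise UP maps to ADD and CHANGE to MOD. A sketch with stand-in constants (the real `NETDEV_*` values and `IFF_*` flags live in kernel headers):

```c
/* Hypothetical stand-ins for the kernel constants used above. */
enum { NETDEV_UP_E = 1, NETDEV_DOWN_E = 2, NETDEV_CHANGE_E = 4 };
enum { MP_ADD = 0, MP_DEL = 1, MP_MOD = 2, MP_NONE = -1 };

/* Mirrors the if/else cascade in addr4_event_handler(): a down event,
 * a stopped interface, IFF_NOMULTIPATH, or a cleared IFF_UP all force
 * a DEL, regardless of the notifier event code. */
int classify_addr_event(int event, int if_running, int if_up, int nomultipath)
{
	if (event == NETDEV_DOWN_E || !if_running || nomultipath || !if_up)
		return MP_DEL;
	if (event == NETDEV_UP_E)
		return MP_ADD;
	if (event == NETDEV_CHANGE_E)
		return MP_MOD;
	return MP_NONE;
}
```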
-+
-+static struct notifier_block mptcp_pm_inetaddr_notifier = {
-+ .notifier_call = mptcp_pm_inetaddr_event,
-+};
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+
-+/* IPV6-related address/interface watchers */
-+struct mptcp_dad_data {
-+ struct timer_list timer;
-+ struct inet6_ifaddr *ifa;
-+};
-+
-+static void dad_callback(unsigned long arg);
-+static int inet6_addr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr);
-+
-+static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
-+{
-+ return (ifa->flags & IFA_F_TENTATIVE) &&
-+ ifa->state == INET6_IFADDR_STATE_DAD;
-+}
-+
-+static void dad_init_timer(struct mptcp_dad_data *data,
-+ struct inet6_ifaddr *ifa)
-+{
-+ data->ifa = ifa;
-+ data->timer.data = (unsigned long)data;
-+ data->timer.function = dad_callback;
-+ if (ifa->idev->cnf.rtr_solicit_delay)
-+ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
-+ else
-+ data->timer.expires = jiffies + (HZ/10);
-+}
-+
-+static void dad_callback(unsigned long arg)
-+{
-+ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
-+
-+ if (ipv6_is_in_dad_state(data->ifa)) {
-+ dad_init_timer(data, data->ifa);
-+ add_timer(&data->timer);
-+ } else {
-+ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
-+ in6_ifa_put(data->ifa);
-+ kfree(data);
-+ }
-+}
-+
-+static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
-+{
-+ struct mptcp_dad_data *data;
-+
-+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
-+
-+ if (!data)
-+ return;
-+
-+ init_timer(&data->timer);
-+ dad_init_timer(data, ifa);
-+ add_timer(&data->timer);
-+ in6_ifa_hold(ifa);
-+}
-+
-+static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->idev->dev;
-+ int addr_type = ipv6_addr_type(&ifa->addr);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->scope > RT_SCOPE_LINK ||
-+ addr_type == IPV6_ADDR_ANY ||
-+ (addr_type & IPV6_ADDR_LOOPBACK) ||
-+ (addr_type & IPV6_ADDR_LINKLOCAL))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET6;
-+ mpevent.addr.in6 = ifa->addr;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
-+ &ifa->addr, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv6-addr add/rem-events */
-+static int inet6_addr_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa6->idev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ if (ipv6_is_in_dad_state(ifa6))
-+ dad_setup_timer(ifa6);
-+ else
-+ addr6_event_handler(ifa6, event, net);
-+
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block inet6_addr_notifier = {
-+ .notifier_call = inet6_addr_event,
-+};
-+
-+#endif
-+
-+/* React on ifup/down-events */
-+static int netdev_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-+ struct in_device *in_dev;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct inet6_dev *in6_dev;
-+#endif
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ rcu_read_lock();
-+ in_dev = __in_dev_get_rtnl(dev);
-+
-+ if (in_dev) {
-+ for_ifa(in_dev) {
-+ mptcp_pm_inetaddr_event(NULL, event, ifa);
-+ } endfor_ifa(in_dev);
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ in6_dev = __in6_dev_get(dev);
-+
-+ if (in6_dev) {
-+ struct inet6_ifaddr *ifa6;
-+ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
-+ inet6_addr_event(NULL, event, ifa6);
-+ }
-+#endif
-+
-+ rcu_read_unlock();
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block mptcp_pm_netdev_notifier = {
-+ .notifier_call = netdev_event,
-+};
-+
-+static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
-+ else
-+ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
-+}
-+
-+static void full_mesh_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int i, index;
-+ union inet_addr saddr, daddr;
-+ sa_family_t family;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ /* Init local variables necessary for the rest */
-+ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
-+ saddr.ip = inet_sk(meta_sk)->inet_saddr;
-+ daddr.ip = inet_sk(meta_sk)->inet_daddr;
-+ family = AF_INET;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ saddr.in6 = inet6_sk(meta_sk)->saddr;
-+ daddr.in6 = meta_sk->sk_v6_daddr;
-+ family = AF_INET6;
-+#endif
-+ }
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, &saddr);
-+ if (index < 0)
-+ goto fallback;
-+
-+ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
-+ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* Look for the address among the local addresses */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET && saddr.ip == ifa_address)
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv6:
-+#endif
-+
-+ rcu_read_unlock();
-+
-+ if (family == AF_INET)
-+ fmp->announced_addrs_v4 |= (1 << index);
-+ else
-+ fmp->announced_addrs_v6 |= (1 << index);
-+
-+ for (i = fmp->add_addr; i && fmp->add_addr; i--)
-+ tcp_send_ack(mpcb->master_sk);
-+
-+ return;
-+
-+fallback:
-+ rcu_read_unlock();
-+ mptcp_fallback_default(mpcb);
-+ return;
-+}
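In `full_mesh_new_session()` above, `fmp->add_addr` is bumped once per local address except the one already carried by the initial subflow, and one ACK is sent per scheduled announcement. The counting logic, as a standalone sketch:

```c
#include <stdint.h>

/* Count the local addresses that still need an ADD_ADDR announcement:
 * every populated slot except the initial subflow's own index. */
int addresses_to_announce(uint8_t loc_bits, int initial_index)
{
	int i, count = 0;

	for (i = 0; i < 8; i++) {
		if (!(loc_bits & (1u << i)))
			continue;
		if (i == initial_index)
			continue;
		count++;
	}
	return count;
}
```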
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ return;
-+
-+ if (!work_pending(&fmp->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &fmp->subflow_work);
-+ }
-+}
-+
-+/* Called upon release_sock, if the socket was owned by the user during
-+ * a path-management event.
-+ */
-+static void full_mesh_release_sock(struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ struct sock *sk, *tmpsk;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+ int i;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* First, detect modifications or additions */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto removal;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+removal:
-+#endif
-+
-+ /* Now, detect address-removals */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ bool shall_remove = true;
-+
-+ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ } else {
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ }
-+
-+ if (shall_remove) {
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
-+ meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ }
-+ }
-+
-+ /* Just call it optimistically. It actually cannot do any harm */
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ rcu_read_unlock();
-+}
-+
-+static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int index, id = -1;
-+
-+ /* Handle the backup-flows */
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, addr);
-+
-+ if (index != -1) {
-+ if (family == AF_INET) {
-+ id = mptcp_local->locaddr4[index].loc4_id;
-+ *low_prio = mptcp_local->locaddr4[index].low_prio;
-+ } else {
-+ id = mptcp_local->locaddr6[index].loc6_id;
-+ *low_prio = mptcp_local->locaddr6[index].low_prio;
-+ }
-+ }
-+
-+
-+ rcu_read_unlock();
-+
-+ return id;
-+}
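The ID scheme visible in this file assigns `loc4_id = i + 1` (ID 0 is reserved for the initial subflow) and `loc6_id = i + MPTCP_MAX_ADDR`, so the IPv4 and IPv6 ID spaces never collide. A sketch of that convention (the value of `MPTCP_MAX_ADDR` is assumed here, not taken from the source):

```c
/* Assumed value; in the kernel this comes from the MPTCP headers. */
#define SKETCH_MPTCP_MAX_ADDR 8

/* IPv4 slot i gets address-ID i + 1 (ID 0 = initial subflow). */
int loc4_id_for_index(int i)
{
	return i + 1;
}

/* IPv6 slot i is offset into a disjoint ID range. */
int loc6_id_for_index(int i)
{
	return i + SKETCH_MPTCP_MAX_ADDR;
}
```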
-+
-+static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
-+ int remove_addr_len;
-+ u8 unannouncedv4 = 0, unannouncedv6 = 0;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ mpcb->addr_signal = 0;
-+
-+ if (likely(!fmp->add_addr))
-+ goto remove_addr;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* IPv4 */
-+ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
-+ if (unannouncedv4 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv4);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
-+ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
-+ opts->add_addr_v4 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v4 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
-+ }
-+
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+skip_ipv4:
-+ /* IPv6 */
-+ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
-+ if (unannouncedv6 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv6);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
-+ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
-+ opts->add_addr_v6 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v6 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
-+ }
-+
-+skip_ipv6:
-+ rcu_read_unlock();
-+
-+ if (!unannouncedv4 && !unannouncedv6 && skb)
-+ fmp->add_addr--;
-+
-+remove_addr:
-+ if (likely(!fmp->remove_addrs))
-+ goto exit;
-+
-+ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
-+ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
-+ goto exit;
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_REMOVE_ADDR;
-+ opts->remove_addrs = fmp->remove_addrs;
-+ *size += remove_addr_len;
-+ if (skb)
-+ fmp->remove_addrs = 0;
-+
-+exit:
-+ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
-+}
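`full_mesh_addr_signal()` only emits an ADD_ADDR or REMOVE_ADDR option when it still fits into the remaining TCP option space, i.e. `MAX_TCP_OPTION_SPACE - *size >= option_len`. A sketch of that guard (TCP options are limited to 40 bytes total):

```c
/* TCP header option space is capped at 40 bytes. */
#define SKETCH_MAX_TCP_OPTION_SPACE 40

/* 1 if an option of opt_len bytes still fits after size_used bytes
 * of options have already been claimed for this segment. */
int option_fits(unsigned int size_used, unsigned int opt_len)
{
	return SKETCH_MAX_TCP_OPTION_SPACE - size_used >= opt_len;
}
```

When the option does not fit, the announcement simply stays pending: `addr_signal` is re-armed at the end of the function as long as `add_addr` or `remove_addrs` is non-zero.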
-+
-+static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ mptcp_v4_rem_raddress(mpcb, rem_id);
-+ mptcp_v6_rem_raddress(mpcb, rem_id);
-+}
-+
-+/* Output /proc/net/mptcp_fullmesh */
-+static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
-+{
-+ const struct net *net = seq->private;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int i;
-+
-+ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
-+
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
-+ loc4->low_prio, &loc4->addr);
-+ }
-+
-+ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
-+ loc6->low_prio, &loc6->addr);
-+ }
-+ rcu_read_unlock_bh();
-+
-+ return 0;
-+}
-+
-+static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_fm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_fm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_fm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_fm_init_net(struct net *net)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns;
-+ int err = 0;
-+
-+ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
-+ if (!fm_ns)
-+ return -ENOBUFS;
-+
-+ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
-+ if (!mptcp_local) {
-+ err = -ENOBUFS;
-+ goto err_mptcp_local;
-+ }
-+
-+ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
-+ &mptcp_fm_seq_fops)) {
-+ err = -ENOMEM;
-+ goto err_seq_fops;
-+ }
-+
-+ mptcp_local->next_v4_index = 1;
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
-+ INIT_LIST_HEAD(&fm_ns->events);
-+ spin_lock_init(&fm_ns->local_lock);
-+ fm_ns->net = net;
-+ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
-+
-+ return 0;
-+err_seq_fops:
-+ kfree(mptcp_local);
-+err_mptcp_local:
-+ kfree(fm_ns);
-+ return err;
-+}
-+
-+static void mptcp_fm_exit_net(struct net *net)
-+{
-+ struct mptcp_addr_event *eventq, *tmp;
-+ struct mptcp_fm_ns *fm_ns;
-+ struct mptcp_loc_addr *mptcp_local;
-+
-+ fm_ns = fm_get_ns(net);
-+ cancel_delayed_work_sync(&fm_ns->address_worker);
-+
-+ rcu_read_lock_bh();
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ kfree(mptcp_local);
-+
-+ spin_lock(&fm_ns->local_lock);
-+ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ }
-+ spin_unlock(&fm_ns->local_lock);
-+
-+ rcu_read_unlock_bh();
-+
-+ remove_proc_entry("mptcp_fullmesh", net->proc_net);
-+
-+ kfree(fm_ns);
-+}
-+
-+static struct pernet_operations full_mesh_net_ops = {
-+ .init = mptcp_fm_init_net,
-+ .exit = mptcp_fm_exit_net,
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly = {
-+ .new_session = full_mesh_new_session,
-+ .release_sock = full_mesh_release_sock,
-+ .fully_established = full_mesh_create_subflows,
-+ .new_remote_address = full_mesh_create_subflows,
-+ .get_local_id = full_mesh_get_local_id,
-+ .addr_signal = full_mesh_addr_signal,
-+ .add_raddr = full_mesh_add_raddr,
-+ .rem_raddr = full_mesh_rem_raddr,
-+ .name = "fullmesh",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init full_mesh_register(void)
-+{
-+ int ret;
-+
-+ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
-+
-+ ret = register_pernet_subsys(&full_mesh_net_ops);
-+ if (ret)
-+ goto out;
-+
-+ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ if (ret)
-+ goto err_reg_inetaddr;
-+ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ if (ret)
-+ goto err_reg_netdev;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ ret = register_inet6addr_notifier(&inet6_addr_notifier);
-+ if (ret)
-+ goto err_reg_inet6addr;
-+#endif
-+
-+ ret = mptcp_register_path_manager(&full_mesh);
-+ if (ret)
-+ goto err_reg_pm;
-+
-+out:
-+ return ret;
-+
-+
-+err_reg_pm:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+err_reg_inet6addr:
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+err_reg_netdev:
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+err_reg_inetaddr:
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ goto out;
-+}
-+
-+static void full_mesh_unregister(void)
-+{
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ mptcp_unregister_path_manager(&full_mesh);
-+}
-+
-+module_init(full_mesh_register);
-+module_exit(full_mesh_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("Full-Mesh MPTCP");
-+MODULE_VERSION("0.88");
-diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
-new file mode 100644
-index 000000000000..43704ccb639e
---- /dev/null
-+++ b/net/mptcp/mptcp_input.c
-@@ -0,0 +1,2405 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <asm/unaligned.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+
-+#include <linux/kconfig.h>
-+
-+/* is seq1 < seq2 ? */
-+static inline bool before64(const u64 seq1, const u64 seq2)
-+{
-+ return (s64)(seq1 - seq2) < 0;
-+}
-+
-+/* is seq1 > seq2 ? */
-+#define after64(seq1, seq2) before64(seq2, seq1)
-+
-+static inline void mptcp_become_fully_estab(struct sock *sk)
-+{
-+ tcp_sk(sk)->mptcp->fully_established = 1;
-+
-+ if (is_master_tp(tcp_sk(sk)) &&
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established)
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
-+}
-+
-+/* Similar to tcp_tso_acked without any memory accounting */
-+static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 packets_acked, len;
-+
-+ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
-+
-+ packets_acked = tcp_skb_pcount(skb);
-+
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ return 0;
-+
-+ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+ skb->truesize -= len;
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
-+ packets_acked -= tcp_skb_pcount(skb);
-+
-+ if (packets_acked) {
-+ BUG_ON(tcp_skb_pcount(skb) == 0);
-+ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-+ }
-+
-+ return packets_acked;
-+}
-+
-+/**
-+ * Cleans the meta-socket retransmission queue and the reinject-queue.
-+ * @sk must be the metasocket.
-+ */
-+static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
-+{
-+ struct sk_buff *skb, *tmp;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ bool acked = false;
-+ u32 acked_pcount;
-+
-+ while ((skb = tcp_write_queue_head(meta_sk)) &&
-+ skb != tcp_send_head(meta_sk)) {
-+ bool fully_acked = true;
-+
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ acked_pcount = tcp_tso_acked(meta_sk, skb);
-+ if (!acked_pcount)
-+ break;
-+
-+ fully_acked = false;
-+ } else {
-+ acked_pcount = tcp_skb_pcount(skb);
-+ }
-+
-+ acked = true;
-+ meta_tp->packets_out -= acked_pcount;
-+ meta_tp->retrans_stamp = 0;
-+
-+ if (!fully_acked)
-+ break;
-+
-+ tcp_unlink_write_queue(skb, meta_sk);
-+
-+ if (mptcp_is_data_fin(skb)) {
-+ struct sock *sk_it;
-+
-+ /* DATA_FIN has been acknowledged - now we can close
-+ * the subflows
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ unsigned long delay = 0;
-+
-+ /* If we are the passive closer, don't trigger
-+ * subflow-fin until the subflow has been finned
-+ * by the peer - thus we add a delay.
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+ sk_wmem_free_skb(meta_sk, skb);
-+ }
-+ /* Remove acknowledged data from the reinject queue */
-+ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ mptcp_tso_acked_reinject(meta_sk, skb);
-+ break;
-+ }
-+
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ }
-+
-+ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
-+ meta_tp->snd_up = meta_tp->snd_una;
-+
-+ if (acked) {
-+ tcp_rearm_rto(meta_sk);
-+ /* Normally this is done in tcp_try_undo_loss - but MPTCP
-+ * does not call this function.
-+ */
-+ inet_csk(meta_sk)->icsk_retransmits = 0;
-+ }
-+}
-+
-+/* Inspired by tcp_rcv_state_process */
-+static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
-+ const struct sk_buff *skb, u32 data_seq,
-+ u16 data_len)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ const struct tcphdr *th = tcp_hdr(skb);
-+
-+ /* State-machine handling if FIN has been enqueued and he has
-+ * been acked (snd_una == write_seq) - it's important that this
-+ * here is after sk_wmem_free_skb because otherwise
-+ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
-+ */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1: {
-+ struct dst_entry *dst;
-+ int tmo;
-+
-+ if (meta_tp->snd_una != meta_tp->write_seq)
-+ break;
-+
-+ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
-+ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
-+
-+ dst = __sk_dst_get(sk);
-+ if (dst)
-+ dst_confirm(dst);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ /* Wake up lingering close() */
-+ meta_sk->sk_state_change(meta_sk);
-+ break;
-+ }
-+
-+ if (meta_tp->linger2 < 0 ||
-+ (data_len &&
-+ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
-+ meta_tp->rcv_nxt))) {
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_done(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ return 1;
-+ }
-+
-+ tmo = tcp_fin_time(meta_sk);
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
-+ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
-+ /* Bad case. We could lose such FIN otherwise.
-+ * It is not a big problem, but it looks confusing
-+ * and not so rare event. We still can lose it now,
-+ * if it spins in bh_lock_sock(), but it is really
-+ * marginal case.
-+ */
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
-+ }
-+ break;
-+ }
-+ case TCP_CLOSING:
-+ case TCP_LAST_ACK:
-+ if (meta_tp->snd_una == meta_tp->write_seq) {
-+ tcp_done(meta_sk);
-+ return 1;
-+ }
-+ break;
-+ }
-+
-+ /* step 7: process the segment text */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1:
-+ case TCP_FIN_WAIT2:
-+ /* RFC 793 says to queue data in these states,
-+ * RFC 1122 says we MUST send a reset.
-+ * BSD 4.4 also does reset.
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp_is_data_fin2(skb, tp)) {
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_reset(meta_sk);
-+ return 1;
-+ }
-+ }
-+ break;
-+ }
-+
-+ return 0;
-+}
-+
-+/**
-+ * @return:
-+ * i) 1: Everything's fine.
-+ * ii) -1: A reset has been sent on the subflow - csum-failure
-+ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
-+ * Last packet should not be destroyed by the caller because it has
-+ * been done here.
-+ */
-+static int mptcp_verif_dss_csum(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1, *last = NULL;
-+ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
-+ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
-+ int iter = 0;
-+
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
-+ unsigned int csum_len;
-+
-+ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
-+ /* Mapping ends in the middle of the packet -
-+ * csum only these bytes
-+ */
-+ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
-+ else
-+ csum_len = tmp->len;
-+
-+ offset = 0;
-+ if (overflowed) {
-+ char first_word[4];
-+ first_word[0] = 0;
-+ first_word[1] = 0;
-+ first_word[2] = 0;
-+ first_word[3] = *(tmp->data);
-+ csum_tcp = csum_partial(first_word, 4, csum_tcp);
-+ offset = 1;
-+ csum_len--;
-+ overflowed = 0;
-+ }
-+
-+ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
-+
-+ /* Was it on an odd-length? Then we have to merge the next byte
-+ * correctly (see above)
-+ */
-+ if (csum_len != (csum_len & (~1)))
-+ overflowed = 1;
-+
-+ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
-+ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
-+
-+ /* If a 64-bit dss is present, we increase the offset
-+ * by 4 bytes, as the high-order 64-bits will be added
-+ * in the final csum_partial-call.
-+ */
-+ u32 offset = skb_transport_offset(tmp) +
-+ TCP_SKB_CB(tmp)->dss_off;
-+ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
-+ offset += 4;
-+
-+ csum_tcp = skb_checksum(tmp, offset,
-+ MPTCP_SUB_LEN_SEQ_CSUM,
-+ csum_tcp);
-+
-+ csum_tcp = csum_partial(&data_seq,
-+ sizeof(data_seq), csum_tcp);
-+
-+ dss_csum_added = 1; /* Just do it once */
-+ }
-+ last = tmp;
-+ iter++;
-+
-+ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
-+ !before(TCP_SKB_CB(tmp1)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ /* Now, checksum must be 0 */
-+ if (unlikely(csum_fold(csum_tcp))) {
-+ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
-+ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
-+ dss_csum_added, overflowed, iter);
-+
-+ tp->mptcp->send_mp_fail = 1;
-+
-+ /* map_data_seq is the data-seq number of the
-+ * mapping we are currently checking
-+ */
-+ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
-+
-+ if (tp->mpcb->cnt_subflows > 1) {
-+ mptcp_send_reset(sk);
-+ ans = -1;
-+ } else {
-+ tp->mpcb->send_infinite_mapping = 1;
-+
-+ /* Need to purge the rcv-queue as it's no more valid */
-+ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
-+ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
-+ kfree_skb(tmp);
-+ }
-+
-+ ans = 0;
-+ }
-+ }
-+
-+ return ans;
-+}
-+
-+static inline void mptcp_prepare_skb(struct sk_buff *skb,
-+ const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 inc = 0;
-+
-+ /* If skb is the end of this mapping (end is always at mapping-boundary
-+ * thanks to the splitting/trimming), then we need to increase
-+ * data-end-seq by 1 if this here is a data-fin.
-+ *
-+ * We need to do -1 because end_seq includes the subflow-FIN.
-+ */
-+ if (tp->mptcp->map_data_fin &&
-+ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
-+ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ inc = 1;
-+
-+ /* We manually set the fin-flag if it is a data-fin. For easy
-+ * processing in tcp_recvmsg.
-+ */
-+ tcp_hdr(skb)->fin = 1;
-+ } else {
-+ /* We may have a subflow-fin with data but without data-fin */
-+ tcp_hdr(skb)->fin = 0;
-+ }
-+
-+ /* Adapt data-seq's to the packet itself. We kinda transform the
-+ * dss-mapping to a per-packet granularity. This is necessary to
-+ * correctly handle overlapping mappings coming from different
-+ * subflows. Otherwise it would be a complete mess.
-+ */
-+ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
-+ tcb->end_seq = tcb->seq + skb->len + inc;
-+}
-+
-+/**
-+ * @return: 1 if the segment has been eaten and can be suppressed,
-+ * otherwise 0.
-+ */
-+static inline int mptcp_direct_copy(const struct sk_buff *skb,
-+ struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
-+ int eaten = 0;
-+
-+ __set_current_state(TASK_RUNNING);
-+
-+ local_bh_enable();
-+ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
-+ meta_tp->ucopy.len -= chunk;
-+ meta_tp->copied_seq += chunk;
-+ eaten = (chunk == skb->len);
-+ tcp_rcv_space_adjust(meta_sk);
-+ }
-+ local_bh_disable();
-+ return eaten;
-+}
-+
-+static inline void mptcp_reset_mapping(struct tcp_sock *tp)
-+{
-+ tp->mptcp->map_data_len = 0;
-+ tp->mptcp->map_data_seq = 0;
-+ tp->mptcp->map_subseq = 0;
-+ tp->mptcp->map_data_fin = 0;
-+ tp->mptcp->mapping_present = 0;
-+}
-+
-+/* The DSS-mapping received on the sk only covers the second half of the skb
-+ * (cut at seq). We trim the head from the skb.
-+ * Data will be freed upon kfree().
-+ *
-+ * Inspired by tcp_trim_head().
-+ */
-+static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ int len = seq - TCP_SKB_CB(skb)->seq;
-+ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
-+
-+ if (len < skb_headlen(skb))
-+ __skb_pull(skb, len);
-+ else
-+ __pskb_trim_head(skb, len - skb_headlen(skb));
-+
-+ TCP_SKB_CB(skb)->seq = new_seq;
-+
-+ skb->truesize -= len;
-+ atomic_sub(len, &sk->sk_rmem_alloc);
-+ sk_mem_uncharge(sk, len);
-+}
-+
-+/* The DSS-mapping received on the sk only covers the first half of the skb
-+ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
-+ * as further packets may resolve the mapping of the second half of data.
-+ *
-+ * Inspired by tcp_fragment().
-+ */
-+static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ struct sk_buff *buff;
-+ int nsize;
-+ int nlen, len;
-+
-+ len = seq - TCP_SKB_CB(skb)->seq;
-+ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
-+ if (nsize < 0)
-+ nsize = 0;
-+
-+ /* Get a new skb... force flag on. */
-+ buff = alloc_skb(nsize, GFP_ATOMIC);
-+ if (buff == NULL)
-+ return -ENOMEM;
-+
-+ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
-+ skb_reset_transport_header(buff);
-+
-+ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
-+ tcp_hdr(skb)->fin = 0;
-+
-+ /* We absolutly need to call skb_set_owner_r before refreshing the
-+ * truesize of buff, otherwise the moved data will account twice.
-+ */
-+ skb_set_owner_r(buff, sk);
-+ nlen = skb->len - len - nsize;
-+ buff->truesize += nlen;
-+ skb->truesize -= nlen;
-+
-+ /* Correct the sequence numbers. */
-+ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
-+ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
-+ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
-+
-+ skb_split(skb, buff, len);
-+
-+ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
-+ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
-+ !tp->mpcb->infinite_mapping_rcv) {
-+ /* Remove a pure subflow-fin from the queue and increase
-+ * copied_seq.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* If we are not yet fully established and do not know the mapping for
-+ * this segment, this path has to fallback to infinite or be torn down.
-+ */
-+ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
-+ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
-+ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
-+ __func__, tp->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, __builtin_return_address(0),
-+ TCP_SKB_CB(skb)->seq);
-+
-+ if (!is_master_tp(tp)) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ /* We do a seamless fallback and should not send a inf.mapping. */
-+ tp->mpcb->send_infinite_mapping = 0;
-+ tp->mptcp->fully_established = 1;
-+ }
-+
-+ /* Receiver-side becomes fully established when a whole rcv-window has
-+ * been received without the need to fallback due to the previous
-+ * condition.
-+ */
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->init_rcv_wnd -= skb->len;
-+ if (tp->mptcp->init_rcv_wnd < 0)
-+ mptcp_become_fully_estab(sk);
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 *ptr;
-+ u32 data_seq, sub_seq, data_len, tcp_end_seq;
-+
-+ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
-+ * in-order at the data-level. Thus data-seq-numbers can be inferred
-+ * from what is expected at the data-level.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
-+ tp->mptcp->map_subseq = tcb->seq;
-+ tp->mptcp->map_data_len = skb->len;
-+ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
-+ tp->mptcp->mapping_present = 1;
-+ return 0;
-+ }
-+
-+ /* No mapping here? Exit - it is either already set or still on its way */
-+ if (!mptcp_is_data_seq(skb)) {
-+ /* Too many packets without a mapping - this subflow is broken */
-+ if (!tp->mptcp->mapping_present &&
-+ tp->rcv_nxt - tp->copied_seq > 65536) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ return 0;
-+ }
-+
-+ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
-+ ptr++;
-+ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
-+ ptr++;
-+ data_len = get_unaligned_be16(ptr);
-+
-+ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
-+ * The draft sets it to 0, but we really would like to have the
-+ * real value, to have an easy handling afterwards here in this
-+ * function.
-+ */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ sub_seq = TCP_SKB_CB(skb)->seq;
-+
-+ /* If there is already a mapping - we check if it maps with the current
-+ * one. If not - we reset.
-+ */
-+ if (tp->mptcp->mapping_present &&
-+ (data_seq != (u32)tp->mptcp->map_data_seq ||
-+ sub_seq != tp->mptcp->map_subseq ||
-+ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
-+ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
-+ /* Mapping in packet is different from what we want */
-+ pr_err("%s Mappings do not match!\n", __func__);
-+ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
-+ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
-+ sub_seq, tp->mptcp->map_subseq, data_len,
-+ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
-+ tp->mptcp->map_data_fin);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* If the previous check was good, the current mapping is valid and we exit. */
-+ if (tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* Mapping not yet set on this subflow - we set it here! */
-+
-+ if (!data_len) {
-+ mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+ /* We need to repeat mp_fail's until the sender felt
-+ * back to infinite-mapping - here we stop repeating it.
-+ */
-+ tp->mptcp->send_mp_fail = 0;
-+
-+ /* We have to fixup data_len - it must be the same as skb->len */
-+ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
-+ sub_seq = tcb->seq;
-+
-+ /* TODO kill all other subflows than this one */
-+ /* data_seq and so on are set correctly */
-+
-+ /* At this point, the meta-ofo-queue has to be emptied,
-+ * as the following data is guaranteed to be in-order at
-+ * the data and subflow-level
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ }
-+
-+ /* We are sending mp-fail's and thus are in fallback mode.
-+ * Ignore packets which do not announce the fallback and still
-+ * want to provide a mapping.
-+ */
-+ if (tp->mptcp->send_mp_fail) {
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* FIN increased the mapping-length by 1 */
-+ if (mptcp_is_data_fin(skb))
-+ data_len--;
-+
-+ /* Subflow-sequences of packet must be
-+ * (at least partially) be part of the DSS-mapping's
-+ * subflow-sequence-space.
-+ *
-+ * Basically the mapping is not valid, if either of the
-+ * following conditions is true:
-+ *
-+ * 1. It's not a data_fin and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * The previous two can be merged into:
-+ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
-+ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
-+ *
-+ * 3. It's a data_fin and skb->len == 0 and
-+ * MPTCP-sub_seq > TCP-end_seq
-+ *
-+ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
-+ *
-+ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
-+ */
-+
-+ /* subflow-fin is not part of the mapping - ignore it here ! */
-+ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
-+ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
-+ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
-+ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
-+ before(sub_seq, tp->copied_seq)) {
-+ /* Subflow-sequences of packet is different from what is in the
-+ * packet's dss-mapping. The peer is misbehaving - reset
-+ */
-+ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
-+ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u"
-+ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
-+ skb->len, data_len, tp->copied_seq);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* Does the DSS had 64-bit seqnum's ? */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
-+ /* Wrapped around? */
-+ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
-+ } else {
-+ /* Else, access the default high-order bits */
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
-+ }
-+ } else {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
-+
-+ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
-+ /* We make sure that the data_seq is invalid.
-+ * It will be dropped later.
-+ */
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ }
-+ }
-+
-+ tp->mptcp->map_data_len = data_len;
-+ tp->mptcp->map_subseq = sub_seq;
-+ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
-+ tp->mptcp->mapping_present = 1;
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_sequence(...) */
-+static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
-+ u64 data_seq, u64 end_data_seq)
-+{
-+ const struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u64 rcv_wup64;
-+
-+ /* Wrap-around? */
-+ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
-+ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
-+ meta_tp->rcv_wup;
-+ } else {
-+ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_wup);
-+ }
-+
-+ return !before64(end_data_seq, rcv_wup64) &&
-+ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1;
-+ u32 tcp_end_seq;
-+
-+ if (!tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* either, the new skb gave us the mapping and the first segment
-+ * in the sub-rcv-queue has to be trimmed ...
-+ */
-+ tmp = skb_peek(&sk->sk_receive_queue);
-+ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
-+ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
-+ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
-+
-+ /* ... or the new skb (tail) has to be split at the end. */
-+ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
-+ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
-+ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
-+ /* TODO : maybe handle this here better.
-+ * We now just force meta-retransmission.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+ }
-+
-+ /* Now, remove old sk_buff's from the receive-queue.
-+ * This may happen if the mapping has been lost for these segments and
-+ * the next mapping has already been received.
-+ */
-+ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
-+ break;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+
-+ /* Impossible that we could free skb here, because his
-+ * mapping is known to be valid from previous checks
-+ */
-+ __kfree_skb(tmp1);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this mapping has been put in the meta-receive-queue
-+ * -2 this mapping has been eaten by the application
-+ */
-+static int mptcp_queue_skb(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sk_buff *tmp, *tmp1;
-+ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
-+ bool data_queued = false;
-+
-+ /* Have we not yet received the full mapping? */
-+ if (!tp->mptcp->mapping_present ||
-+ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ return 0;
-+
-+ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
-+ * OR
-+ * This mapping is out of window
-+ */
-+ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
-+ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
-+ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ mptcp_reset_mapping(tp);
-+
-+ return -1;
-+ }
-+
-+ /* Record it, because we want to send our data_fin on the same path */
-+ if (tp->mptcp->map_data_fin) {
-+ mpcb->dfin_path_index = tp->mptcp->path_index;
-+ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
-+ }
-+
-+ /* Verify the checksum */
-+ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
-+ int ret = mptcp_verif_dss_csum(sk);
-+
-+ if (ret <= 0) {
-+ mptcp_reset_mapping(tp);
-+ return 1;
-+ }
-+ }
-+
-+ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
-+ /* Seg's have to go to the meta-ofo-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true later.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
-+ else
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ tcp_enter_quickack_mode(sk);
-+ } else {
-+ /* Ready for the meta-rcv-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ int eaten = 0;
-+ const bool copied_early = false;
-+ bool fragstolen = false;
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ /* This segment has already been received */
-+ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
-+ __kfree_skb(tmp1);
-+ goto next;
-+ }
-+
-+#ifdef CONFIG_NET_DMA
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ tmp1->len <= meta_tp->ucopy.len &&
-+ sock_owned_by_user(meta_sk) &&
-+ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
-+ copied_early = true;
-+ eaten = 1;
-+ }
-+#endif
-+
-+ /* Is direct copy possible ? */
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
-+ !copied_early)
-+ eaten = mptcp_direct_copy(tmp1, meta_sk);
-+
-+ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ eaten = 1;
-+
-+ if (!eaten)
-+ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
-+
-+ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
-+#endif
-+
-+ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
-+ mptcp_fin(meta_sk);
-+
-+ /* Check if this fills a gap in the ofo queue */
-+ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
-+ mptcp_ofo_queue(meta_sk);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
-+ tmp1);
-+ else
-+#endif
-+ if (eaten)
-+ kfree_skb_partial(tmp1, fragstolen);
-+
-+ data_queued = true;
-+next:
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ }
-+
-+ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
-+ mptcp_reset_mapping(tp);
-+
-+ return data_queued ? -1 : -2;
-+}
-+
-+void mptcp_data_ready(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct sk_buff *skb, *tmp;
-+ int queued = 0;
-+
-+ /* restart before the check, because mptcp_fin might have changed the
-+ * state.
-+ */
-+restart:
-+ /* If the meta cannot receive data, there is no point in pushing data.
-+ * If we are in time-wait, we may still be waiting for the final FIN.
-+ * So, we should proceed with the processing.
-+ */
-+ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
-+ skb_queue_purge(&sk->sk_receive_queue);
-+ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
-+ goto exit;
-+ }
-+
-+ /* Iterate over all segments, detect their mapping (if we don't have
-+ * one yet), validate them and push everything one level higher.
-+ */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
-+ int ret;
-+ /* Pre-validation - e.g., early fallback */
-+ ret = mptcp_prevalidate_skb(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Set the current mapping */
-+ ret = mptcp_detect_mapping(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Validation */
-+ if (mptcp_validate_mapping(sk, skb) < 0)
-+ goto restart;
-+
-+ /* Push a level higher */
-+ ret = mptcp_queue_skb(sk);
-+ if (ret < 0) {
-+ if (ret == -1)
-+ queued = ret;
-+ goto restart;
-+ } else if (ret == 0) {
-+ continue;
-+ } else { /* ret == 1 */
-+ break;
-+ }
-+ }
-+
-+exit:
-+ if (tcp_sk(sk)->close_it) {
-+ tcp_send_ack(sk);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
-+ }
-+
-+ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_data_ready(meta_sk);
-+}
-+
-+
-+int mptcp_check_req(struct sk_buff *skb, struct net *net)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct sock *meta_sk = NULL;
-+
-+ /* MPTCP structures not initialized */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP))
-+ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr, net);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else /* IPv6 */
-+ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, net);
-+#endif /* CONFIG_IPV6 */
-+
-+ if (!meta_sk)
-+ return 0;
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_search_req */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
-+ return 1;
-+}
-+
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether JOIN is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return NULL;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return NULL;
-+ if (opsize > length)
-+ return NULL; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
-+ return (struct mp_join *)(ptr - 2);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
-+{
-+ const struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+ struct mp_join *join_opt = mptcp_find_join(skb);
-+ if (!join_opt)
-+ return 0;
-+
-+ /* MPTCP structures were not initialized, so return error */
-+ if (mptcp_init_failed)
-+ return -1;
-+
-+ token = join_opt->u.syn.token;
-+ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ mpcb = tcp_sk(meta_sk)->mpcb;
-+ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
-+ /* We are in fallback-mode on the reception-side -
-+ * no new subflows!
-+ */
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ /* Coming from time-wait-sock processing in tcp_v4_rcv.
-+ * We have to deschedule it before continuing, because otherwise
-+ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
-+ */
-+ if (tw) {
-+ inet_twsk_deschedule(tw, &tcp_death_row);
-+ inet_twsk_put(tw);
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 1;
-+}
-+
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net)
-+{
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+
-+ token = mopt->mptcp_rem_token;
-+ meta_sk = mptcp_hash_find(net, token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock(meta_sk);
-+
-+ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
-+ * call tcp_vX_send_reset, because we hold already two socket-locks.
-+ * (the listener and the meta from above)
-+ *
-+ * And the send-reset will try to take yet another one (ip_send_reply).
-+ * Thus, we propagate the reset up to tcp_rcv_state_process.
-+ */
-+ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
-+ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
-+ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ else
-+ /* Must make sure that upper layers won't free the
-+ * skb if it is added to the backlog-queue.
-+ */
-+ skb_get(skb);
-+ } else {
-+ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
-+ * the skb will finally be freed by tcp_v4_do_rcv (where we are
-+ * coming from)
-+ */
-+ skb_get(skb);
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ }
-+
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 0;
-+}
-+
-+/**
-+ * Equivalent of tcp_fin() for MPTCP
-+ * Can be called only when the FIN is validly part
-+ * of the data seqnum space. Not before when we get holes.
-+ */
-+void mptcp_fin(struct sock *meta_sk)
-+{
-+ struct sock *sk = NULL, *sk_it;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
-+ sk = sk_it;
-+ break;
-+ }
-+ }
-+
-+ if (!sk || sk->sk_state == TCP_CLOSE)
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ inet_csk_schedule_ack(sk);
-+
-+ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
-+ sock_set_flag(meta_sk, SOCK_DONE);
-+
-+ switch (meta_sk->sk_state) {
-+ case TCP_SYN_RECV:
-+ case TCP_ESTABLISHED:
-+ /* Move to CLOSE_WAIT */
-+ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
-+ inet_csk(sk)->icsk_ack.pingpong = 1;
-+ break;
-+
-+ case TCP_CLOSE_WAIT:
-+ case TCP_CLOSING:
-+ /* Received a retransmission of the FIN, do
-+ * nothing.
-+ */
-+ break;
-+ case TCP_LAST_ACK:
-+ /* RFC793: Remain in the LAST-ACK state. */
-+ break;
-+
-+ case TCP_FIN_WAIT1:
-+ /* This case occurs when a simultaneous close
-+ * happens, we must ack the received FIN and
-+ * enter the CLOSING state.
-+ */
-+ tcp_send_ack(sk);
-+ tcp_set_state(meta_sk, TCP_CLOSING);
-+ break;
-+ case TCP_FIN_WAIT2:
-+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
-+ tcp_send_ack(sk);
-+ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
-+ break;
-+ default:
-+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-+ * cases we should never reach this piece of code.
-+ */
-+ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
-+ meta_sk->sk_state);
-+ break;
-+ }
-+
-+ /* It _is_ possible, that we have something out-of-order _after_ FIN.
-+ * Probably, we should reset in this case. For now drop them.
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ sk_mem_reclaim(meta_sk);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+
-+ /* Do not send POLL_HUP for half duplex close. */
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
-+ meta_sk->sk_state == TCP_CLOSE)
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
-+ else
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
-+ }
-+
-+ return;
-+}
-+
-+static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ if (!meta_tp->packets_out)
-+ return;
-+
-+ tcp_for_write_queue(skb, meta_sk) {
-+ if (skb == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (mptcp_retransmit_skb(meta_sk, skb))
-+ return;
-+
-+ if (skb == tcp_write_queue_head(meta_sk))
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ inet_csk(meta_sk)->icsk_rto,
-+ TCP_RTO_MAX);
-+ }
-+}
-+
-+/* Handle the DATA_ACK */
-+static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 prior_snd_una = meta_tp->snd_una;
-+ int prior_packets;
-+ u32 nwin, data_ack, data_seq;
-+ u16 data_len = 0;
-+
-+ /* A valid packet came in - subflow is operational again */
-+ tp->pf = 0;
-+
-+ /* Even if there is no data-ack, we stop retransmitting.
-+ * Except if this is a SYN/ACK. Then it is just a retransmission
-+ */
-+ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ }
-+
-+ /* If we are in infinite mapping mode, rx_opt.data_ack has been
-+ * set by mptcp_clean_rtx_infinite.
-+ */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
-+ goto exit;
-+
-+ data_ack = tp->mptcp->rx_opt.data_ack;
-+
-+ if (unlikely(!tp->mptcp->fully_established) &&
-+ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
-+ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
-+ * includes a data-ack, we are fully established
-+ */
-+ mptcp_become_fully_estab(sk);
-+
-+ /* Get the data_seq */
-+ if (mptcp_is_data_seq(skb)) {
-+ data_seq = tp->mptcp->rx_opt.data_seq;
-+ data_len = tp->mptcp->rx_opt.data_len;
-+ } else {
-+ data_seq = meta_tp->snd_wl1;
-+ }
-+
-+ /* If the ack is older than previous acks
-+ * then we can probably ignore it.
-+ */
-+ if (before(data_ack, prior_snd_una))
-+ goto exit;
-+
-+ /* If the ack includes data we haven't sent yet, discard
-+ * this segment (RFC793 Section 3.9).
-+ */
-+ if (after(data_ack, meta_tp->snd_nxt))
-+ goto exit;
-+
-+ /*** Now, update the window - inspired by tcp_ack_update_window ***/
-+ nwin = ntohs(tcp_hdr(skb)->window);
-+
-+ if (likely(!tcp_hdr(skb)->syn))
-+ nwin <<= tp->rx_opt.snd_wscale;
-+
-+ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
-+ tcp_update_wl(meta_tp, data_seq);
-+
-+ /* Draft v09, Section 3.3.5:
-+ * [...] It should only update its local receive window values
-+ * when the largest sequence number allowed (i.e. DATA_ACK +
-+ * receive window) increases. [...]
-+ */
-+ if (meta_tp->snd_wnd != nwin &&
-+ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
-+ meta_tp->snd_wnd = nwin;
-+
-+ if (nwin > meta_tp->max_window)
-+ meta_tp->max_window = nwin;
-+ }
-+ }
-+ /*** Done, update the window ***/
-+
-+ /* We passed data and got it acked, remove any soft error
-+ * log. Something worked...
-+ */
-+ sk->sk_err_soft = 0;
-+ inet_csk(meta_sk)->icsk_probes_out = 0;
-+ meta_tp->rcv_tstamp = tcp_time_stamp;
-+ prior_packets = meta_tp->packets_out;
-+ if (!prior_packets)
-+ goto no_queue;
-+
-+ meta_tp->snd_una = data_ack;
-+
-+ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
-+
-+ /* We are in loss-state, and something got acked, retransmit the whole
-+ * queue now!
-+ */
-+ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
-+ after(data_ack, prior_snd_una)) {
-+ mptcp_xmit_retransmit_queue(meta_sk);
-+ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
-+ }
-+
-+ /* Simplified version of tcp_new_space, because the snd-buffer
-+ * is handled by all the subflows.
-+ */
-+ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
-+ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
-+ if (meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+
-+ if (meta_sk->sk_state != TCP_ESTABLISHED &&
-+ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
-+ return;
-+
-+exit:
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+
-+no_queue:
-+ if (tcp_send_head(meta_sk))
-+ tcp_ack_probe(meta_sk);
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+}
-+
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
-+
-+ if (!tp->mpcb->infinite_mapping_snd)
-+ return;
-+
-+ /* The difference between both write_seq's represents the offset between
-+ * data-sequence and subflow-sequence. As we are infinite, this must
-+ * match.
-+ *
-+ * Thus, from this difference we can infer the meta snd_una.
-+ */
-+ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
-+ tp->snd_una;
-+
-+ mptcp_data_ack(sk, skb);
-+}
-+
-+/**** static functions used by mptcp_parse_options */
-+
-+static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
-+ mptcp_reinject_data(sk_it, 0);
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
-+ GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+}
-+
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
-+
-+ /* If the socket is mp-capable we would have a mopt. */
-+ if (!mopt)
-+ return;
-+
-+ switch (mp_opt->sub) {
-+ case MPTCP_SUB_CAPABLE:
-+ {
-+ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
-+ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
-+ mptcp_debug("%s: mp_capable: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (!sysctl_mptcp_enabled)
-+ break;
-+
-+ /* We only support MPTCP version 0 */
-+ if (mpcapable->ver != 0)
-+ break;
-+
-+ /* MPTCP-RFC 6824:
-+ * "If receiving a message with the 'B' flag set to 1, and this
-+ * is not understood, then this SYN MUST be silently ignored;
-+ */
-+ if (mpcapable->b) {
-+ mopt->drop_me = 1;
-+ break;
-+ }
-+
-+ /* MPTCP-RFC 6824:
-+ * "An implementation that only supports this method MUST set
-+ * bit "H" to 1, and bits "C" through "G" to 0."
-+ */
-+ if (!mpcapable->h)
-+ break;
-+
-+ mopt->saw_mpc = 1;
-+ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
-+
-+ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
-+ mopt->mptcp_key = mpcapable->sender_key;
-+
-+ break;
-+ }
-+ case MPTCP_SUB_JOIN:
-+ {
-+ const struct mp_join *mpjoin = (struct mp_join *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
-+ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
-+ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
-+ mptcp_debug("%s: mp_join: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* saw_mpc must be set, because in tcp_check_req we assume that
-+ * it is set to support falling back to reg. TCP if a rexmitted
-+ * SYN has no MP_CAPABLE or MP_JOIN
-+ */
-+ switch (opsize) {
-+ case MPTCP_SUB_LEN_JOIN_SYN:
-+ mopt->is_mp_join = 1;
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_rem_token = mpjoin->u.syn.token;
-+ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_SYNACK:
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
-+ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_ACK:
-+ mopt->saw_mpc = 1;
-+ mopt->join_ack = 1;
-+ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
-+ break;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_DSS:
-+ {
-+ const struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+
-+ /* We check opsize for the csum and non-csum case. We do this,
-+ * because the draft says that the csum SHOULD be ignored if
-+ * it has not been negotiated in the MP_CAPABLE but still is
-+ * present in the data.
-+ *
-+ * It will get ignored later in mptcp_queue_skb.
-+ */
-+ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
-+ opsize != mptcp_sub_len_dss(mdss, 1)) {
-+ mptcp_debug("%s: mp_dss: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ ptr += 4;
-+
-+ if (mdss->A) {
-+ tcb->mptcp_flags |= MPTCPHDR_ACK;
-+
-+ if (mdss->a) {
-+ mopt->data_ack = (u32) get_unaligned_be64(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK_64;
-+ } else {
-+ mopt->data_ack = get_unaligned_be32(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK;
-+ }
-+ }
-+
-+ tcb->dss_off = (ptr - skb_transport_header(skb));
-+
-+ if (mdss->M) {
-+ if (mdss->m) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
-+ mopt->data_seq = (u32) data_seq64;
-+
-+ ptr += 12; /* 64-bit dseq + subseq */
-+ } else {
-+ mopt->data_seq = get_unaligned_be32(ptr);
-+ ptr += 8; /* 32-bit dseq + subseq */
-+ }
-+ mopt->data_len = get_unaligned_be16(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ /* Is a check-sum present? */
-+ if (opsize == mptcp_sub_len_dss(mdss, 1))
-+ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
-+
-+ /* DATA_FIN only possible with DSS-mapping */
-+ if (mdss->F)
-+ tcb->mptcp_flags |= MPTCPHDR_FIN;
-+ }
-+
-+ break;
-+ }
-+ case MPTCP_SUB_ADD_ADDR:
-+ {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
-+#endif /* CONFIG_IPV6 */
-+ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* We have to manually parse the options if we got two of them. */
-+ if (mopt->saw_add_addr) {
-+ mopt->more_add_addr = 1;
-+ break;
-+ }
-+ mopt->saw_add_addr = 1;
-+ mopt->add_addr_ptr = ptr;
-+ break;
-+ }
-+ case MPTCP_SUB_REMOVE_ADDR:
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
-+ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (mopt->saw_rem_addr) {
-+ mopt->more_rem_addr = 1;
-+ break;
-+ }
-+ mopt->saw_rem_addr = 1;
-+ mopt->rem_addr_ptr = ptr;
-+ break;
-+ case MPTCP_SUB_PRIO:
-+ {
-+ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_PRIO &&
-+ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mptcp_debug("%s: mp_prio: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->saw_low_prio = 1;
-+ mopt->low_prio = mpprio->b;
-+
-+ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mopt->saw_low_prio = 2;
-+ mopt->prio_addr_id = mpprio->addr_id;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_FAIL:
-+ if (opsize != MPTCP_SUB_LEN_FAIL) {
-+ mptcp_debug("%s: mp_fail: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+ mopt->mp_fail = 1;
-+ break;
-+ case MPTCP_SUB_FCLOSE:
-+ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
-+ mptcp_debug("%s: mp_fclose: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->mp_fclose = 1;
-+ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
-+
-+ break;
-+ default:
-+ mptcp_debug("%s: Received unknown subtype: %d\n",
-+ __func__, mp_opt->sub);
-+ break;
-+ }
-+}
-+
-+/** Parse only MPTCP options */
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+ const unsigned char *ptr = (const unsigned char *)(th + 1);
-+
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP)
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+}
-+
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *sk;
-+ u32 rtt_max = 0;
-+
-+ /* In MPTCP, we take the max delay across all flows,
-+ * in order to take into account meta-reordering buffers.
-+ */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (!mptcp_sk_can_recv(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
-+ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
-+ }
-+ if (time < (rtt_max >> 3) || !rtt_max)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ __be16 port = 0;
-+ union inet_addr addr;
-+ sa_family_t family;
-+
-+ if (mpadd->ipver == 4) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+ port = mpadd->u.v4.port;
-+ family = AF_INET;
-+ addr.in = mpadd->u.v4.addr;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (mpadd->ipver == 6) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
-+ port = mpadd->u.v6.port;
-+ family = AF_INET6;
-+ addr.in6 = mpadd->u.v6.addr;
-+#endif /* CONFIG_IPV6 */
-+ } else {
-+ return;
-+ }
-+
-+ if (mpcb->pm_ops->add_raddr)
-+ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
-+}
-+
-+static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ int i;
-+ u8 rem_id;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
-+ rem_id = (&mprem->addrs_id)[i];
-+
-+ if (mpcb->pm_ops->rem_raddr)
-+ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
-+ mptcp_send_reset_rem_id(mpcb, rem_id);
-+ }
-+}
-+
-+static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether ADD_ADDR is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP:
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2)
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+#endif /* CONFIG_IPV6 */
-+ goto cont;
-+
-+ mptcp_handle_add_addr(ptr, sk);
-+ }
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
-+ goto cont;
-+
-+ mptcp_handle_rem_addr(ptr, sk);
-+ }
-+cont:
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return;
-+}
-+
-+static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
-+{
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (unlikely(mptcp->rx_opt.mp_fail)) {
-+ mptcp->rx_opt.mp_fail = 0;
-+
-+ if (!th->rst && !mpcb->infinite_mapping_snd) {
-+ struct sock *sk_it;
-+
-+ mpcb->send_infinite_mapping = 1;
-+ /* We resend everything that has not been acknowledged */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+
-+ /* We artificially restart the whole send-queue. Thus,
-+ * it is as if no packets are in flight
-+ */
-+ tcp_sk(meta_sk)->packets_out = 0;
-+
-+ /* If the snd_nxt already wrapped around, we have to
-+ * undo the wrapping, as we are restarting from snd_una
-+ * on.
-+ */
-+ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ }
-+ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
-+
-+ /* Trigger a sending on the meta. */
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (sk != sk_it)
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+
-+ return 0;
-+ }
-+
-+ if (unlikely(mptcp->rx_opt.mp_fclose)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp->rx_opt.mp_fclose = 0;
-+ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
-+ return 0;
-+
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
-+ mptcp_sub_force_close(sk_it);
-+
-+ tcp_reset(meta_sk);
-+
-+ return 1;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline void mptcp_path_array_check(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+
-+ if (unlikely(mpcb->list_rcvd)) {
-+ mpcb->list_rcvd = 0;
-+ if (mpcb->pm_ops->new_remote_address)
-+ mpcb->pm_ops->new_remote_address(meta_sk);
-+ }
-+}
-+
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
-+ return 0;
-+
-+ if (mptcp_mp_fail_rcvd(sk, th))
-+ return 1;
-+
-+ /* RFC 6824, Section 3.3:
-+ * If a checksum is not present when its use has been negotiated, the
-+ * receiver MUST close the subflow with a RST as it is considered broken.
-+ */
-+ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
-+ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
-+ if (tcp_need_reset(sk->sk_state))
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* We have to acknowledge retransmissions of the third
-+ * ack.
-+ */
-+ if (mopt->join_ack) {
-+ tcp_send_delayed_ack(sk);
-+ mopt->join_ack = 0;
-+ }
-+
-+ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
-+ if (mopt->more_add_addr || mopt->more_rem_addr) {
-+ mptcp_parse_addropt(skb, sk);
-+ } else {
-+ if (mopt->saw_add_addr)
-+ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
-+ if (mopt->saw_rem_addr)
-+ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
-+ }
-+
-+ mopt->more_add_addr = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ }
-+ if (mopt->saw_low_prio) {
-+ if (mopt->saw_low_prio == 1) {
-+ tp->mptcp->rcv_low_prio = mopt->low_prio;
-+ } else {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
-+ if (mptcp->rem_id == mopt->prio_addr_id)
-+ mptcp->rcv_low_prio = mopt->low_prio;
-+ }
-+ }
-+ mopt->saw_low_prio = 0;
-+ }
-+
-+ mptcp_data_ack(sk, skb);
-+
-+ mptcp_path_array_check(mptcp_meta_sk(sk));
-+ /* Socket may have been mp_killed by a REMOVE_ADDR */
-+ if (tp->mp_killed)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+/* In case of fastopen, some data can already be in the write queue.
-+ * We need to update the sequence numbers of these segments, as they
-+ * were initially assigned TCP (subflow) sequence numbers.
-+ */
-+static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
-+ struct sk_buff *skb;
-+ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
-+
-+ /* There should only be one skb in write queue: the data not
-+ * acknowledged in the SYN+ACK. In this case, we need to map
-+ * this data to data sequence numbers.
-+ */
-+ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
-+ /* If the server only acknowledges partially the data sent in
-+ * the SYN, we need to trim the acknowledged part because
-+ * we don't want to retransmit this already received data.
-+ * When we reach this point, tcp_ack() has already cleaned up
-+ * fully acked segments. However, tcp trims partially acked
-+ * segments only when retransmitting. Since MPTCP comes into
-+ * play only now, we will fake an initial transmit, and
-+ * retransmit_skb() will not be called. The following fragment
-+ * comes from __tcp_retransmit_skb().
-+ */
-+ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
-+ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
-+ master_tp->snd_una));
-+			/* tcp_trim_head can only return ENOMEM if the skb is
-+			 * cloned, which is not the case here (see
-+			 * tcp_send_syn_data).
-+			 */
-+ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
-+ TCP_SKB_CB(skb)->seq));
-+ }
-+
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* We can advance write_seq by the number of bytes unacknowledged
-+ * and that were mapped in the previous loop.
-+ */
-+ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
-+
-+	/* The packets from the master_sk will be entailed to it later.
-+	 * Until then, its write queue is empty, and write_seq must
-+	 * align with snd_una.
-+	 */
-+ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
-+ master_tp->packets_out = 0;
-+
-+	/* Although this data has already been sent over the subsk, it has
-+	 * never been sent over the meta_sk, so we rewind the send_head so
-+	 * that tcp considers it an initial send (instead of a retransmit).
-+	 */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+}
-+
-+/* The skptr is needed, because if we become MPTCP-capable, we have to switch
-+ * from meta-socket to master-socket.
-+ *
-+ * @return: 1 - we want to reset this connection
-+ * 2 - we want to discard the received syn/ack
-+ * 0 - everything is fine - continue
-+ */
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (mptcp(tp)) {
-+ u8 hash_mac_check[20];
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+ if (memcmp(hash_mac_check,
-+ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* Set this flag in order to postpone data sending
-+ * until the 4th ack arrives.
-+ */
-+ tp->mptcp->pre_established = 1;
-+ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u32 *)&tp->mptcp->sender_mac[0]);
-+
-+ } else if (mopt->saw_mpc) {
-+ struct sock *meta_sk = sk;
-+
-+ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
-+ ntohs(tcp_hdr(skb)->window)))
-+ return 2;
-+
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ *skptr = sk;
-+ tp = tcp_sk(sk);
-+
-+ /* If fastopen was used data might be in the send queue. We
-+ * need to update their sequence number to MPTCP-level seqno.
-+ * Note that it can happen in rare cases that fastopen_req is
-+ * NULL and syn_data is 0 but fastopen indeed occurred and
-+ * data has been queued in the write queue (but not sent).
-+ * Example of such rare cases: connect is non-blocking and
-+ * TFO is configured to work without cookies.
-+ */
-+ if (!skb_queue_empty(&meta_sk->sk_write_queue))
-+ mptcp_rcv_synsent_fastopen(meta_sk);
-+
-+ /* -1, because the SYN consumed 1 byte. In case of TFO, we
-+ * start the subflow-sequence number as if the data of the SYN
-+ * is not part of any mapping.
-+ */
-+ tp->mptcp->snt_isn = tp->snd_una - 1;
-+ tp->mpcb->dss_csum = mopt->dss_csum;
-+ tp->mptcp->include_mpc = 1;
-+
-+ /* Ensure that fastopen is handled at the meta-level. */
-+ tp->fastopen_req = NULL;
-+
-+ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
-+ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
-+
-+		/* Drop the extra reference held since sk_clone_lock
-+		 * initialized the refcount to 2.
-+		 */
-+ sock_put(sk);
-+ } else {
-+ tp->request_mptcp = 0;
-+
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+ }
-+
-+ if (mptcp(tp))
-+ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
-+
-+bool mptcp_should_expand_sndbuf(const struct sock *sk)
-+{
-+ const struct sock *sk_it;
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int cnt_backups = 0;
-+ int backup_available = 0;
-+
-+ /* We circumvent this check in tcp_check_space, because we want to
-+ * always call sk_write_space. So, we reproduce the check here.
-+ */
-+ if (!meta_sk->sk_socket ||
-+ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ return false;
-+
-+ /* If the user specified a specific send buffer setting, do
-+ * not modify it.
-+ */
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return false;
-+
-+ /* If we are under global TCP memory pressure, do not expand. */
-+ if (sk_under_memory_pressure(meta_sk))
-+ return false;
-+
-+ /* If we are under soft global TCP memory pressure, do not expand. */
-+ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
-+ return false;
-+
-+
-+ /* For MPTCP we look for a subsocket that could send data.
-+ * If we found one, then we update the send-buffer.
-+ */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ /* Backup-flows have to be counted - if there is no other
-+ * subflow we take the backup-flow into account.
-+ */
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if (tp_it->packets_out < tp_it->snd_cwnd) {
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
-+ backup_available = 1;
-+ continue;
-+ }
-+ return true;
-+ }
-+ }
-+
-+ /* Backup-flow is available for sending - update send-buffer */
-+ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
-+ return true;
-+ return false;
-+}
-+
-+void mptcp_init_buffer_space(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int space;
-+
-+ tcp_init_buffer_space(sk);
-+
-+ if (is_master_tp(tp)) {
-+ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
-+ meta_tp->rcvq_space.time = tcp_time_stamp;
-+ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
-+
-+ /* If there is only one subflow, we just use regular TCP
-+ * autotuning. User-locks are handled already by
-+ * tcp_init_buffer_space
-+ */
-+ meta_tp->window_clamp = tp->window_clamp;
-+ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
-+ meta_sk->sk_sndbuf = sk->sk_sndbuf;
-+
-+ return;
-+ }
-+
-+ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
-+ goto snd_buf;
-+
-+ /* Adding a new subflow to the rcv-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
-+ if (space > meta_sk->sk_rcvbuf) {
-+ meta_tp->window_clamp += tp->window_clamp;
-+ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = space;
-+ }
-+
-+snd_buf:
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return;
-+
-+ /* Adding a new subflow to the send-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
-+ if (space > meta_sk->sk_sndbuf) {
-+ meta_sk->sk_sndbuf = space;
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+}
-+
-+void mptcp_tcp_set_rto(struct sock *sk)
-+{
-+ tcp_set_rto(sk);
-+ mptcp_set_rto(sk);
-+}
-diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
-new file mode 100644
-index 000000000000..1183d1305d35
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv4.c
-@@ -0,0 +1,483 @@
-+/*
-+ * MPTCP implementation - IPv4-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/ip.h>
-+#include <linux/list.h>
-+#include <linux/skbuff.h>
-+#include <linux/spinlock.h>
-+#include <linux/tcp.h>
-+
-+#include <net/inet_common.h>
-+#include <net/inet_connection_sock.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/request_sock.h>
-+#include <net/tcp.h>
-+
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+
-+static void mptcp_v4_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v4_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.ip = inet_rsk(req)->ir_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_request_sock_ops */
-+struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
-+ .family = PF_INET,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_rtx_synack,
-+ .send_ack = tcp_v4_reqsk_send_ack,
-+ .destructor = mptcp_v4_reqsk_destructor,
-+ .send_reset = tcp_v4_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+/* Similar to tcp_v4_conn_request */
-+static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_join_request_sock_ipv4_ops,
-+ meta_sk, skb);
-+}
-+
-+/* We only process join requests here. (either the SYN or the final ACK) */
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
-+ iph->saddr, th->source, iph->daddr,
-+ th->dest, inet_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+			WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v4_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v4_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we hold
-+ * already the meta-sk-lock and are sure that it is not owned
-+ * by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v4_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet_csk_search_req(meta_sk, &prev, th->source,
-+ iph->saddr, iph->daddr);
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v4_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (ireq->ir_rmt_port == rport &&
-+ ireq->ir_rmt_addr == raddr &&
-+ ireq->ir_loc_addr == laddr &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
-+
-+/* Create a new IPv4 subflow.
-+ *
-+ * We are in user-context and the meta-sock lock is held.
-+ */
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin_family = AF_INET;
-+ rem_in.sin_family = AF_INET;
-+ loc_in.sin_port = 0;
-+ if (rem->port)
-+ rem_in.sin_port = rem->port;
-+ else
-+ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin_addr = loc->addr;
-+ rem_in.sin_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin_addr,
-+ ntohs(loc_in.sin_port), &rem_in.sin_addr,
-+ ntohs(rem_in.sin_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init4_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v4_specific = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v4_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ip_setsockopt,
-+ .getsockopt = ip_getsockopt,
-+ .addr2sockaddr = inet_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in),
-+ .bind_conflict = inet_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ip_setsockopt,
-+ .compat_getsockopt = compat_ip_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+/* General initialization of IPv4 for MPTCP */
-+int mptcp_pm_v4_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp_request_sock_ops;
-+
-+ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
-+
-+ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
-+ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v4_undo(void)
-+{
-+ kmem_cache_destroy(mptcp_request_sock_ops.slab);
-+ kfree(mptcp_request_sock_ops.slab_name);
-+}
-diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
-new file mode 100644
-index 000000000000..1036973aa855
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv6.c
-@@ -0,0 +1,518 @@
-+/*
-+ * MPTCP implementation - IPv6-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/in6.h>
-+#include <linux/kernel.h>
-+
-+#include <net/addrconf.h>
-+#include <net/flow.h>
-+#include <net/inet6_connection_sock.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/inet_common.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/ip6_route.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
-+#include <net/tcp.h>
-+#include <net/transp_v6.h>
-+
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+static void mptcp_v6_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v6_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp6_request_sock_ops */
-+struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
-+ .family = AF_INET6,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .send_ack = tcp_v6_reqsk_send_ack,
-+ .destructor = mptcp_v6_reqsk_destructor,
-+ .send_reset = tcp_v6_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_join_request_sock_ipv6_ops,
-+ meta_sk, skb);
-+}
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = __inet6_lookup_established(sock_net(meta_sk),
-+ &tcp_hashinfo,
-+ &ip6h->saddr, th->source,
-+ &ip6h->daddr, ntohs(th->dest),
-+ inet6_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+ WARN("%s Did not find a sub-sk!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v6_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v6_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we already hold
-+ * the meta-sk-lock and are sure that it is not owned by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v6_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet6_csk_search_req(meta_sk, &prev, th->source,
-+ &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v6_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
-+ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
-+ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU, so it might have been recycled
-+ * and put into another hash-table list. After the lookup we may therefore
-+ * end up in a different list and need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
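The nulls-marker test that ends this lookup is what lets the lockless reader detect that a recycled request-sock moved it onto another chain. A minimal userspace sketch of just that check, where `DEMO_NULLS_BASE` is an illustrative stand-in for `MPTCP_REQSK_NULLS_BASE` from the MPTCP headers:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in; the real MPTCP_REQSK_NULLS_BASE is
 * defined in the MPTCP headers. */
#define DEMO_NULLS_BASE (1U << 29)

/* Each bucket of an hlist_nulls hash table ends in a distinct
 * "nulls" marker encoding the bucket index. A lockless reader
 * whose entry was recycled onto another chain finishes its walk
 * on a marker for the wrong bucket and knows it must restart,
 * which is the get_nulls_value() check in mptcp_v6_search_req(). */
static int demo_walk_ended_in_bucket(uint32_t nulls_value, uint32_t hash)
{
	return nulls_value == hash + DEMO_NULLS_BASE;
}
```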
-+
-+/* Create a new IPv6 subflow.
-+ *
-+ * We are in user-context and the meta-sock-lock is held.
-+ */
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in6 loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin6_family = AF_INET6;
-+ rem_in.sin6_family = AF_INET6;
-+ loc_in.sin6_port = 0;
-+ if (rem->port)
-+ rem_in.sin6_port = rem->port;
-+ else
-+ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin6_addr = loc->addr;
-+ rem_in.sin6_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin6_addr,
-+ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
-+ ntohs(rem_in.sin6_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in6), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init6_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_specific = {
-+ .queue_xmit = inet6_csk_xmit,
-+ .send_check = tcp_v6_send_check,
-+ .rebuild_header = inet6_sk_rebuild_header,
-+ .sk_rx_dst_set = inet6_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct ipv6hdr),
-+ .net_frag_header_len = sizeof(struct frag_hdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_pm_v6_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
-+
-+ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
-+
-+ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
-+ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v6_undo(void)
-+{
-+ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
-+ kfree(mptcp6_request_sock_ops.slab_name);
-+}
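mptcp_pm_v6_init() above unwinds with goto labels so each failure path frees exactly what was already allocated. A hedged userspace model of that shape, where the demo_* names are invented and malloc/strcpy stand in for kasprintf, with `fail_slab` simulating kmem_cache_create() failing:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate in order, unwind in reverse through labels so each
 * failure frees only what was already set up. */
static int demo_pm_init(char **name_out, int fail_slab)
{
	char *slab_name = malloc(32);

	if (!slab_name)
		return -1;		/* -ENOMEM */
	strcpy(slab_name, "request_sock_MPTCP6");

	if (fail_slab)
		goto err_reqsk_create;	/* kmem_cache_create() failed */

	*name_out = slab_name;
	return 0;

err_reqsk_create:
	free(slab_name);
	return -1;
}
```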
-diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
-new file mode 100644
-index 000000000000..6f5087983175
---- /dev/null
-+++ b/net/mptcp/mptcp_ndiffports.c
-@@ -0,0 +1,161 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+
-+struct ndiffports_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+};
-+
-+static int num_subflows __read_mostly = 2;
-+module_param(num_subflows, int, 0644);
-+MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
-+
-+/**
-+ * Create all new subflows by calling mptcp_initX_subsockets.
-+ *
-+ * This function uses a "goto next_subflow" to allow releasing the lock
-+ * between new subflows, giving other processes a chance to do some work
-+ * on the socket and potentially finish the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct ndiffports_priv *pm_priv = container_of(work,
-+ struct ndiffports_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+ } else {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mptcp_loc6 loc;
-+ struct mptcp_rem6 rem;
-+
-+ loc.addr = inet6_sk(meta_sk)->saddr;
-+ loc.loc6_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr = meta_sk->sk_v6_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem6_id = 0; /* Default 0 */
-+
-+ mptcp_init6_subsockets(meta_sk, &loc, &rem);
-+#endif
-+ }
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void ndiffports_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+}
-+
-+static void ndiffports_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+static struct mptcp_pm_ops ndiffports __read_mostly = {
-+ .new_session = ndiffports_new_session,
-+ .fully_established = ndiffports_create_subflows,
-+ .get_local_id = ndiffports_get_local_id,
-+ .name = "ndiffports",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init ndiffports_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
-+
-+ if (mptcp_register_path_manager(&ndiffports))
-+ goto exit;
-+
-+ return 0;
-+
-+exit:
-+ return -1;
-+}
-+
-+static void ndiffports_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&ndiffports);
-+}
-+
-+module_init(ndiffports_register);
-+module_exit(ndiffports_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
-+MODULE_VERSION("0.88");
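create_subflow_worker() above drops and re-takes the socket lock on every pass so other work on the socket can interleave. A simplified userspace model of that control flow, with the mutex and socket lock collapsed into a single depth counter; the demo_* names are invented:

```c
#include <assert.h>

struct demo_mpcb {
	int cnt_subflows;
	int lock_depth;
};

/* Each pass (re)takes the lock, creates at most one subflow, and
 * releases the lock before looping via goto, mirroring the
 * next_subflow loop in the patch. */
static int demo_worker(struct demo_mpcb *m, int num_subflows)
{
	int iter = 0, created = 0;

next_subflow:
	if (iter)
		m->lock_depth--;	/* release between subflows */
	m->lock_depth++;		/* mutex_lock + lock_sock */
	iter++;

	if (num_subflows > iter && num_subflows > m->cnt_subflows) {
		m->cnt_subflows++;	/* mptcp_initX_subsockets() */
		created++;
		goto next_subflow;
	}

	m->lock_depth--;		/* final release */
	return created;
}
```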
-diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
-new file mode 100644
-index 000000000000..ec4e98622637
---- /dev/null
-+++ b/net/mptcp/mptcp_ofo_queue.c
-@@ -0,0 +1,295 @@
-+/*
-+ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <linux/slab.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp;
-+
-+ mptcp_for_each_tp(mpcb, tp) {
-+ if (tp->mptcp->shortcut_ofoqueue == skb) {
-+ tp->mptcp->shortcut_ofoqueue = NULL;
-+ return;
-+ }
-+ }
-+}
-+
-+/* Does 'skb' fit after 'here' in the queue 'head'?
-+ * If yes, we queue it and return 1.
-+ */
-+static int mptcp_ofo_queue_after(struct sk_buff_head *head,
-+ struct sk_buff *skb, struct sk_buff *here,
-+ const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We want to queue skb after here, thus we need seq >= here->end_seq */
-+ if (before(seq, TCP_SKB_CB(here)->end_seq))
-+ return 0;
-+
-+ if (seq == TCP_SKB_CB(here)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
-+ return 1;
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ return -1;
-+ }
-+ }
-+
-+ /* If here is the last one, we can always queue it */
-+ if (skb_queue_is_last(head, here)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ } else {
-+ struct sk_buff *skb1 = skb_queue_next(head, here);
-+ /* It's not the last one, but does it fit between 'here' and
-+ * the one after 'here'? Thus, does end_seq <= after_here->seq?
-+ */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ }
-+ }
-+
-+ return 0;
-+}
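The checks in mptcp_ofo_queue_after() rely on the wraparound-safe 32-bit sequence comparisons provided by before()/after() in net/tcp.h. A minimal userspace sketch of that arithmetic (demo_* names are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* The signed difference classifies any gap smaller than 2^31
 * correctly, even across the 0xffffffff -> 0 wrap. */
static inline int demo_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

static inline int demo_after(uint32_t seq1, uint32_t seq2)
{
	return demo_before(seq2, seq1);
}

/* The first check above: skb may sit after 'here' only if
 * skb->seq is not before here->end_seq. */
static inline int demo_fits_after(uint32_t seq, uint32_t here_end_seq)
{
	return !demo_before(seq, here_end_seq);
}
```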
-+
-+static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
-+ struct sk_buff_head *head, struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb1, *best_shortcut = NULL;
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+ u32 distance = 0xffffffff;
-+
-+ /* First, check the tp's shortcut */
-+ if (!shortcut) {
-+ if (skb_queue_empty(head)) {
-+ __skb_queue_head(head, skb);
-+ goto end;
-+ }
-+ } else {
-+ /* Is the tp's shortcut a hit? If yes, we insert. */
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Check the shortcuts of the other subsockets. */
-+ mptcp_for_each_tp(mpcb, tp_it) {
-+ shortcut = tp_it->mptcp->shortcut_ofoqueue;
-+ /* Can we queue it here? If yes, do so! */
-+ if (shortcut) {
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Could not queue it, check if we are close.
-+ * We are looking for a shortcut, close enough to seq to
-+ * set skb1 prematurely and thus improve the subsequent lookup,
-+ * which tries to find a skb1 so that skb1->seq <= seq.
-+ *
-+ * So, here we only take shortcuts whose shortcut->seq > seq,
-+ * and minimize the distance between shortcut->seq and seq and
-+ * set best_shortcut to this one with the minimal distance.
-+ *
-+ * That way, the subsequent while-loop is as short as possible.
-+ */
-+ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
-+ /* Are we closer than the current best shortcut? */
-+ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
-+ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
-+ best_shortcut = shortcut;
-+ }
-+ }
-+ }
-+
-+ if (best_shortcut)
-+ skb1 = best_shortcut;
-+ else
-+ skb1 = skb_peek_tail(head);
-+
-+ if (seq == TCP_SKB_CB(skb1)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ skb = NULL;
-+ }
-+
-+ goto end;
-+ }
-+
-+ /* Find the insertion point, starting from best_shortcut if available.
-+ *
-+ * Inspired by tcp_data_queue_ofo.
-+ */
-+ while (1) {
-+ /* skb1->seq <= seq */
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(head, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+
-+ /* Does skb overlap the previous one? */
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. */
-+ __kfree_skb(skb);
-+ skb = NULL;
-+ goto end;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(head, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(head, skb);
-+ else
-+ __skb_queue_after(head, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(head, skb)) {
-+ skb1 = skb_queue_next(head, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, head);
-+ mptcp_remove_shortcuts(mpcb, skb1);
-+ __kfree_skb(skb1);
-+ }
-+
-+end:
-+ if (skb) {
-+ skb_set_owner_r(skb, meta_sk);
-+ tp->mptcp->shortcut_ofoqueue = skb;
-+ }
-+
-+ return;
-+}
-+
-+/**
-+ * mptcp_add_meta_ofo_queue - queue an out-of-order skb at the meta-level
-+ * @meta_sk: the meta-socket
-+ * @skb: the out-of-order segment
-+ * @sk: the subflow that received this skb
-+ */
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
-+ &tcp_sk(meta_sk)->out_of_order_queue, tp);
-+}
-+
-+bool mptcp_prune_ofo_queue(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ bool res = false;
-+
-+ if (!skb_queue_empty(&tp->out_of_order_queue)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-+ mptcp_purge_ofo_queue(tp);
-+
-+ /* No sack at the mptcp-level */
-+ sk_mem_reclaim(sk);
-+ res = true;
-+ }
-+
-+ return res;
-+}
-+
-+void mptcp_ofo_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
-+ break;
-+
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+
-+ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
-+ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+ if (tcp_hdr(skb)->fin)
-+ mptcp_fin(meta_sk);
-+ }
-+}
-+
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
-+{
-+ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
-+ struct sk_buff *skb, *tmp;
-+
-+ skb_queue_walk_safe(head, skb, tmp) {
-+ __skb_unlink(skb, head);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ kfree_skb(skb);
-+ }
-+}
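mptcp_ofo_queue() above delivers in-order segments from the sorted out-of-order queue, dropping full duplicates and stopping at the first gap. A userspace model of that loop over a plain array (demo_* names are invented):

```c
#include <assert.h>
#include <stdint.h>

struct demo_seg {
	uint32_t seq;
	uint32_t end_seq;
};

/* Walk the sorted out-of-order list: drop segments entirely below
 * rcv_nxt, deliver overlapping or adjacent ones, stop at the
 * first gap. Returns the advanced rcv_nxt. */
static uint32_t demo_ofo_deliver(const struct demo_seg *q, int n,
				 uint32_t rcv_nxt)
{
	int i;

	for (i = 0; i < n; i++) {
		if ((int32_t)(q[i].seq - rcv_nxt) > 0)
			break;			/* gap: stop delivering */
		if ((int32_t)(q[i].end_seq - rcv_nxt) <= 0)
			continue;		/* full duplicate: drop */
		rcv_nxt = q[i].end_seq;		/* deliver and advance */
	}
	return rcv_nxt;
}
```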
-diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
-new file mode 100644
-index 000000000000..53f5c43bb488
---- /dev/null
-+++ b/net/mptcp/mptcp_olia.c
-@@ -0,0 +1,311 @@
-+/*
-+ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
-+ *
-+ * Algorithm design:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ * Nicolas Gast <nicolas.gast@epfl.ch>
-+ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
-+ *
-+ * Implementation:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+static int scale = 10;
-+
-+struct mptcp_olia {
-+ u32 mptcp_loss1;
-+ u32 mptcp_loss2;
-+ u32 mptcp_loss3;
-+ int epsilon_num;
-+ u32 epsilon_den;
-+ int mptcp_snd_cwnd_cnt;
-+};
-+
-+static inline int mptcp_olia_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_olia_scale(u64 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+/* Account for the artificial inflation of cwnd (see RFC 5681)
-+ * during the fast-retransmit phase.
-+ */
-+static u32 mptcp_get_crt_cwnd(struct sock *sk)
-+{
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (icsk->icsk_ca_state == TCP_CA_Recovery)
-+ return tcp_sk(sk)->snd_ssthresh;
-+ else
-+ return tcp_sk(sk)->snd_cwnd;
-+}
-+
-+/* Return the denominator of the first term of the increase term. */
-+static u64 mptcp_get_rate(const struct mptcp_cb *mpcb, u32 path_rtt)
-+{
-+ struct sock *sk;
-+ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u64 scaled_num;
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
-+ rate += div_u64(scaled_num, tp->srtt_us);
-+ }
-+ rate *= rate;
-+ return rate;
-+}
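mptcp_get_rate() computes the denominator of OLIA's increase term in fixed point: the sum of each subflow's cwnd scaled by the RTT ratio, then squared. A userspace sketch of the same arithmetic, with demo_* names and example values that are not from the patch:

```c
#include <assert.h>
#include <stdint.h>

struct demo_flow {
	uint32_t cwnd;
	uint32_t srtt_us;
};

/* (sum_i cwnd_i * path_rtt / rtt_i)^2 in fixed point: each cwnd
 * is shifted left by 'scale' bits before the division. The seed
 * of 1 avoids a zero divisor, as in the patch. */
static uint64_t demo_get_rate(const struct demo_flow *f, int n,
			      uint32_t path_rtt, int scale)
{
	uint64_t rate = 1;
	int i;

	for (i = 0; i < n; i++)
		rate += ((uint64_t)f[i].cwnd << scale) * path_rtt / f[i].srtt_us;
	return rate * rate;
}
```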
-+
-+/* find the maximum cwnd, used to find set M */
-+static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
-+{
-+ struct sock *sk;
-+ u32 best_cwnd = 0;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd > best_cwnd)
-+ best_cwnd = tmp_cwnd;
-+ }
-+ return best_cwnd;
-+}
-+
-+static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_olia *ca;
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
-+ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
-+ u8 M = 0, B_not_M = 0;
-+
-+ /* TODO - integrate this in the following loop - we just want to iterate once */
-+
-+ max_cwnd = mptcp_get_max_cwnd(mpcb);
-+
-+ /* find the best path */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ /* TODO - check here and rename variables */
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
-+ best_rtt = tmp_rtt;
-+ best_int = tmp_int;
-+ best_cwnd = tmp_cwnd;
-+ }
-+ }
-+
-+ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
-+ /* find the size of M and B_not_M */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd == max_cwnd) {
-+ M++;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
-+ B_not_M++;
-+ }
-+ }
-+
-+ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ if (B_not_M == 0) {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+
-+ if (tmp_cwnd < max_cwnd &&
-+ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
-+ ca->epsilon_num = 1;
-+ ca->epsilon_den = mpcb->cnt_established * B_not_M;
-+ } else if (tmp_cwnd == max_cwnd) {
-+ ca->epsilon_num = -1;
-+ ca->epsilon_den = mpcb->cnt_established * M;
-+ } else {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+ }
-+ }
-+}
-+
-+/* setting the initial values */
-+static void mptcp_olia_init(struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (mptcp(tp)) {
-+ ca->mptcp_loss1 = tp->snd_una;
-+ ca->mptcp_loss2 = tp->snd_una;
-+ ca->mptcp_loss3 = tp->snd_una;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+}
-+
-+/* updating inter-loss distance and ssthresh */
-+static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ if (new_state == TCP_CA_Loss ||
-+ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
-+ !inet_csk(sk)->icsk_retransmits) {
-+ ca->mptcp_loss1 = ca->mptcp_loss2;
-+ ca->mptcp_loss2 = ca->mptcp_loss3;
-+ }
-+ }
-+}
-+
-+/* main algorithm */
-+static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ u64 inc_num, inc_den, rate, cwnd_scaled;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ ca->mptcp_loss3 = tp->snd_una;
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ /* slow start if it is in the safe area */
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ return;
-+ }
-+
-+ mptcp_get_epsilon(mpcb);
-+ rate = mptcp_get_rate(mpcb, tp->srtt_us);
-+ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
-+ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
-+
-+ /* Calculate the increase term; scaling is used to reduce the rounding effect. */
-+ if (ca->epsilon_num == -1) {
-+ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
-+ inc_num = rate - ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt -= div64_u64(
-+ mptcp_olia_scale(inc_num, scale), inc_den);
-+ } else {
-+ inc_num = ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled - rate;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num, scale), inc_den);
-+ }
-+ } else {
-+ inc_num = ca->epsilon_num * rate +
-+ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num, scale), inc_den);
-+ }
-+
-+
-+ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
-+ tp->snd_cwnd++;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
-+ tp->snd_cwnd = max((int)1, (int)tp->snd_cwnd - 1);
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ }
-+}
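The counter at the end of mptcp_olia_cong_avoid() accumulates scaled fractional adjustments and moves cwnd by one full segment only when the counter crosses +/-((1 << scale) - 1). A minimal model of that update (demo_* names are invented):

```c
#include <assert.h>
#include <stdint.h>

struct demo_cc {
	uint32_t cwnd;
	int cnt;	/* scaled credit, like mptcp_snd_cwnd_cnt */
};

/* Per-ack adjustments accumulate in a counter scaled by 2^scale;
 * cwnd changes by one segment when the counter saturates, and the
 * counter is then reset. cwnd never drops below 1 and never grows
 * past the clamp. */
static void demo_apply_delta(struct demo_cc *cc, int delta, int scale,
			     uint32_t clamp)
{
	cc->cnt += delta;
	if (cc->cnt >= (1 << scale) - 1) {
		if (cc->cwnd < clamp)
			cc->cwnd++;
		cc->cnt = 0;
	} else if (cc->cnt <= -(1 << scale) + 1) {
		if (cc->cwnd > 1)
			cc->cwnd--;
		cc->cnt = 0;
	}
}
```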
-+
-+static struct tcp_congestion_ops mptcp_olia = {
-+ .init = mptcp_olia_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_olia_cong_avoid,
-+ .set_state = mptcp_olia_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "olia",
-+};
-+
-+static int __init mptcp_olia_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_olia);
-+}
-+
-+static void __exit mptcp_olia_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_olia);
-+}
-+
-+module_init(mptcp_olia_register);
-+module_exit(mptcp_olia_unregister);
-+
-+MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
-new file mode 100644
-index 000000000000..400ea254c078
---- /dev/null
-+++ b/net/mptcp/mptcp_output.c
-@@ -0,0 +1,1743 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/kconfig.h>
-+#include <linux/skbuff.h>
-+#include <linux/tcp.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+#include <net/sock.h>
-+
-+static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
-+ MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+
-+static inline int mptcp_sub_len_remove_addr(u16 bitfield)
-+{
-+ unsigned int c;
-+ for (c = 0; bitfield; c++)
-+ bitfield &= bitfield - 1;
-+ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
-+}
-+
-+int mptcp_sub_len_remove_addr_align(u16 bitfield)
-+{
-+ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
-+}
-+EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
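mptcp_sub_len_remove_addr() uses Kernighan's bit trick: `bitfield &= bitfield - 1` clears the lowest set bit, so the loop runs once per address id announced in the REMOVE_ADDR option. A userspace sketch, where `DEMO_SUB_LEN_REMOVE_ADDR = 4` and the power-of-two alignment macro are assumptions standing in for the kernel's constants:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins; the real MPTCP_SUB_LEN_REMOVE_ADDR and
 * ALIGN() are defined in the kernel headers. */
#define DEMO_SUB_LEN_REMOVE_ADDR 4
#define DEMO_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Count the set bits (announced address ids) and derive the
 * option length, exactly as the patch does. */
static int demo_sub_len_remove_addr(uint16_t bitfield)
{
	unsigned int c;

	for (c = 0; bitfield; c++)
		bitfield &= bitfield - 1;	/* clear lowest set bit */
	return DEMO_SUB_LEN_REMOVE_ADDR + c - 1;
}

static int demo_sub_len_remove_addr_align(uint16_t bitfield)
{
	return DEMO_ALIGN(demo_sub_len_remove_addr(bitfield), 4);
}
```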
-+
-+/* Get the data-seq and end-data-seq and store them back in the
-+ * tcp_skb_cb.
-+ */
-+static int mptcp_reconstruct_mapping(struct sk_buff *skb)
-+{
-+ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
-+ u32 *p32;
-+ u16 *p16;
-+
-+ if (!mpdss->M)
-+ return 1;
-+
-+ /* Move the pointer to the data-seq */
-+ p32 = (u32 *)mpdss;
-+ p32++;
-+ if (mpdss->A) {
-+ p32++;
-+ if (mpdss->a)
-+ p32++;
-+ }
-+
-+ TCP_SKB_CB(skb)->seq = ntohl(*p32);
-+
-+ /* Get the data_len to calculate the end_data_seq */
-+ p32++;
-+ p32++;
-+ p16 = (u16 *)p32;
-+ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
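mptcp_reconstruct_mapping() walks past the DSS option header and the optional data-ack (one 32-bit word, or two when the 'a' flag marks an 8-byte ack) before reading the data sequence number. The offset computation alone, as a sketch (the demo_* name is invented):

```c
#include <assert.h>

/* Offset, in 32-bit words from the start of the DSS option, of
 * the data sequence number, following the p32 pointer walk in
 * mptcp_reconstruct_mapping(). */
static int demo_dseq_word_offset(int has_ack, int ack_is_64bit)
{
	int off = 1;			/* past kind/length/flags word */

	if (has_ack)
		off += ack_is_64bit ? 2 : 1;
	return off;
}
```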
-+
-+static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct sk_buff *skb_it;
-+
-+ skb_it = tcp_write_queue_head(meta_sk);
-+
-+ tcp_for_write_queue_from(skb_it, meta_sk) {
-+ if (skb_it == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
-+ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
-+ break;
-+ }
-+ }
-+}
-+
-+/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
-+ * coming from the meta-retransmit-timer
-+ */
-+static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
-+ struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb, *skb1;
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u32 seq, end_seq;
-+
-+ if (clone_it) {
-+ /* pskb_copy is necessary here, because the TCP/IP-headers
-+ * will be changed when it's going to be reinjected on another
-+ * subflow.
-+ */
-+ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
-+ } else {
-+ __skb_unlink(orig_skb, &sk->sk_write_queue);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+ sk->sk_wmem_queued -= orig_skb->truesize;
-+ sk_mem_uncharge(sk, orig_skb->truesize);
-+ skb = orig_skb;
-+ }
-+ if (unlikely(!skb))
-+ return;
-+
-+ if (sk && mptcp_reconstruct_mapping(skb)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ skb->sk = meta_sk;
-+
-+	/* If it has already reached the destination, we don't have to reinject it */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ /* Only reinject segments that are fully covered by the mapping */
-+ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
-+ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ __kfree_skb(skb);
-+
-+ /* Ok, now we have to look for the full mapping in the meta
-+ * send-queue :S
-+ */
-+ tcp_for_write_queue(skb, meta_sk) {
-+ /* Not yet at the mapping? */
-+ if (before(TCP_SKB_CB(skb)->seq, seq))
-+ continue;
-+ /* We have passed by the mapping */
-+ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
-+ return;
-+
-+ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
-+ }
-+ return;
-+ }
-+
-+ /* Segment goes back to the MPTCP-layer. So, we need to zero the
-+ * path_mask/dss.
-+ */
-+	memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
-+
-+ /* We need to find out the path-mask from the meta-write-queue
-+ * to properly select a subflow.
-+ */
-+ mptcp_find_and_set_pathmask(meta_sk, skb);
-+
-+ /* If it's empty, just add */
-+ if (skb_queue_empty(&mpcb->reinject_queue)) {
-+ skb_queue_head(&mpcb->reinject_queue, skb);
-+ return;
-+ }
-+
-+ /* Find place to insert skb - or even we can 'drop' it, as the
-+ * data is already covered by other skb's in the reinject-queue.
-+ *
-+ * This is inspired by code from tcp_data_queue.
-+ */
-+
-+ skb1 = skb_peek_tail(&mpcb->reinject_queue);
-+ seq = TCP_SKB_CB(skb)->seq;
-+ while (1) {
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+
-+	/* Does skb overlap the previous one? */
-+ end_seq = TCP_SKB_CB(skb)->end_seq;
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. Don't reinject */
-+ __kfree_skb(skb);
-+ return;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(&mpcb->reinject_queue, skb);
-+ else
-+ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
-+ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, &mpcb->reinject_queue);
-+ __kfree_skb(skb1);
-+ }
-+ return;
-+}
-+
-+/* Inserts data into the reinject queue */
-+void mptcp_reinject_data(struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb_it, *tmp;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = tp->meta_sk;
-+
-+ /* It has already been closed - there is really no point in reinjecting */
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return;
-+
-+ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
-+		/* Subflow SYNs and FINs are not reinjected,
-+		 * nor are empty subflow-FINs carrying a data-fin;
-+		 * those are reinjected below (without the subflow-FIN flag).
-+		 */
-+ if (tcb->tcp_flags & TCPHDR_SYN ||
-+ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
-+ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
-+ continue;
-+
-+ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
-+ }
-+
-+ skb_it = tcp_write_queue_tail(meta_sk);
-+ /* If sk has sent the empty data-fin, we have to reinject it too. */
-+ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
-+ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
-+ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
-+ }
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ tp->pf = 1;
-+}
-+EXPORT_SYMBOL(mptcp_reinject_data);
-+
-+static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
-+ struct sock *subsk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk_it;
-+ int all_empty = 1, all_acked;
-+
-+ /* In infinite mapping we always try to combine */
-+ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ return;
-+ }
-+
-+ /* Don't combine, if they didn't combine - otherwise we end up in
-+ * TIME_WAIT, even if our app is smart enough to avoid it
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (!mpcb->dfin_combined)
-+ return;
-+ }
-+
-+ /* If no other subflow has data to send, we can combine */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ if (!tcp_write_queue_empty(sk_it))
-+ all_empty = 0;
-+ }
-+
-+ /* If all data has been DATA_ACKed, we can combine.
-+ * -1, because the data_fin consumed one byte
-+ */
-+ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
-+
-+ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ }
-+}
-+
-+static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *start = ptr;
-+ __u16 data_len;
-+
-+ *ptr++ = htonl(tcb->seq); /* data_seq */
-+
-+ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ *ptr++ = 0; /* subseq */
-+ else
-+ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
-+
-+ if (tcb->mptcp_flags & MPTCPHDR_INF)
-+ data_len = 0;
-+ else
-+ data_len = tcb->end_seq - tcb->seq;
-+
-+ if (tp->mpcb->dss_csum && data_len) {
-+ __be16 *p16 = (__be16 *)ptr;
-+ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
-+ __wsum csum;
-+
-+ *ptr = htonl(((data_len) << 16) |
-+ (TCPOPT_EOL << 8) |
-+ (TCPOPT_EOL));
-+ csum = csum_partial(ptr - 2, 12, skb->csum);
-+ p16++;
-+ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
-+ } else {
-+ *ptr++ = htonl(((data_len) << 16) |
-+ (TCPOPT_NOP << 8) |
-+ (TCPOPT_NOP));
-+ }
-+
-+ return ptr - start;
-+}
-+
-+static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ __be32 *start = ptr;
-+
-+ mdss->kind = TCPOPT_MPTCP;
-+ mdss->sub = MPTCP_SUB_DSS;
-+ mdss->rsv1 = 0;
-+ mdss->rsv2 = 0;
-+ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
-+ mdss->m = 0;
-+ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
-+ mdss->a = 0;
-+ mdss->A = 1;
-+ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
-+ ptr++;
-+
-+ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ return ptr - start;
-+}
-+
-+/* RFC6824 states that once a particular subflow mapping has been sent
-+ * out it must never be changed. However, packets may be split while
-+ * they are in the retransmission queue (due to SACK or ACKs) and that
-+ * arguably means that we would change the mapping (e.g. it splits it,
-+ * or sends out a subset of the initial mapping).
-+ *
-+ * Furthermore, the skb checksum is not always preserved across splits
-+ * (e.g. mptcp_fragment) which would mean that we need to recompute
-+ * the DSS checksum in this case.
-+ *
-+ * To avoid this we save the initial DSS mapping which allows us to
-+ * send the same DSS mapping even for fragmented retransmits.
-+ */
-+static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
-+{
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *ptr = (__be32 *)tcb->dss;
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
-+}
-+
-+/* Write the saved DSS mapping to the header */
-+static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ __be32 *start = ptr;
-+
-+ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
-+
-+ /* update the data_ack */
-+ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ /* dss is in a union with inet_skb_parm and
-+ * the IP layer expects zeroed IPCB fields.
-+ */
-+	memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
-+
-+	return mptcp_dss_len / sizeof(*ptr);
-+}
-+
-+static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb;
-+ struct sk_buff *subskb = NULL;
-+
-+ if (!reinject)
-+ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
-+ MPTCPHDR_SEQ64_INDEX : 0);
-+
-+ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
-+ if (!subskb)
-+ return false;
-+
-+	/* At the subflow-level we need to call tcp_init_tso_segs again. We
-+ * force this, by setting gso_segs to 0. It has been set to 1 prior to
-+ * the call to mptcp_skb_entail.
-+ */
-+ skb_shinfo(subskb)->gso_segs = 0;
-+
-+ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
-+
-+ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
-+ skb->ip_summed == CHECKSUM_PARTIAL) {
-+ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
-+ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
-+ }
-+
-+ tcb = TCP_SKB_CB(subskb);
-+
-+ if (tp->mpcb->send_infinite_mapping &&
-+ !tp->mpcb->infinite_mapping_snd &&
-+ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
-+ tp->mptcp->fully_established = 1;
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
-+ tcb->mptcp_flags |= MPTCPHDR_INF;
-+ }
-+
-+ if (mptcp_is_data_fin(subskb))
-+ mptcp_combine_dfin(subskb, meta_sk, sk);
-+
-+ mptcp_save_dss_data_seq(tp, subskb);
-+
-+ tcb->seq = tp->write_seq;
-+ tcb->sacked = 0; /* reset the sacked field: from the point of view
-+ * of this subflow, we are sending a brand new
-+ * segment
-+ */
-+ /* Take into account seg len */
-+ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
-+ tcb->end_seq = tp->write_seq;
-+
-+ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
-+ * segment is not part of the subflow but on a meta-only-level.
-+ */
-+ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
-+ tcp_add_write_queue_tail(sk, subskb);
-+ sk->sk_wmem_queued += subskb->truesize;
-+ sk_mem_charge(sk, subskb->truesize);
-+ } else {
-+ int err;
-+
-+ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
-+ * skb->len = 0 will force tso_segs to 1.
-+ */
-+ tcp_init_tso_segs(sk, subskb, 1);
-+		/* Empty data-fins are sent immediately on the subflow */
-+ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
-+ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
-+
-+ /* It has not been queued, we can free it now. */
-+ kfree_skb(subskb);
-+
-+ if (err)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->second_packet = 1;
-+ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
-+ }
-+
-+ return true;
-+}
-+
-+/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
-+ * might need to undo some operations done by tcp_fragment.
-+ */
-+static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
-+ gfp_t gfp, int reinject)
-+{
-+ int ret, diff, old_factor;
-+ struct sk_buff *buff;
-+ u8 flags;
-+
-+ if (skb_headlen(skb) < len)
-+ diff = skb->len - len;
-+ else
-+ diff = skb->data_len;
-+ old_factor = tcp_skb_pcount(skb);
-+
-+ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
-+ * At the MPTCP-level we do not care about the absolute value. All we
-+ * care about is that it is set to 1 for accurate packets_out
-+ * accounting.
-+ */
-+ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
-+ if (ret)
-+ return ret;
-+
-+ buff = skb->next;
-+
-+ flags = TCP_SKB_CB(skb)->mptcp_flags;
-+ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
-+ TCP_SKB_CB(buff)->mptcp_flags = flags;
-+ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
-+
-+ /* If reinject == 1, the buff will be added to the reinject
-+ * queue, which is currently not part of memory accounting. So
-+ * undo the changes done by tcp_fragment and update the
-+ * reinject queue. Also, undo changes to the packet counters.
-+ */
-+ if (reinject == 1) {
-+ int undo = buff->truesize - diff;
-+ meta_sk->sk_wmem_queued -= undo;
-+ sk_mem_uncharge(meta_sk, undo);
-+
-+ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
-+ meta_sk->sk_write_queue.qlen--;
-+
-+ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
-+ undo = old_factor - tcp_skb_pcount(skb) -
-+ tcp_skb_pcount(buff);
-+ if (undo)
-+ tcp_adjust_pcount(meta_sk, skb, -undo);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* Inspired by tcp_write_wakeup */
-+int mptcp_write_wakeup(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+ struct sock *sk_it;
-+ int ans = 0;
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return -1;
-+
-+ skb = tcp_send_head(meta_sk);
-+ if (skb &&
-+ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
-+ unsigned int mss;
-+ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
-+ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
-+ struct tcp_sock *subtp;
-+ if (!subsk)
-+ goto window_probe;
-+ subtp = tcp_sk(subsk);
-+ mss = tcp_current_mss(subsk);
-+
-+ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
-+ tcp_wnd_end(subtp) - subtp->write_seq);
-+
-+ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
-+ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We are probing the opening of a window
-+ * but the window size is != 0
-+		 * must have been a result of SWS avoidance (sender)
-+ */
-+ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
-+ skb->len > mss) {
-+ seg_size = min(seg_size, mss);
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (mptcp_fragment(meta_sk, skb, seg_size,
-+ GFP_ATOMIC, 0))
-+ return -1;
-+ } else if (!tcp_skb_pcount(skb)) {
-+ /* see mptcp_write_xmit on why we use UINT_MAX */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+ }
-+
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (!mptcp_skb_entail(subsk, skb, 0))
-+ return -1;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+
-+ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+ } else {
-+window_probe:
-+ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
-+ meta_tp->snd_una + 0xFFFF)) {
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send_ack(sk_it))
-+ tcp_xmit_probe_skb(sk_it, 1);
-+ }
-+ }
-+
-+ /* At least one of the tcp_xmit_probe_skb's has to succeed */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ int ret;
-+
-+ if (!mptcp_sk_can_send_ack(sk_it))
-+ continue;
-+
-+ ret = tcp_xmit_probe_skb(sk_it, 0);
-+ if (unlikely(ret > 0))
-+ ans = ret;
-+ }
-+ return ans;
-+ }
-+}
-+
-+bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
-+ struct sock *subsk = NULL;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ unsigned int sent_pkts;
-+ int reinject = 0;
-+ unsigned int sublimit;
-+
-+ sent_pkts = 0;
-+
-+ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
-+ &sublimit))) {
-+ unsigned int limit;
-+
-+ subtp = tcp_sk(subsk);
-+ mss_now = tcp_current_mss(subsk);
-+
-+ if (reinject == 1) {
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ /* Segment already reached the peer, take the next one */
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+ }
-+
-+ /* If the segment was cloned (e.g. a meta retransmission),
-+ * the header must be expanded/copied so that there is no
-+ * corruption of TSO information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ break;
-+
-+ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
-+ break;
-+
-+ /* Force tso_segs to 1 by using UINT_MAX.
-+ * We actually don't care about the exact number of segments
-+ * emitted on the subflow. We need just to set tso_segs, because
-+ * we still need an accurate packets_out count in
-+ * tcp_event_new_data_sent.
-+ */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+
-+		/* Check for nagle, regardless of tso_segs. If the segment is
-+ * actually larger than mss_now (TSO segment), then
-+ * tcp_nagle_check will have partial == false and always trigger
-+ * the transmission.
-+ * tcp_write_xmit has a TSO-level nagle check which is not
-+ * subject to the MPTCP-level. It is based on the properties of
-+ * the subflow, not the MPTCP-level.
-+ */
-+ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
-+ (tcp_skb_is_last(meta_sk, skb) ?
-+ nonagle : TCP_NAGLE_PUSH))))
-+ break;
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ /* We limit the size of the skb so that it fits into the
-+ * window. Call tcp_mss_split_point to avoid duplicating
-+ * code.
-+ * We really only care about fitting the skb into the
-+ * window. That's why we use UINT_MAX. If the skb does
-+ * not fit into the cwnd_quota or the NIC's max-segs
-+ * limitation, it will be split by the subflow's
-+ * tcp_write_xmit which does the appropriate call to
-+ * tcp_mss_split_point.
-+ */
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ nonagle);
-+
-+ if (sublimit)
-+ limit = min(limit, sublimit);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
-+ break;
-+
-+ if (!mptcp_skb_entail(subsk, skb, reinject))
-+ break;
-+ /* Nagle is handled at the MPTCP-layer, so
-+ * always push on the subflow
-+ */
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ if (!reinject) {
-+ mptcp_check_sndseq_wrap(meta_tp,
-+ TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+ }
-+
-+ tcp_minshall_update(meta_tp, mss_now, skb);
-+ sent_pkts += tcp_skb_pcount(skb);
-+
-+ if (reinject > 0) {
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ kfree_skb(skb);
-+ }
-+
-+ if (push_one)
-+ break;
-+ }
-+
-+ return !meta_tp->packets_out && tcp_send_head(meta_sk);
-+}
-+
-+void mptcp_write_space(struct sock *sk)
-+{
-+ mptcp_push_pending_frames(mptcp_meta_sk(sk));
-+}
-+
-+u32 __mptcp_select_window(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ int mss, free_space, full_space, window;
-+
-+ /* MSS for the peer's data. Previous versions used mss_clamp
-+ * here. I don't know if the value based on our guesses
-+ * of peer's MSS is better for the performance. It's more correct
-+ * but may be worse for the performance because of rcv_mss
-+ * fluctuations. --SAW 1998/11/1
-+ */
-+ mss = icsk->icsk_ack.rcv_mss;
-+ free_space = tcp_space(sk);
-+ full_space = min_t(int, meta_tp->window_clamp,
-+ tcp_full_space(sk));
-+
-+ if (mss > full_space)
-+ mss = full_space;
-+
-+ if (free_space < (full_space >> 1)) {
-+ icsk->icsk_ack.quick = 0;
-+
-+ if (tcp_memory_pressure)
-+ /* TODO this has to be adapted when we support different
-+ * MSS's among the subflows.
-+ */
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
-+ 4U * meta_tp->advmss);
-+
-+ if (free_space < mss)
-+ return 0;
-+ }
-+
-+ if (free_space > meta_tp->rcv_ssthresh)
-+ free_space = meta_tp->rcv_ssthresh;
-+
-+ /* Don't do rounding if we are using window scaling, since the
-+ * scaled window will not line up with the MSS boundary anyway.
-+ */
-+ window = meta_tp->rcv_wnd;
-+ if (tp->rx_opt.rcv_wscale) {
-+ window = free_space;
-+
-+ /* Advertise enough space so that it won't get scaled away.
-+		 * Important case: prevent zero window announcement if
-+ * 1<<rcv_wscale > mss.
-+ */
-+ if (((window >> tp->rx_opt.rcv_wscale) << tp->
-+ rx_opt.rcv_wscale) != window)
-+ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
-+ << tp->rx_opt.rcv_wscale);
-+ } else {
-+ /* Get the largest window that is a nice multiple of mss.
-+ * Window clamp already applied above.
-+ * If our current window offering is within 1 mss of the
-+ * free space we just keep it. This prevents the divide
-+ * and multiply from happening most of the time.
-+ * We also don't do any window rounding when the free space
-+ * is too small.
-+ */
-+ if (window <= free_space - mss || window > free_space)
-+ window = (free_space / mss) * mss;
-+ else if (mss == full_space &&
-+ free_space > window + (full_space >> 1))
-+ window = free_space;
-+ }
-+
-+ return window;
-+}
-+
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+
-+ opts->options |= OPTION_MPTCP;
-+ if (is_master_tp(tp)) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ opts->mp_capable.sender_key = tp->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum;
-+ } else {
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
-+ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
-+ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
-+ opts->addr_id = tp->mptcp->loc_id;
-+ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
-+ }
-+}
-+
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts, unsigned *remaining)
-+{
-+ struct mptcp_request_sock *mtreq;
-+ mtreq = mptcp_rsk(req);
-+
-+ opts->options |= OPTION_MPTCP;
-+ /* MPCB not yet set - thus it's a new MPTCP-session */
-+ if (!mtreq->is_sub) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
-+ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ } else {
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
-+ opts->mp_join_syns.sender_truncated_mac =
-+ mtreq->mptcp_hash_tmac;
-+ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
-+ opts->mp_join_syns.low_prio = mtreq->low_prio;
-+ opts->addr_id = mtreq->loc_id;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
-+ }
-+}
-+
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
-+
-+ /* We are coming from tcp_current_mss with the meta_sk as an argument.
-+ * It does not make sense to check for the options, because when the
-+ * segment gets sent, another subflow will be chosen.
-+ */
-+ if (!skb && is_meta_sk(sk))
-+ return;
-+
-+ /* In fallback mp_fail-mode, we have to repeat it until the fallback
-+ * has been done by the sender
-+ */
-+ if (unlikely(tp->mptcp->send_mp_fail)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FAIL;
-+ *size += MPTCP_SUB_LEN_FAIL;
-+ return;
-+ }
-+
-+ if (unlikely(tp->send_mp_fclose)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FCLOSE;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
-+ return;
-+ }
-+
-+ /* 1. If we are the sender of the infinite-mapping, we need the
-+ * MPTCPHDR_INF-flag, because a retransmission of the
-+	 * infinite-announcement still needs the mptcp-option.
-+ *
-+ * We need infinite_cutoff_seq, because retransmissions from before
-+ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
-+ * consistent.
-+ *
-+ * 2. If we are the receiver of the infinite-mapping, we always skip
-+ * mptcp-options, because acknowledgments from before the
-+ * infinite-mapping point have already been sent out.
-+ *
-+ * I know, the whole infinite-mapping stuff is ugly...
-+ *
-+ * TODO: Handle wrapped data-sequence numbers
-+ * (even if it's very unlikely)
-+ */
-+ if (unlikely(mpcb->infinite_mapping_snd) &&
-+ ((mpcb->send_infinite_mapping && tcb &&
-+ mptcp_is_data_seq(skb) &&
-+ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
-+ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
-+ !mpcb->send_infinite_mapping))
-+ return;
-+
-+ if (unlikely(tp->mptcp->include_mpc)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_CAPABLE |
-+ OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
-+ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ opts->dss_csum = mpcb->dss_csum;
-+
-+ if (skb)
-+ tp->mptcp->include_mpc = 0;
-+ }
-+ if (unlikely(tp->mptcp->pre_established)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
-+ }
-+
-+ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_DATA_ACK;
-+ /* If !skb, we come from tcp_current_mss and thus we always
-+ * assume that the DSS-option will be set for the data-packet.
-+ */
-+ if (skb && !mptcp_is_data_seq(skb)) {
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN;
-+ } else {
-+			/* It doesn't matter whether the csum is included or
-+			 * not; the length will be either 10 or 12, and thus
-+			 * aligned = 12.
-+ */
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+ }
-+
-+ *size += MPTCP_SUB_LEN_DSS_ALIGN;
-+ }
-+
-+ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
-+ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
-+
-+ if (unlikely(tp->mptcp->send_mp_prio) &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_PRIO;
-+ if (skb)
-+ tp->mptcp->send_mp_prio = 0;
-+ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
-+ }
-+
-+ return;
-+}
-+
-+u16 mptcp_select_window(struct sock *sk)
-+{
-+ u16 new_win = tcp_select_window(sk);
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
-+
-+ meta_tp->rcv_wnd = tp->rcv_wnd;
-+ meta_tp->rcv_wup = meta_tp->rcv_nxt;
-+
-+ return new_win;
-+}
-+
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
-+ struct mp_capable *mpc = (struct mp_capable *)ptr;
-+
-+ mpc->kind = TCPOPT_MPTCP;
-+
-+ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
-+ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->receiver_key = opts->mp_capable.receiver_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
-+ }
-+
-+ mpc->sub = MPTCP_SUB_CAPABLE;
-+ mpc->ver = 0;
-+ mpc->a = opts->dss_csum;
-+ mpc->b = 0;
-+ mpc->rsv = 0;
-+ mpc->h = 1;
-+ }
-+
-+ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
-+ struct mp_join *mpj = (struct mp_join *)ptr;
-+
-+ mpj->kind = TCPOPT_MPTCP;
-+ mpj->sub = MPTCP_SUB_JOIN;
-+ mpj->rsv = 0;
-+
-+ if (OPTION_TYPE_SYN & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
-+ mpj->u.syn.token = opts->mp_join_syns.token;
-+ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
-+ mpj->u.synack.mac =
-+ opts->mp_join_syns.sender_truncated_mac;
-+ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
-+ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
-+ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
-+ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ mpadd->kind = TCPOPT_MPTCP;
-+ if (opts->add_addr_v4) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 4;
-+ mpadd->addr_id = opts->add_addr4.addr_id;
-+ mpadd->u.v4.addr = opts->add_addr4.addr;
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
-+ } else if (opts->add_addr_v6) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 6;
-+ mpadd->addr_id = opts->add_addr6.addr_id;
-+ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
-+ sizeof(mpadd->u.v6.addr));
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ u8 *addrs_id;
-+ int id, len, len_align;
-+
-+ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
-+ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
-+
-+ mprem->kind = TCPOPT_MPTCP;
-+ mprem->len = len;
-+ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
-+ mprem->rsv = 0;
-+ addrs_id = &mprem->addrs_id;
-+
-+ mptcp_for_each_bit_set(opts->remove_addrs, id)
-+ *(addrs_id++) = id;
-+
-+ /* Fill the rest with NOP's */
-+ if (len_align > len) {
-+ int i;
-+ for (i = 0; i < len_align - len; i++)
-+ *(addrs_id++) = TCPOPT_NOP;
-+ }
-+
-+ ptr += len_align >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
-+ struct mp_fail *mpfail = (struct mp_fail *)ptr;
-+
-+ mpfail->kind = TCPOPT_MPTCP;
-+ mpfail->len = MPTCP_SUB_LEN_FAIL;
-+ mpfail->sub = MPTCP_SUB_FAIL;
-+ mpfail->rsv1 = 0;
-+ mpfail->rsv2 = 0;
-+ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
-+
-+ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
-+ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
-+
-+ mpfclose->kind = TCPOPT_MPTCP;
-+ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
-+ mpfclose->sub = MPTCP_SUB_FCLOSE;
-+ mpfclose->rsv1 = 0;
-+ mpfclose->rsv2 = 0;
-+ mpfclose->key = opts->mp_capable.receiver_key;
-+
-+ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
-+ }
-+
-+ if (OPTION_DATA_ACK & opts->mptcp_options) {
-+ if (!mptcp_is_data_seq(skb))
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ else
-+ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
-+ }
-+ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
-+ struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ mpprio->kind = TCPOPT_MPTCP;
-+ mpprio->len = MPTCP_SUB_LEN_PRIO;
-+ mpprio->sub = MPTCP_SUB_PRIO;
-+ mpprio->rsv = 0;
-+ mpprio->b = tp->mptcp->low_prio;
-+ mpprio->addr_id = TCPOPT_NOP;
-+
-+ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
-+ }
-+}
-+
-+/* Sends the DATA_FIN */
-+void mptcp_send_fin(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
-+ int mss_now;
-+
-+ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-+ meta_tp->mpcb->passive_close = 1;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = mptcp_current_mss(meta_sk);
-+
-+ if (tcp_send_head(meta_sk) != NULL) {
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ meta_tp->write_seq++;
-+ } else {
-+ /* Socket is locked, keep trying until memory is available. */
-+ for (;;) {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER,
-+ meta_sk->sk_allocation);
-+ if (skb)
-+ break;
-+ yield();
-+ }
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+
-+ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
-+ TCP_SKB_CB(skb)->end_seq++;
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ tcp_queue_skb(meta_sk, skb);
-+ }
-+ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
-+}
-+
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
-+
-+ if (!mpcb->cnt_subflows)
-+ return;
-+
-+ WARN_ON(meta_tp->send_mp_fclose);
-+
-+ /* First - select a socket */
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ /* May happen if no subflow is in an appropriate state */
-+ if (!sk)
-+ return;
-+
-+ /* We are in infinite mode - just send a reset */
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
-+ sk->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_send_active_reset(sk, priority);
-+ mptcp_sub_force_close(sk);
-+ return;
-+ }
-+
-+
-+ tcp_sk(sk)->send_mp_fclose = 1;
-+ /* Reset all other subflows */
-+
-+ /* tcp_done must be handled with bh disabled */
-+ if (!in_serving_softirq())
-+ local_bh_disable();
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_send_active_reset(sk_it, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+
-+ if (!in_serving_softirq())
-+ local_bh_enable();
-+
-+ tcp_send_ack(sk);
-+ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
-+
-+ meta_tp->send_mp_fclose = 1;
-+}
-+
-+static void mptcp_ack_retransmit_timer(struct sock *sk)
-+{
-+ struct sk_buff *skb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
-+ goto out; /* Routing failure or similar */
-+
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk)) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+ goto out;
-+ }
-+
-+ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (skb == NULL) {
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+ /* Reserve space for headers and prepare control bits */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
-+
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!icsk->icsk_retransmits)
-+ icsk->icsk_retransmits = 1;
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+
-+ icsk->icsk_retransmits++;
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
-+ __sk_dst_reset(sk);
-+
-+out:;
-+}
-+
-+void mptcp_ack_handler(unsigned long data)
-+{
-+ struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later */
-+ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
-+ jiffies + (HZ / 20));
-+ goto out_unlock;
-+ }
-+
-+ if (sk->sk_state == TCP_CLOSE)
-+ goto out_unlock;
-+ if (!tcp_sk(sk)->mptcp->pre_established)
-+ goto out_unlock;
-+
-+ mptcp_ack_retransmit_timer(sk);
-+
-+ sk_mem_reclaim(sk);
-+
-+out_unlock:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(sk);
-+}
-+
-+/* Similar to tcp_retransmit_skb
-+ *
-+ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
-+ * meta-level.
-+ */
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *subsk;
-+ unsigned int limit, mss_now;
-+ int err = -1;
-+
-+ /* Do not send more than we have queued. 1/4 is reserved for possible
-+ * copying overhead: fragmentation, tunneling, mangling etc.
-+ *
-+ * This is a meta-retransmission thus we check on the meta-socket.
-+ */
-+ if (atomic_read(&meta_sk->sk_wmem_alloc) >
-+ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
-+ return -EAGAIN;
-+ }
-+
-+ /* We need to make sure that the retransmitted segment can be sent on a
-+ * subflow right now. If it is too big, it needs to be fragmented.
-+ */
-+ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
-+ if (!subsk) {
-+ /* We want to increase icsk_retransmits, thus return 0, so that
-+ * mptcp_retransmit_timer enters the desired branch.
-+ */
-+ err = 0;
-+ goto failed;
-+ }
-+ mss_now = tcp_current_mss(subsk);
-+
-+ /* If the segment was cloned (e.g. a meta retransmission), the header
-+ * must be expanded/copied so that there is no corruption of TSO
-+ * information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC)) {
-+ err = -ENOMEM;
-+ goto failed;
-+ }
-+
-+ /* Must have been set by mptcp_write_xmit before */
-+ BUG_ON(!tcp_skb_pcount(skb));
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ TCP_NAGLE_OFF);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit,
-+ GFP_ATOMIC, 0)))
-+ goto failed;
-+
-+ if (!mptcp_skb_entail(subsk, skb, -1))
-+ goto failed;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ /* Update global TCP statistics. */
-+ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
-+
-+ /* Diff to tcp_retransmit_skb */
-+
-+ /* Save stamp of the first retransmit. */
-+ if (!meta_tp->retrans_stamp)
-+ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
-+
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+
-+failed:
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
-+ return err;
-+}
-+
-+/* Similar to tcp_retransmit_timer
-+ *
-+ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
-+ * and that we don't have an srtt estimation at the meta-level.
-+ */
-+void mptcp_retransmit_timer(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ int err;
-+
-+ /* In fallback, retransmission is handled at the subflow-level */
-+ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping)
-+ return;
-+
-+ WARN_ON(tcp_write_queue_empty(meta_sk));
-+
-+ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
-+ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
-+ /* Receiver dastardly shrinks window. Our retransmits
-+ * become zero probes, but we should not timeout this
-+ * connection. If the socket is an orphan, time it out,
-+ * we cannot allow such beasts to hang infinitely.
-+ */
-+ struct inet_sock *meta_inet = inet_sk(meta_sk);
-+ if (meta_sk->sk_family == AF_INET) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_inet->inet_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (meta_sk->sk_family == AF_INET6) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_sk->sk_v6_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#endif
-+ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
-+ tcp_write_err(meta_sk);
-+ return;
-+ }
-+
-+ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ goto out_reset_timer;
-+ }
-+
-+ if (tcp_write_timeout(meta_sk))
-+ return;
-+
-+ if (meta_icsk->icsk_retransmits == 0)
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
-+
-+ meta_icsk->icsk_ca_state = TCP_CA_Loss;
-+
-+ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ if (err > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!meta_icsk->icsk_retransmits)
-+ meta_icsk->icsk_retransmits = 1;
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
-+ TCP_RTO_MAX);
-+ return;
-+ }
-+
-+ /* Increase the timeout each time we retransmit. Note that
-+ * we do not increase the rtt estimate. rto is initialized
-+ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
-+ * that doubling rto each time is the least we can get away with.
-+ * In KA9Q, Karn uses this for the first few times, and then
-+ * goes to quadratic. netBSD doubles, but only goes up to *64,
-+ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
-+ * defined in the protocol as the maximum possible RTT. I guess
-+ * we'll have to use something other than TCP to talk to the
-+ * University of Mars.
-+ *
-+ * PAWS allows us longer timeouts and large windows, so once
-+ * implemented ftp to mars will work nicely. We will have to fix
-+ * the 120 second clamps though!
-+ */
-+ meta_icsk->icsk_backoff++;
-+ meta_icsk->icsk_retransmits++;
-+
-+out_reset_timer:
-+ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
-+ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
-+ * might be increased if the stream oscillates between thin and thick,
-+ * thus the old value might already be too high compared to the value
-+ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
-+ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
-+ * exponential backoff behaviour to avoid continuing to hammer
-+ * linear-timeout retransmissions into a black hole
-+ */
-+ if (meta_sk->sk_state == TCP_ESTABLISHED &&
-+ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
-+ tcp_stream_is_thin(meta_tp) &&
-+ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
-+ meta_icsk->icsk_backoff = 0;
-+ /* We cannot do the same as in tcp_write_timer because the
-+ * srtt is not set here.
-+ */
-+ mptcp_set_rto(meta_sk);
-+ } else {
-+ /* Use normal (exponential) backoff */
-+ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ }
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
-+
-+ return;
-+}
-+
-+/* Modify values to an mptcp-level for the initial window of new subflows */
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ *window_clamp = mpcb->orig_window_clamp;
-+ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
-+
-+ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
-+ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
-+}
-+
-+static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ struct sock *sk;
-+ u64 rate = 0;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ /* Do not consider subflows without an RTT estimation yet
-+ * otherwise this_rate >>> rate.
-+ */
-+ if (unlikely(!tp->srtt_us))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* If this_mss is smaller than mss, it means that a segment will
-+ * be split in two (or more) when pushed on this subflow. If
-+ * you consider that mss = 1428 and this_mss = 1420 then two
-+ * segments will be generated: a 1420-byte and 8-byte segment.
-+ * The latter will introduce a large overhead as for a single
-+ * data segment 2 slots will be used in the congestion window.
-+ * Therefore reducing by ~2 the potential throughput of this
-+ * subflow. Indeed, 1428 will be sent while 2840 could have been
-+ * sent if mss == 1420 reducing the throughput by 2840 / 1428.
-+ *
-+ * The following algorithm takes into account this overhead
-+ * when computing the potential throughput that MPTCP can
-+ * achieve when generating mss-byte segments.
-+ *
-+ * The formula is the following:
-+ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
-+ * Where ratio is computed as follows:
-+ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
-+ *
-+ * ratio gives the reduction factor of the theoretical
-+ * throughput a subflow can achieve if MPTCP uses a specific
-+ * MSS value.
-+ */
-+ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
-+ max(tp->snd_cwnd, tp->packets_out),
-+ (u64)tp->srtt_us *
-+ DIV_ROUND_UP(mss, this_mss) * this_mss);
-+ rate += this_rate;
-+ }
-+
-+ return rate;
-+}
-+
-+static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ unsigned int mss = 0;
-+ u64 rate = 0;
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* Same mss values will produce the same throughput. */
-+ if (this_mss == mss)
-+ continue;
-+
-+ /* See whether using this mss value can theoretically improve
-+ * the performance.
-+ */
-+ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
-+ if (this_rate >= rate) {
-+ mss = this_mss;
-+ rate = this_rate;
-+ }
-+ }
-+
-+ return mss;
-+}
-+
-+unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
-+
-+ /* If no subflow is available, we take a default-mss from the
-+ * meta-socket.
-+ */
-+ return !mss ? tcp_current_mss(meta_sk) : mss;
-+}
-+
-+static unsigned int mptcp_select_size_mss(struct sock *sk)
-+{
-+ return tcp_sk(sk)->mss_cache;
-+}
-+
-+int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
-+
-+ if (sg) {
-+ if (mptcp_sk_can_gso(meta_sk)) {
-+ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
-+ } else {
-+ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
-+
-+ if (mss >= pgbreak &&
-+ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
-+ mss = pgbreak;
-+ }
-+ }
-+
-+ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
-+}
-+
-+int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ const struct sock *sk;
-+ u32 rtt_max = tp->srtt_us;
-+ u64 bw_est;
-+
-+ if (!tp->srtt_us)
-+ return tp->reordering + 1;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->srtt_us)
-+ rtt_max = tcp_sk(sk)->srtt_us;
-+ }
-+
-+ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
-+ (u64)tp->srtt_us);
-+
-+ return max_t(unsigned int, (u32)(bw_est >> 16),
-+ tp->reordering + 1);
-+}
-+
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed)
-+{
-+ struct sock *sk;
-+ u32 xmit_size_goal = 0;
-+
-+ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_size_goal;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
-+ if (this_size_goal > xmit_size_goal)
-+ xmit_size_goal = this_size_goal;
-+ }
-+ }
-+
-+ return max(xmit_size_goal, mss_now);
-+}
-+
-+/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ if (skb_cloned(skb)) {
-+ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
-+ return -ENOMEM;
-+ }
-+
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+
-+ skb->truesize -= len;
-+ sk->sk_wmem_queued -= len;
-+ sk_mem_uncharge(sk, len);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-+
-+ return 0;
-+}
-diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
-new file mode 100644
-index 000000000000..9542f950729f
---- /dev/null
-+++ b/net/mptcp/mptcp_pm.c
-@@ -0,0 +1,169 @@
-+/*
-+ * MPTCP implementation - MPTCP-subflow-management
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_pm_list_lock);
-+static LIST_HEAD(mptcp_pm_list);
-+
-+static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+struct mptcp_pm_ops mptcp_pm_default = {
-+ .get_local_id = mptcp_default_id, /* We do not care */
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
-+{
-+ struct mptcp_pm_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ int ret = 0;
-+
-+ if (!pm->get_local_id)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ if (mptcp_pm_find(pm->name)) {
-+ pr_notice("%s already registered\n", pm->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
-+ pr_info("%s registered\n", pm->name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
-+
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ spin_lock(&mptcp_pm_list_lock);
-+ list_del_rcu(&pm->list);
-+ spin_unlock(&mptcp_pm_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
-+
-+void mptcp_get_default_path_manager(char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ BUG_ON(list_empty(&mptcp_pm_list));
-+
-+ rcu_read_lock();
-+ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
-+ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_path_manager(const char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!pm && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+ }
-+#endif
-+
-+ if (pm) {
-+ list_move(&pm->list, &mptcp_pm_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
-+ if (try_module_get(pm->owner)) {
-+ mpcb->pm_ops = pm;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->pm_ops->owner);
-+}
-+
-+/* Fallback to the default path-manager. */
-+void mptcp_fallback_default(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ mptcp_cleanup_path_manager(mpcb);
-+ pm = mptcp_pm_find("default");
-+
-+ /* Cannot fail - it's the default module */
-+ try_module_get(pm->owner);
-+ mpcb->pm_ops = pm;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_fallback_default);
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_path_manager_default(void)
-+{
-+ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
-+}
-+late_initcall(mptcp_path_manager_default);
-diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
-new file mode 100644
-index 000000000000..93278f684069
---- /dev/null
-+++ b/net/mptcp/mptcp_rr.c
-@@ -0,0 +1,301 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static unsigned char num_segments __read_mostly = 1;
-+module_param(num_segments, byte, 0644);
-+MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
-+
-+static bool cwnd_limited __read_mostly = 1;
-+module_param(cwnd_limited, bool, 0644);
-+MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
-+
-+struct rrsched_priv {
-+ unsigned char quota;
-+};
-+
-+static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test, bool cwnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ if (!cwnd_test)
-+ goto zero_wnd_test;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+zero_wnd_test:
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* We just look for any subflow that is available */
-+static struct sock *rr_get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ continue;
-+
-+ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ bestsk = sk;
-+ }
-+
-+ if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb)
-+ *reinject = 1;
-+ else
-+ skb = tcp_send_head(meta_sk);
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk_it, *choose_sk = NULL;
-+ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
-+ unsigned char split = num_segments;
-+ unsigned char iter = 0, full_subs = 0;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ if (*reinject) {
-+ *subsk = rr_get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ return skb;
-+ }
-+
-+retry:
-+
-+ /* First, we look for a subflow that is currently being used */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ iter++;
-+
-+ /* Is this subflow currently being used? */
-+ if (rsp->quota > 0 && rsp->quota < num_segments) {
-+ split = num_segments - rsp->quota;
-+ choose_sk = sk_it;
-+ goto found;
-+ }
-+
-+ /* Or, it's totally unused */
-+ if (!rsp->quota) {
-+ split = num_segments;
-+ choose_sk = sk_it;
-+ }
-+
-+ /* Or, it must then be fully used */
-+ if (rsp->quota == num_segments)
-+ full_subs++;
-+ }
-+
-+ /* All considered subflows have a full quota, and we considered at
-+ * least one.
-+ */
-+ if (iter && iter == full_subs) {
-+ /* So, we restart this round by setting quota to 0 and retry
-+ * to find a subflow.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ rsp->quota = 0;
-+ }
-+
-+ goto retry;
-+ }
-+
-+found:
-+ if (choose_sk) {
-+ unsigned int mss_now;
-+ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
-+ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
-+
-+ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
-+ return NULL;
-+
-+ *subsk = choose_sk;
-+ mss_now = tcp_current_mss(*subsk);
-+ *limit = split * mss_now;
-+
-+ if (skb->len > mss_now)
-+ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
-+ else
-+ rsp->quota++;
-+
-+ return skb;
-+ }
-+
-+ return NULL;
-+}
-+
-+static struct mptcp_sched_ops mptcp_sched_rr = {
-+ .get_subflow = rr_get_available_subflow,
-+ .next_segment = mptcp_rr_next_segment,
-+ .name = "roundrobin",
-+ .owner = THIS_MODULE,
-+};
-+
-+static int __init rr_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_rr))
-+ return -1;
-+
-+ return 0;
-+}
-+
-+static void rr_unregister(void)
-+{
-+ mptcp_unregister_scheduler(&mptcp_sched_rr);
-+}
-+
-+module_init(rr_register);
-+module_exit(rr_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
-+MODULE_VERSION("0.89");
-diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
-new file mode 100644
-index 000000000000..6c7ff4eceac1
---- /dev/null
-+++ b/net/mptcp/mptcp_sched.c
-@@ -0,0 +1,493 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_sched_list_lock);
-+static LIST_HEAD(mptcp_sched_list);
-+
-+struct defsched_priv {
-+ u32 last_rbuf_opti;
-+};
-+
-+static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int mss_now, space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ /* If TSQ is already throttling us, do not send on this subflow. When
-+ * TSQ gets cleared the subflow becomes eligible again.
-+ */
-+ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
-+ return false;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ mss_now = tcp_current_mss(sk);
-+
-+ /* Don't send on this subflow if we bypass the allowed send-window at
-+ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
-+ * calculated end_seq (because here at this point end_seq is still at
-+ * the meta-level).
-+ */
-+ if (skb && !zero_wnd_test &&
-+ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* This is the scheduler. This function decides on which flow to send
-+ * a given MSS. If all subflows are found to be busy, NULL is returned.
-+ * The flow is selected based on the shortest RTT.
-+ * If all paths have full congestion windows, we simply return NULL.
-+ *
-+ * Additionally, this function is aware of the backup-subflows.
-+ */
-+static struct sock *get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
-+ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
-+ int cnt_backups = 0;
-+
-+ /* if there is only one subflow, bypass the scheduling function */
-+ if (mpcb->cnt_subflows == 1) {
-+ bestsk = (struct sock *)mpcb->connection_list;
-+ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
-+ bestsk = NULL;
-+ return bestsk;
-+ }
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_is_available(sk, skb, zero_wnd_test))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < lowprio_min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ lowprio_min_time_to_peer = tp->srtt_us;
-+ lowpriosk = sk;
-+ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ min_time_to_peer = tp->srtt_us;
-+ bestsk = sk;
-+ }
-+ }
-+
-+ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
-+ sk = lowpriosk;
-+ } else if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
-+{
-+ struct sock *meta_sk;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp_it;
-+ struct sk_buff *skb_head;
-+ struct defsched_priv *dsp = defsched_get_priv(tp);
-+
-+ if (tp->mpcb->cnt_subflows == 1)
-+ return NULL;
-+
-+ meta_sk = mptcp_meta_sk(sk);
-+ skb_head = tcp_write_queue_head(meta_sk);
-+
-+ if (!skb_head || skb_head == tcp_send_head(meta_sk))
-+ return NULL;
-+
-+ /* If penalization is optional (coming from mptcp_next_segment()) and
-+ * we are not send-buffer-limited, we do not penalize. The retransmission
-+ * is just an optimization to fix the idle-time due to the delay before
-+ * we wake up the application.
-+ */
-+ if (!penal && sk_stream_memory_free(meta_sk))
-+ goto retrans;
-+
-+ /* Only penalize again after an RTT has elapsed */
-+ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
-+ goto retrans;
-+
-+ /* Half the cwnd of the slow flow */
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
-+ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
-+ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
-+ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+ }
-+ break;
-+ }
-+ }
-+
-+retrans:
-+
-+ /* Segment not yet injected into this path? Take it!!! */
-+ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
-+ bool do_retrans = false;
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp_it->snd_cwnd <= 4) {
-+ do_retrans = true;
-+ break;
-+ }
-+
-+ if (4 * tp->srtt_us >= tp_it->srtt_us) {
-+ do_retrans = false;
-+ break;
-+ } else {
-+ do_retrans = true;
-+ }
-+ }
-+ }
-+
-+ if (do_retrans && mptcp_is_available(sk, skb_head, false))
-+ return skb_head;
-+ }
-+ return NULL;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb) {
-+ *reinject = 1;
-+ } else {
-+ skb = tcp_send_head(meta_sk);
-+
-+ if (!skb && meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
-+ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
-+ struct sock *subsk = get_available_subflow(meta_sk, NULL,
-+ false);
-+ if (!subsk)
-+ return NULL;
-+
-+ skb = mptcp_rcv_buf_optimization(subsk, 0);
-+ if (skb)
-+ *reinject = -1;
-+ }
-+ }
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
-+ unsigned int mss_now;
-+ struct tcp_sock *subtp;
-+ u16 gso_max_segs;
-+ u32 max_len, max_segs, window, needed;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ *subsk = get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ subtp = tcp_sk(*subsk);
-+ mss_now = tcp_current_mss(*subsk);
-+
-+ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
-+ skb = mptcp_rcv_buf_optimization(*subsk, 1);
-+ if (skb)
-+ *reinject = -1;
-+ else
-+ return NULL;
-+ }
-+
-+ /* No splitting required, as we will only send one single segment */
-+ if (skb->len <= mss_now)
-+ return skb;
-+
-+ /* The following is similar to tcp_mss_split_point, but
-+ * we do not care about Nagle, because we will anyway
-+ * use TCP_NAGLE_PUSH, which overrides this.
-+ *
-+ * So, we first limit according to the cwnd/gso-size and then according
-+ * to the subflow's window.
-+ */
-+
-+ gso_max_segs = (*subsk)->sk_gso_max_segs;
-+ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
-+ gso_max_segs = 1;
-+ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
-+ if (!max_segs)
-+ return NULL;
-+
-+ max_len = mss_now * max_segs;
-+ window = tcp_wnd_end(subtp) - subtp->write_seq;
-+
-+ needed = min(skb->len, window);
-+ if (max_len <= skb->len)
-+ /* Take max_win, which is actually the cwnd/gso-size */
-+ *limit = max_len;
-+ else
-+ /* Or, take the window */
-+ *limit = needed;
-+
-+ return skb;
-+}
-+
-+static void defsched_init(struct sock *sk)
-+{
-+ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+}
-+
-+struct mptcp_sched_ops mptcp_sched_default = {
-+ .get_subflow = get_available_subflow,
-+ .next_segment = mptcp_next_segment,
-+ .init = defsched_init,
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
-+{
-+ struct mptcp_sched_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ int ret = 0;
-+
-+ if (!sched->get_subflow || !sched->next_segment)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ if (mptcp_sched_find(sched->name)) {
-+ pr_notice("%s already registered\n", sched->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
-+ pr_info("%s registered\n", sched->name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
-+
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ spin_lock(&mptcp_sched_list_lock);
-+ list_del_rcu(&sched->list);
-+ spin_unlock(&mptcp_sched_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
-+
-+void mptcp_get_default_scheduler(char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ BUG_ON(list_empty(&mptcp_sched_list));
-+
-+ rcu_read_lock();
-+ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
-+ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_scheduler(const char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!sched && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+ }
-+#endif
-+
-+ if (sched) {
-+ list_move(&sched->list, &mptcp_sched_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
-+ if (try_module_get(sched->owner)) {
-+ mpcb->sched_ops = sched;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->sched_ops->owner);
-+}
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_scheduler_default(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
-+}
-+late_initcall(mptcp_scheduler_default);
-diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
-new file mode 100644
-index 000000000000..29ca1d868d17
---- /dev/null
-+++ b/net/mptcp/mptcp_wvegas.c
-@@ -0,0 +1,268 @@
-+/*
-+ * MPTCP implementation - WEIGHTED VEGAS
-+ *
-+ * Algorithm design:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
-+ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
-+ *
-+ * Implementation:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <linux/module.h>
-+#include <linux/tcp.h>
-+
-+static int initial_alpha = 2;
-+static int total_alpha = 10;
-+static int gamma = 1;
-+
-+module_param(initial_alpha, int, 0644);
-+MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
-+module_param(total_alpha, int, 0644);
-+MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
-+module_param(gamma, int, 0644);
-+MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
-+
-+#define MPTCP_WVEGAS_SCALE 16
-+
-+/* wVegas variables */
-+struct wvegas {
-+ u32 beg_snd_nxt; /* right edge during last RTT */
-+ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
-+
-+ u16 cnt_rtt; /* # of RTTs measured within last RTT */
-+ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
-+ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
-+
-+ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
-+ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
-+ int alpha; /* alpha for each subflow */
-+
-+ u32 queue_delay; /* queue delay */
-+};
-+
-+
-+static inline u64 mptcp_wvegas_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static void wvegas_enable(const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 1;
-+
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+
-+ wvegas->instant_rate = 0;
-+ wvegas->alpha = initial_alpha;
-+ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
-+
-+ wvegas->queue_delay = 0;
-+}
-+
-+static inline void wvegas_disable(const struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 0;
-+}
-+
-+static void mptcp_wvegas_init(struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->base_rtt = 0x7fffffff;
-+ wvegas_enable(sk);
-+}
-+
-+static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
-+{
-+ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
-+}
-+
-+static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ u32 vrtt;
-+
-+ if (rtt_us < 0)
-+ return;
-+
-+ vrtt = rtt_us + 1;
-+
-+ if (vrtt < wvegas->base_rtt)
-+ wvegas->base_rtt = vrtt;
-+
-+ wvegas->sampled_rtt += vrtt;
-+ wvegas->cnt_rtt++;
-+}
-+
-+static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
-+{
-+ if (ca_state == TCP_CA_Open)
-+ wvegas_enable(sk);
-+ else
-+ wvegas_disable(sk);
-+}
-+
-+static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_CWND_RESTART) {
-+ mptcp_wvegas_init(sk);
-+ } else if (event == CA_EVENT_LOSS) {
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ wvegas->instant_rate = 0;
-+ }
-+}
-+
-+static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
-+{
-+ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
-+}
-+
-+static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
-+{
-+ u64 total_rate = 0;
-+ struct sock *sub_sk;
-+ const struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!mpcb)
-+ return wvegas->weight;
-+
-+
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
-+
-+ /* sampled_rtt is initialized to 0 */
-+ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
-+ total_rate += sub_wvegas->instant_rate;
-+ }
-+
-+ if (total_rate && wvegas->instant_rate)
-+ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
-+ else
-+ return wvegas->weight;
-+}
-+
-+static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!wvegas->doing_wvegas_now) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (after(ack, wvegas->beg_snd_nxt)) {
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ if (wvegas->cnt_rtt <= 2) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ } else {
-+ u32 rtt, diff, q_delay;
-+ u64 target_cwnd;
-+
-+ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
-+ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
-+
-+ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
-+
-+ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+
-+ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ } else {
-+ if (diff >= wvegas->alpha) {
-+ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
-+ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
-+ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
-+ }
-+ if (diff > wvegas->alpha) {
-+ tp->snd_cwnd--;
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+ } else if (diff < wvegas->alpha) {
-+ tp->snd_cwnd++;
-+ }
-+
-+ /* Try to drain link queue if needed */
-+ q_delay = rtt - wvegas->base_rtt;
-+ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
-+ wvegas->queue_delay = q_delay;
-+
-+ if (q_delay >= 2 * wvegas->queue_delay) {
-+ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
-+ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
-+ wvegas->queue_delay = 0;
-+ }
-+ }
-+
-+ if (tp->snd_cwnd < 2)
-+ tp->snd_cwnd = 2;
-+ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
-+ tp->snd_cwnd = tp->snd_cwnd_clamp;
-+
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ }
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+ }
-+ /* Use normal slow start */
-+ else if (tp->snd_cwnd <= tp->snd_ssthresh)
-+ tcp_slow_start(tp, acked);
-+}
-+
-+
-+static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
-+ .init = mptcp_wvegas_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_wvegas_cong_avoid,
-+ .pkts_acked = mptcp_wvegas_pkts_acked,
-+ .set_state = mptcp_wvegas_state,
-+ .cwnd_event = mptcp_wvegas_cwnd_event,
-+
-+ .owner = THIS_MODULE,
-+ .name = "wvegas",
-+};
-+
-+static int __init mptcp_wvegas_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
-+ tcp_register_congestion_control(&mptcp_wvegas);
-+ return 0;
-+}
-+
-+static void __exit mptcp_wvegas_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_wvegas);
-+}
-+
-+module_init(mptcp_wvegas_register);
-+module_exit(mptcp_wvegas_unregister);
-+
-+MODULE_AUTHOR("Yu Cao, Enhuan Dong");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP wVegas");
-+MODULE_VERSION("0.1");
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:38 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-06 11:38 UTC (permalink / raw
To: gentoo-commits
commit: f2ea3e49d07e5b148c974633ec003ba2382f1189
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:38:42 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:38:42 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=f2ea3e49
Move multipath to experimental.
---
5010_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 ++++++++++++++++++++++++++
1 file changed, 19230 insertions(+)
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
new file mode 100644
index 0000000..3000da3
--- /dev/null
+++ b/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
@@ -0,0 +1,19230 @@
+diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
+index 768a0fb67dd6..5a46d91a8df9 100644
+--- a/drivers/infiniband/hw/cxgb4/cm.c
++++ b/drivers/infiniband/hw/cxgb4/cm.c
+@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
+ */
+ memset(&tmp_opt, 0, sizeof(tmp_opt));
+ tcp_clear_options(&tmp_opt);
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
+
+ req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
+ memset(req, 0, sizeof(*req));
+diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
+index 2faef339d8f2..d86c853ffaad 100644
+--- a/include/linux/ipv6.h
++++ b/include/linux/ipv6.h
+@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return inet_sk(__sk)->pinet6;
+ }
+
+-static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
+-{
+- struct request_sock *req = reqsk_alloc(ops);
+-
+- if (req)
+- inet_rsk(req)->pktopts = NULL;
+-
+- return req;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return (struct raw6_sock *)sk;
+@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
+ return NULL;
+ }
+
+-static inline struct inet6_request_sock *
+- inet6_rsk(const struct request_sock *rsk)
+-{
+- return NULL;
+-}
+-
+ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
+ {
+ return NULL;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..99ea4b0e3693 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
+ bool zero_okay,
+ __sum16 check)
+ {
+- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
+- skb->csum_valid = 1;
++ if (skb_csum_unnecessary(skb)) {
++ return false;
++ } else if (zero_okay && !check) {
++ skb->ip_summed = CHECKSUM_UNNECESSARY;
+ return false;
+ }
+
+diff --git a/include/linux/tcp.h b/include/linux/tcp.h
+index a0513210798f..7bc2e078d6ca 100644
+--- a/include/linux/tcp.h
++++ b/include/linux/tcp.h
+@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
+ /* TCP Fast Open */
+ #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
+ #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
+-#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
++#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
+
+ /* TCP Fast Open Cookie as stored in memory */
+ struct tcp_fastopen_cookie {
+@@ -72,6 +72,51 @@ struct tcp_sack_block {
+ u32 end_seq;
+ };
+
++struct tcp_out_options {
++ u16 options; /* bit field of OPTION_* */
++ u8 ws; /* window scale, 0 to disable */
++ u8 num_sack_blocks;/* number of SACK blocks to include */
++ u8 hash_size; /* bytes in hash_location */
++ u16 mss; /* 0 to disable */
++ __u8 *hash_location; /* temporary pointer, overloaded */
++ __u32 tsval, tsecr; /* need to include OPTION_TS */
++ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
++#ifdef CONFIG_MPTCP
++ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
++ u8 dss_csum:1,
++ add_addr_v4:1,
++ add_addr_v6:1; /* dss-checksum required? */
++
++ union {
++ struct {
++ __u64 sender_key; /* sender's key for mptcp */
++ __u64 receiver_key; /* receiver's key for mptcp */
++ } mp_capable;
++
++ struct {
++ __u64 sender_truncated_mac;
++ __u32 sender_nonce;
++ /* random number of the sender */
++ __u32 token; /* token for mptcp */
++ u8 low_prio:1;
++ } mp_join_syns;
++ };
++
++ struct {
++ struct in_addr addr;
++ u8 addr_id;
++ } add_addr4;
++
++ struct {
++ struct in6_addr addr;
++ u8 addr_id;
++ } add_addr6;
++
++ u16 remove_addrs; /* list of address id */
++ u8 addr_id; /* address id (mp_join or add_address) */
++#endif /* CONFIG_MPTCP */
++};
++
+ /*These are used to set the sack_ok field in struct tcp_options_received */
+ #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
+ #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
+@@ -95,6 +140,9 @@ struct tcp_options_received {
+ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
+ };
+
++struct mptcp_cb;
++struct mptcp_tcp_sock;
++
+ static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
+ {
+ rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
+@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
+
+ struct tcp_request_sock {
+ struct inet_request_sock req;
+-#ifdef CONFIG_TCP_MD5SIG
+- /* Only used by TCP MD5 Signature so far. */
+ const struct tcp_request_sock_ops *af_specific;
+-#endif
+ struct sock *listener; /* needed for TFO */
+ u32 rcv_isn;
+ u32 snt_isn;
+@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
+ return (struct tcp_request_sock *)req;
+ }
+
++struct tcp_md5sig_key;
++
+ struct tcp_sock {
+ /* inet_connection_sock has to be the first member of tcp_sock */
+ struct inet_connection_sock inet_conn;
+@@ -326,6 +373,37 @@ struct tcp_sock {
+ * socket. Used to retransmit SYNACKs etc.
+ */
+ struct request_sock *fastopen_rsk;
++
++ /* MPTCP/TCP-specific callbacks */
++ const struct tcp_sock_ops *ops;
++
++ struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ /* We keep these flags even if CONFIG_MPTCP is not checked, because
++ * it allows checking MPTCP capability just by checking the mpc flag,
++ * rather than adding ifdefs everywhere.
++ */
++ u16 mpc:1, /* Other end is multipath capable */
++ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
++ send_mp_fclose:1,
++ request_mptcp:1, /* Did we send out an MP_CAPABLE?
++ * (this speeds up mptcp_doit() in tcp_recvmsg)
++ */
++ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
++ pf:1, /* Potentially Failed state: when this flag is set, we
++ * stop using the subflow
++ */
++ mp_killed:1, /* Killed with a tcp_done in mptcp? */
++ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
++ is_master_sk,
++ close_it:1, /* Must close socket in mptcp_data_ready? */
++ closing:1;
++ struct mptcp_tcp_sock *mptcp;
++#ifdef CONFIG_MPTCP
++ struct hlist_nulls_node tk_table;
++ u32 mptcp_loc_token;
++ u64 mptcp_loc_key;
++#endif /* CONFIG_MPTCP */
+ };
+
+ enum tsq_flags {
+@@ -337,6 +415,8 @@ enum tsq_flags {
+ TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
+ * tcp_v{4|6}_mtu_reduced()
+ */
++ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
++ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
+ };
+
+ static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *tw_md5_key;
+ #endif
++ struct mptcp_tw *mptcp_tw;
+ };
+
+ static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
+diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
+index 74af137304be..83f63033897a 100644
+--- a/include/net/inet6_connection_sock.h
++++ b/include/net/inet6_connection_sock.h
+@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
+
+ struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
+ const struct request_sock *req);
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize);
+
+ struct request_sock *inet6_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+diff --git a/include/net/inet_common.h b/include/net/inet_common.h
+index fe7994c48b75..780f229f46a8 100644
+--- a/include/net/inet_common.h
++++ b/include/net/inet_common.h
+@@ -1,6 +1,8 @@
+ #ifndef _INET_COMMON_H
+ #define _INET_COMMON_H
+
++#include <net/sock.h>
++
+ extern const struct proto_ops inet_stream_ops;
+ extern const struct proto_ops inet_dgram_ops;
+
+@@ -13,6 +15,8 @@ struct sock;
+ struct sockaddr;
+ struct socket;
+
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
+ int inet_release(struct socket *sock);
+ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
+ int addr_len, int flags);
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..f62159e39839 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -30,6 +30,7 @@
+
+ struct inet_bind_bucket;
+ struct tcp_congestion_ops;
++struct tcp_options_received;
+
+ /*
+ * Pointers to address related TCP functions
+@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
+
+ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
+
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize);
++
+ struct request_sock *inet_csk_search_req(const struct sock *sk,
+ struct request_sock ***prevp,
+ const __be16 rport,
+diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
+index b1edf17bec01..6a32d8d6b85e 100644
+--- a/include/net/inet_sock.h
++++ b/include/net/inet_sock.h
+@@ -86,10 +86,14 @@ struct inet_request_sock {
+ wscale_ok : 1,
+ ecn_ok : 1,
+ acked : 1,
+- no_srccheck: 1;
++ no_srccheck: 1,
++ mptcp_rqsk : 1,
++ saw_mpc : 1;
+ kmemcheck_bitfield_end(flags);
+- struct ip_options_rcu *opt;
+- struct sk_buff *pktopts;
++ union {
++ struct ip_options_rcu *opt;
++ struct sk_buff *pktopts;
++ };
+ u32 ir_mark;
+ };
+
+diff --git a/include/net/mptcp.h b/include/net/mptcp.h
+new file mode 100644
+index 000000000000..712780fc39e4
+--- /dev/null
++++ b/include/net/mptcp.h
+@@ -0,0 +1,1439 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_H
++#define _MPTCP_H
++
++#include <linux/inetdevice.h>
++#include <linux/ipv6.h>
++#include <linux/list.h>
++#include <linux/net.h>
++#include <linux/netpoll.h>
++#include <linux/skbuff.h>
++#include <linux/socket.h>
++#include <linux/tcp.h>
++#include <linux/kernel.h>
++
++#include <asm/byteorder.h>
++#include <asm/unaligned.h>
++#include <crypto/hash.h>
++#include <net/tcp.h>
++
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ #define ntohll(x) be64_to_cpu(x)
++ #define htonll(x) cpu_to_be64(x)
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ #define ntohll(x) (x)
++ #define htonll(x) (x)
++#endif
++
++struct mptcp_loc4 {
++ u8 loc4_id;
++ u8 low_prio:1;
++ struct in_addr addr;
++};
++
++struct mptcp_rem4 {
++ u8 rem4_id;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct mptcp_loc6 {
++ u8 loc6_id;
++ u8 low_prio:1;
++ struct in6_addr addr;
++};
++
++struct mptcp_rem6 {
++ u8 rem6_id;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_request_sock {
++ struct tcp_request_sock req;
++ /* hlist-nulls entry to the hash-table. Depending on whether this is a
++ * new MPTCP connection or an additional subflow, the request-socket
++ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
++ */
++ struct hlist_nulls_node hash_entry;
++
++ union {
++ struct {
++ /* Only on initial subflows */
++ u64 mptcp_loc_key;
++ u64 mptcp_rem_key;
++ u32 mptcp_loc_token;
++ };
++
++ struct {
++ /* Only on additional subflows */
++ struct mptcp_cb *mptcp_mpcb;
++ u32 mptcp_rem_nonce;
++ u32 mptcp_loc_nonce;
++ u64 mptcp_hash_tmac;
++ };
++ };
++
++ u8 loc_id;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 dss_csum:1,
++ is_sub:1, /* Is this a new subflow? */
++ low_prio:1, /* Interface set to low-prio? */
++ rcv_low_prio:1;
++};
++
++struct mptcp_options_received {
++ u16 saw_mpc:1,
++ dss_csum:1,
++ drop_me:1,
++
++ is_mp_join:1,
++ join_ack:1,
++
++ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
++ * 0x2 - low-prio set for another subflow
++ */
++ low_prio:1,
++
++ saw_add_addr:2, /* Saw at least one add_addr option:
++ * 0x1: IPv4 - 0x2: IPv6
++ */
++ more_add_addr:1, /* Saw one more add-addr. */
++
++ saw_rem_addr:1, /* Saw at least one rem_addr option */
++ more_rem_addr:1, /* Saw one more rem-addr. */
++
++ mp_fail:1,
++ mp_fclose:1;
++ u8 rem_id; /* Address-id in the MP_JOIN */
++ u8 prio_addr_id; /* Address-id in the MP_PRIO */
++
++ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
++ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
++
++ u32 data_ack;
++ u32 data_seq;
++ u16 data_len;
++
++ u32 mptcp_rem_token;/* Remote token */
++
++ /* Key inside the option (from mp_capable or fast_close) */
++ u64 mptcp_key;
++
++ u32 mptcp_recv_nonce;
++ u64 mptcp_recv_tmac;
++ u8 mptcp_recv_mac[20];
++};
++
++struct mptcp_tcp_sock {
++ struct tcp_sock *next; /* Next subflow socket */
++ struct hlist_node cb_list;
++ struct mptcp_options_received rx_opt;
++
++ /* Those three fields record the current mapping */
++ u64 map_data_seq;
++ u32 map_subseq;
++ u16 map_data_len;
++ u16 slave_sk:1,
++ fully_established:1,
++ establish_increased:1,
++ second_packet:1,
++ attached:1,
++ send_mp_fail:1,
++ include_mpc:1,
++ mapping_present:1,
++ map_data_fin:1,
++ low_prio:1, /* use this socket as backup */
++ rcv_low_prio:1, /* Peer sent low-prio option to us */
++ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
++ pre_established:1; /* State between sending 3rd ACK and
++ * receiving the fourth ack of new subflows.
++ */
++
++ /* isn: needed to translate abs to relative subflow seqnums */
++ u32 snt_isn;
++ u32 rcv_isn;
++ u8 path_index;
++ u8 loc_id;
++ u8 rem_id;
++
++#define MPTCP_SCHED_SIZE 4
++ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
++
++ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
++ * skb in the ofo-queue.
++ */
++
++ int init_rcv_wnd;
++ u32 infinite_cutoff_seq;
++ struct delayed_work work;
++ u32 mptcp_loc_nonce;
++ struct tcp_sock *tp; /* Where is my daddy? */
++ u32 last_end_data_seq;
++
++ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
++ struct timer_list mptcp_ack_timer;
++
++ /* HMAC of the third ack */
++ char sender_mac[20];
++};
++
++struct mptcp_tw {
++ struct list_head list;
++ u64 loc_key;
++ u64 rcv_nxt;
++ struct mptcp_cb __rcu *mpcb;
++ u8 meta_tw:1,
++ in_list:1;
++};
++
++#define MPTCP_PM_NAME_MAX 16
++struct mptcp_pm_ops {
++ struct list_head list;
++
++ /* Signal the creation of a new MPTCP-session. */
++ void (*new_session)(const struct sock *meta_sk);
++ void (*release_sock)(struct sock *meta_sk);
++ void (*fully_established)(struct sock *meta_sk);
++ void (*new_remote_address)(struct sock *meta_sk);
++ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio);
++ void (*addr_signal)(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts, struct sk_buff *skb);
++ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id);
++ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
++ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
++ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
++
++ char name[MPTCP_PM_NAME_MAX];
++ struct module *owner;
++};
++
++#define MPTCP_SCHED_NAME_MAX 16
++struct mptcp_sched_ops {
++ struct list_head list;
++
++ struct sock * (*get_subflow)(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test);
++ struct sk_buff * (*next_segment)(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit);
++ void (*init)(struct sock *sk);
++
++ char name[MPTCP_SCHED_NAME_MAX];
++ struct module *owner;
++};
++
++struct mptcp_cb {
++ /* list of sockets in this multipath connection */
++ struct tcp_sock *connection_list;
++ /* list of sockets that need a call to release_cb */
++ struct hlist_head callback_list;
++
++ /* High-order bits of 64-bit sequence numbers */
++ u32 snd_high_order[2];
++ u32 rcv_high_order[2];
++
++ u16 send_infinite_mapping:1,
++ in_time_wait:1,
++ list_rcvd:1, /* XXX TO REMOVE */
++ addr_signal:1, /* Path-manager wants us to call addr_signal */
++ dss_csum:1,
++ server_side:1,
++ infinite_mapping_rcv:1,
++ infinite_mapping_snd:1,
++ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
++ passive_close:1,
++ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
++ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
++
++ /* socket count in this connection */
++ u8 cnt_subflows;
++ u8 cnt_established;
++
++ struct mptcp_sched_ops *sched_ops;
++
++ struct sk_buff_head reinject_queue;
++ /* First cache-line boundary is here minus 8 bytes. But from the
++ * reinject-queue only the next and prev pointers are regularly
++ * accessed. Thus, the whole data-path is on a single cache-line.
++ */
++
++ u64 csum_cutoff_seq;
++
++ /***** Start of fields, used for connection closure */
++ spinlock_t tw_lock;
++ unsigned char mptw_state;
++ u8 dfin_path_index;
++
++ struct list_head tw_list;
++
++ /***** Start of fields, used for subflow establishment and closure */
++ atomic_t mpcb_refcnt;
++
++ /* Mutex needed, because otherwise mptcp_close will complain that the
++ * socket is owned by the user.
++ * E.g., mptcp_sub_close_wq is taking the meta-lock.
++ */
++ struct mutex mpcb_mutex;
++
++ /***** Start of fields, used for subflow establishment */
++ struct sock *meta_sk;
++
++ /* Master socket, also part of the connection_list, this
++ * socket is the one that the application sees.
++ */
++ struct sock *master_sk;
++
++ __u64 mptcp_loc_key;
++ __u64 mptcp_rem_key;
++ __u32 mptcp_loc_token;
++ __u32 mptcp_rem_token;
++
++#define MPTCP_PM_SIZE 608
++ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
++ struct mptcp_pm_ops *pm_ops;
++
++ u32 path_index_bits;
++ /* Next pi to pick up in case a new path becomes available */
++ u8 next_path_index;
++
++ /* Original snd/rcvbuf of the initial subflow.
++ * Used for the new subflows on the server-side to allow correct
++ * autotuning
++ */
++ int orig_sk_rcvbuf;
++ int orig_sk_sndbuf;
++ u32 orig_window_clamp;
++
++ /* Timer for retransmitting SYN/ACK+MP_JOIN */
++ struct timer_list synack_timer;
++};
++
++#define MPTCP_SUB_CAPABLE 0
++#define MPTCP_SUB_LEN_CAPABLE_SYN 12
++#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_CAPABLE_ACK 20
++#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
++
++#define MPTCP_SUB_JOIN 1
++#define MPTCP_SUB_LEN_JOIN_SYN 12
++#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
++#define MPTCP_SUB_LEN_JOIN_SYNACK 16
++#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
++#define MPTCP_SUB_LEN_JOIN_ACK 24
++#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
++
++#define MPTCP_SUB_DSS 2
++#define MPTCP_SUB_LEN_DSS 4
++#define MPTCP_SUB_LEN_DSS_ALIGN 4
++
++/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
++ * as they are part of the DSS-option.
++ * To get the total length, just add the different options together.
++ */
++#define MPTCP_SUB_LEN_SEQ 10
++#define MPTCP_SUB_LEN_SEQ_CSUM 12
++#define MPTCP_SUB_LEN_SEQ_ALIGN 12
++
++#define MPTCP_SUB_LEN_SEQ_64 14
++#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
++#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
++
++#define MPTCP_SUB_LEN_ACK 4
++#define MPTCP_SUB_LEN_ACK_ALIGN 4
++
++#define MPTCP_SUB_LEN_ACK_64 8
++#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
++
++/* This is the "default" option-length we will send out most often.
++ * MPTCP DSS-header
++ * 32-bit data sequence number
++ * 32-bit data ack
++ *
++ * It is necessary to calculate the effective MSS we will be using when
++ * sending data.
++ */
++#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
++ MPTCP_SUB_LEN_SEQ_ALIGN + \
++ MPTCP_SUB_LEN_ACK_ALIGN)
++
++#define MPTCP_SUB_ADD_ADDR 3
++#define MPTCP_SUB_LEN_ADD_ADDR4 8
++#define MPTCP_SUB_LEN_ADD_ADDR6 20
++#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
++#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
++
++#define MPTCP_SUB_REMOVE_ADDR 4
++#define MPTCP_SUB_LEN_REMOVE_ADDR 4
++
++#define MPTCP_SUB_PRIO 5
++#define MPTCP_SUB_LEN_PRIO 3
++#define MPTCP_SUB_LEN_PRIO_ADDR 4
++#define MPTCP_SUB_LEN_PRIO_ALIGN 4
++
++#define MPTCP_SUB_FAIL 6
++#define MPTCP_SUB_LEN_FAIL 12
++#define MPTCP_SUB_LEN_FAIL_ALIGN 12
++
++#define MPTCP_SUB_FCLOSE 7
++#define MPTCP_SUB_LEN_FCLOSE 12
++#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
++
++
++#define OPTION_MPTCP (1 << 5)
++
++#ifdef CONFIG_MPTCP
++
++/* Used for checking if the mptcp initialization has been successful */
++extern bool mptcp_init_failed;
++
++/* MPTCP options */
++#define OPTION_TYPE_SYN (1 << 0)
++#define OPTION_TYPE_SYNACK (1 << 1)
++#define OPTION_TYPE_ACK (1 << 2)
++#define OPTION_MP_CAPABLE (1 << 3)
++#define OPTION_DATA_ACK (1 << 4)
++#define OPTION_ADD_ADDR (1 << 5)
++#define OPTION_MP_JOIN (1 << 6)
++#define OPTION_MP_FAIL (1 << 7)
++#define OPTION_MP_FCLOSE (1 << 8)
++#define OPTION_REMOVE_ADDR (1 << 9)
++#define OPTION_MP_PRIO (1 << 10)
++
++/* MPTCP flags: both TX and RX */
++#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
++#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
++#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
++/* MPTCP flags: RX only */
++#define MPTCPHDR_ACK 0x08
++#define MPTCPHDR_SEQ64_SET 0x10 /* Did we receive a 64-bit seq number? */
++#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
++#define MPTCPHDR_DSS_CSUM 0x40
++#define MPTCPHDR_JOIN 0x80
++/* MPTCP flags: TX only */
++#define MPTCPHDR_INF 0x08
++
++struct mptcp_option {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_capable {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ver:4,
++ sub:4;
++ __u8 h:1,
++ rsv:5,
++ b:1,
++ a:1;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ver:4;
++ __u8 a:1,
++ b:1,
++ rsv:5,
++ h:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 sender_key;
++ __u64 receiver_key;
++} __attribute__((__packed__));
++
++struct mp_join {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ u32 token;
++ u32 nonce;
++ } syn;
++ struct {
++ __u64 mac;
++ u32 nonce;
++ } synack;
++ struct {
++ __u8 mac[20];
++ } ack;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_dss {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ A:1,
++ a:1,
++ M:1,
++ m:1,
++ F:1,
++ rsv2:3;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:3,
++ F:1,
++ m:1,
++ M:1,
++ a:1,
++ A:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++};
++
++struct mp_add_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 ipver:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ ipver:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++ union {
++ struct {
++ struct in_addr addr;
++ __be16 port;
++ } v4;
++ struct {
++ struct in6_addr addr;
++ __be16 port;
++ } v6;
++ } u;
++} __attribute__((__packed__));
++
++struct mp_remove_addr {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 rsv:4,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:4;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ /* list of addr_id */
++ __u8 addrs_id;
++};
++
++struct mp_fail {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __be64 data_seq;
++} __attribute__((__packed__));
++
++struct mp_fclose {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u16 rsv1:4,
++ sub:4,
++ rsv2:8;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u16 sub:4,
++ rsv1:4,
++ rsv2:8;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u64 key;
++} __attribute__((__packed__));
++
++struct mp_prio {
++ __u8 kind;
++ __u8 len;
++#if defined(__LITTLE_ENDIAN_BITFIELD)
++ __u8 b:1,
++ rsv:3,
++ sub:4;
++#elif defined(__BIG_ENDIAN_BITFIELD)
++ __u8 sub:4,
++ rsv:3,
++ b:1;
++#else
++#error "Adjust your <asm/byteorder.h> defines"
++#endif
++ __u8 addr_id;
++} __attribute__((__packed__));
++
++static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
++{
++ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
++}
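The DSS length arithmetic above can be checked outside the kernel. The sketch below reimplements the same computation in plain userspace C, with ordinary `int` parameters standing in for the `mp_dss` bitfield flags (`A`/`a`: data-ack present / 64-bit, `M`/`m`: mapping present / 64-bit dseq); `dss_len` is an illustrative name, not kernel code.

```c
/* Userspace sketch of mptcp_sub_len_dss(): total DSS option length from
 * its flag bits.  4 bytes of DSS header, plus a data-ack (4 bytes, or 8
 * when the 'a' flag selects 64-bit), plus a data-sequence mapping
 * (10 bytes, +4 for a 64-bit dseq, +2 for a DSS checksum).
 * Plain ints stand in for the mp_dss bitfields. */
static int dss_len(int A, int a, int M, int m, int csum)
{
	return 4 + A * (4 + a * 4) + M * (10 + m * 4 + csum * 2);
}
```

For example, a 32-bit data-ack plus a 32-bit mapping without checksum (`A=1, M=1`, rest 0) gives 18 bytes, matching MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK + MPTCP_SUB_LEN_SEQ below.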
++
++#define MPTCP_APP 2
++
++extern int sysctl_mptcp_enabled;
++extern int sysctl_mptcp_checksum;
++extern int sysctl_mptcp_debug;
++extern int sysctl_mptcp_syn_retries;
++
++extern struct workqueue_struct *mptcp_wq;
++
++#define mptcp_debug(fmt, args...) \
++ do { \
++ if (unlikely(sysctl_mptcp_debug)) \
++ pr_err(__FILE__ ": " fmt, ##args); \
++ } while (0)
++
++/* Iterates over all subflows */
++#define mptcp_for_each_tp(mpcb, tp) \
++ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
++
++#define mptcp_for_each_sk(mpcb, sk) \
++ for ((sk) = (struct sock *)(mpcb)->connection_list; \
++ sk; \
++ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
++
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
++ for (__sk = (struct sock *)(__mpcb)->connection_list, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
++ __sk; \
++ __sk = __temp, \
++ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
++
++/* Iterates over all bits set to 1 in a bitset */
++#define mptcp_for_each_bit_set(b, i) \
++ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
++
++#define mptcp_for_each_bit_unset(b, i) \
++ mptcp_for_each_bit_set(~b, i)
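mptcp_for_each_bit_set walks the 1-bits of a bitfield lowest-first: ffs() finds the next set bit, and shifting right then left by `i + 1` clears bits 0..i before the next ffs() call. A standalone userspace sketch of the same idea (the `collect_set_bits` helper is illustrative, not kernel code):

```c
#include <strings.h>	/* ffs() */

/* Same iteration idea as mptcp_for_each_bit_set: after visiting bit i,
 * (b >> (i + 1) << (i + 1)) zeroes bits 0..i, so ffs() returns the next
 * set bit (or 0 when none are left, ending the loop at i == -1). */
#define for_each_bit_set(b, i) \
	for ((i) = ffs(b) - 1; (i) >= 0; \
	     (i) = ffs((b) >> ((i) + 1) << ((i) + 1)) - 1)

/* Illustrative helper: collect the set-bit indices of mask into out[],
 * returning how many were found. */
static int collect_set_bits(unsigned int mask, int *out)
{
	int i, n = 0;

	for_each_bit_set(mask, i)
		out[n++] = i;
	return n;
}
```

Like the kernel macro, this assumes bit 31 is unused (a shift by 32 would be undefined); in the mpcb path-index bitfield, index 0 is reserved for the meta-sk anyway.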
++
++extern struct lock_class_key meta_key;
++extern struct lock_class_key meta_slock_key;
++extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
++
++/* This is needed to ensure that two subsequent key/nonce-generation result in
++ * different keys/nonces if the IPs and ports are the same.
++ */
++extern u32 mptcp_seed;
++
++#define MPTCP_HASH_SIZE 1024
++
++extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* Lock, protecting the two hash-tables that hold the token. Namely,
++ * mptcp_reqsk_tk_htb and tk_hashtable
++ */
++extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++/* Request-sockets can be hashed in the tk_htb for collision-detection or in
++ * the regular htb for join-connections. We need to define different NULLS
++ * values so that we can correctly detect a request-socket that has been
++ * recycled. See also c25eb3bfb9729.
++ */
++#define MPTCP_REQSK_NULLS_BASE (1U << 29)
++
++
++void mptcp_data_ready(struct sock *sk);
++void mptcp_write_space(struct sock *sk);
++
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk);
++void mptcp_ofo_queue(struct sock *meta_sk);
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags);
++void mptcp_del_sock(struct sock *sk);
++void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
++void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
++void mptcp_update_sndbuf(const struct tcp_sock *tp);
++void mptcp_send_fin(struct sock *meta_sk);
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
++bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt);
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining);
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size);
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb);
++void mptcp_close(struct sock *meta_sk, long timeout);
++int mptcp_doit(struct sock *sk);
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev);
++struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt);
++u32 __mptcp_select_window(struct sock *sk);
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++unsigned int mptcp_current_mss(struct sock *meta_sk);
++int mptcp_select_size(const struct sock *meta_sk, bool sg);
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out);
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
++void mptcp_fin(struct sock *meta_sk);
++void mptcp_retransmit_timer(struct sock *meta_sk);
++int mptcp_write_wakeup(struct sock *meta_sk);
++void mptcp_sub_close_wq(struct work_struct *work);
++void mptcp_sub_close(struct sock *sk, unsigned long delay);
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
++void mptcp_fallback_meta_sk(struct sock *meta_sk);
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_ack_handler(unsigned long);
++int mptcp_check_rtt(const struct tcp_sock *tp, int time);
++int mptcp_check_snd_buf(const struct tcp_sock *tp);
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb);
++void __init mptcp_init(void);
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
++void mptcp_destroy_sock(struct sock *sk);
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt);
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed);
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
++void mptcp_time_wait(struct sock *sk, int state, int timeo);
++void mptcp_disconnect(struct sock *sk);
++bool mptcp_should_expand_sndbuf(const struct sock *sk);
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
++void mptcp_tsq_flags(struct sock *sk);
++void mptcp_tsq_sub_deferred(struct sock *meta_sk);
++struct mp_join *mptcp_find_join(const struct sk_buff *skb);
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
++void mptcp_hash_remove(struct tcp_sock *meta_tp);
++struct sock *mptcp_hash_find(const struct net *net, const u32 token);
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net);
++void mptcp_reqsk_destructor(struct request_sock *req);
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb);
++int mptcp_check_req(struct sk_buff *skb, struct net *net);
++void mptcp_connect_init(struct sock *sk);
++void mptcp_sub_force_close(struct sock *sk);
++int mptcp_sub_len_remove_addr_align(u16 bitfield);
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb);
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
++void mptcp_init_congestion_control(struct sock *sk);
++
++/* MPTCP-path-manager registration/initialization functions */
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
++void mptcp_init_path_manager(struct mptcp_cb *mpcb);
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
++void mptcp_fallback_default(struct mptcp_cb *mpcb);
++void mptcp_get_default_path_manager(char *name);
++int mptcp_set_default_path_manager(const char *name);
++extern struct mptcp_pm_ops mptcp_pm_default;
++
++/* MPTCP-scheduler registration/initialization functions */
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
++void mptcp_init_scheduler(struct mptcp_cb *mpcb);
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
++void mptcp_get_default_scheduler(char *name);
++int mptcp_set_default_scheduler(const char *name);
++extern struct mptcp_sched_ops mptcp_sched_default;
++
++static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
++ unsigned long len)
++{
++ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
++ jiffies + len);
++}
++
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
++{
++ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
++}
++
++static inline bool is_mptcp_enabled(const struct sock *sk)
++{
++ if (!sysctl_mptcp_enabled || mptcp_init_failed)
++ return false;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return false;
++
++ return true;
++}
++
++static inline int mptcp_pi_to_flag(int pi)
++{
++ return 1 << (pi - 1);
++}
++
++static inline
++struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
++{
++ return (struct mptcp_request_sock *)req;
++}
++
++static inline
++struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
++{
++ return (struct request_sock *)req;
++}
++
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ struct sock *sk_it;
++
++ if (tcp_sk(sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
++ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
++ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
++ return false;
++ }
++
++ return true;
++}
++
++static inline void mptcp_push_pending_frames(struct sock *meta_sk)
++{
++	/* We check packets_out and the send-head here. TCP only checks the
++ * send-head. But, MPTCP also checks packets_out, as this is an
++ * indication that we might want to do opportunistic reinjection.
++ */
++ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
++ struct tcp_sock *tp = tcp_sk(meta_sk);
++
++ /* We don't care about the MSS, because it will be set in
++ * mptcp_write_xmit.
++ */
++ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
++ }
++}
++
++static inline void mptcp_send_reset(struct sock *sk)
++{
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++ mptcp_sub_force_close(sk);
++}
++
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
++}
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
++}
++
++/* Is it a data-fin while in infinite mapping mode?
++ * In infinite mode, a subflow-fin is in fact a data-fin.
++ */
++static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
++ const struct tcp_sock *tp)
++{
++ return mptcp_is_data_fin(skb) ||
++ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
++}
++
++static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
++{
++ u64 data_seq_high = (u32)(data_seq >> 32);
++
++ if (mpcb->rcv_high_order[0] == data_seq_high)
++ return 0;
++ else if (mpcb->rcv_high_order[1] == data_seq_high)
++ return MPTCPHDR_SEQ64_INDEX;
++ else
++ return MPTCPHDR_SEQ64_OFO;
++}
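Because DSS options may carry only 32-bit data sequence numbers, the connection tracks two candidate high-order words and mptcp_get_64_bit classifies an incoming 64-bit dseq against them. A userspace sketch of that classification, where `classify_seq64` and the two-element `high[]` array are illustrative stand-ins for the function and mpcb->rcv_high_order:

```c
#include <stdint.h>

#define SEQ64_INDEX 0x04	/* mirrors MPTCPHDR_SEQ64_INDEX */
#define SEQ64_OFO   0x20	/* mirrors MPTCPHDR_SEQ64_OFO */

/* Compare the high 32 bits of a 64-bit data sequence number against the
 * two high-order words the connection currently tracks.  A match on
 * slot 0 needs no flag, a match on slot 1 sets the index flag, and no
 * match means the sequence falls outside the circular array (OFO). */
static uint8_t classify_seq64(uint64_t data_seq, const uint32_t high[2])
{
	uint32_t hi = (uint32_t)(data_seq >> 32);

	if (high[0] == hi)
		return 0;
	if (high[1] == hi)
		return SEQ64_INDEX;
	return SEQ64_OFO;
}
```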
++
++/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
++ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
++ */
++static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
++ u32 *data_seq,
++ struct mptcp_cb *mpcb)
++{
++ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
++
++ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ if (mpcb)
++ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
++
++ *data_seq = (u32)data_seq64;
++ ptr++;
++ } else {
++ *data_seq = get_unaligned_be32(ptr);
++ }
++
++ return ptr;
++}
++
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return tcp_sk(sk)->meta_sk;
++}
++
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return tcp_sk(tp->meta_sk);
++}
++
++static inline int is_meta_tp(const struct tcp_sock *tp)
++{
++ return tp->mpcb && mptcp_meta_tp(tp) == tp;
++}
++
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
++ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
++}
++
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
++}
++
++static inline void mptcp_hash_request_remove(struct request_sock *req)
++{
++ int in_softirq = 0;
++
++ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
++ return;
++
++ if (in_softirq()) {
++ spin_lock(&mptcp_reqsk_hlock);
++ in_softirq = 1;
++ } else {
++ spin_lock_bh(&mptcp_reqsk_hlock);
++ }
++
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++
++ if (in_softirq)
++ spin_unlock(&mptcp_reqsk_hlock);
++ else
++ spin_unlock_bh(&mptcp_reqsk_hlock);
++}
++
++static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
++{
++ mopt->saw_mpc = 0;
++ mopt->dss_csum = 0;
++ mopt->drop_me = 0;
++
++ mopt->is_mp_join = 0;
++ mopt->join_ack = 0;
++
++ mopt->saw_low_prio = 0;
++ mopt->low_prio = 0;
++
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline void mptcp_reset_mopt(struct tcp_sock *tp)
++{
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ mopt->saw_low_prio = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_add_addr = 0;
++ mopt->saw_rem_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->join_ack = 0;
++ mopt->mp_fail = 0;
++ mopt->mp_fclose = 0;
++}
++
++static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
++ const struct mptcp_cb *mpcb)
++{
++ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
++ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
++}
++
++static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
++ u32 data_seq_32)
++{
++ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
++}
++
++static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
++{
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_nxt);
++}
++
++static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
++{
++ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
++ }
++}
++
++static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
++ u32 old_rcv_nxt)
++{
++ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
++ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
++ }
++}
++
++static inline int mptcp_sk_can_send(const struct sock *sk)
++{
++ return tcp_passive_fastopen(sk) ||
++ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
++ !tcp_sk(sk)->mptcp->pre_established);
++}
++
++static inline int mptcp_sk_can_recv(const struct sock *sk)
++{
++ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
++}
++
++static inline int mptcp_sk_can_send_ack(const struct sock *sk)
++{
++ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
++ TCPF_CLOSE | TCPF_LISTEN)) &&
++ !tcp_sk(sk)->mptcp->pre_established;
++}
++
++/* Only support GSO if all subflows supports it */
++static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!sk_can_gso(sk))
++ return false;
++ }
++ return true;
++}
++
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ struct sock *sk;
++
++ if (tcp_sk(meta_sk)->mpcb->dss_csum)
++ return false;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++ if (!(sk->sk_route_caps & NETIF_F_SG))
++ return false;
++ }
++ return true;
++}
++
++static inline void mptcp_set_rto(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *sk_it;
++ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
++ __u32 max_rto = 0;
++
++ /* We are in recovery-phase on the MPTCP-level. Do not update the
++ * RTO, because this would kill exponential backoff.
++ */
++ if (micsk->icsk_retransmits)
++ return;
++
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send(sk_it) &&
++ inet_csk(sk_it)->icsk_rto > max_rto)
++ max_rto = inet_csk(sk_it)->icsk_rto;
++ }
++ if (max_rto) {
++ micsk->icsk_rto = max_rto << 1;
++
++ /* A successful RTO measurement - reset backoff counter */
++ micsk->icsk_backoff = 0;
++ }
++}
++
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return sysctl_mptcp_syn_retries;
++}
++
++static inline void mptcp_sub_close_passive(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
++
++ /* Only close if the app did a send-shutdown (passive close) and we
++ * received the data-ack of the data-fin.
++ */
++ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
++ mptcp_sub_close(sk, 0);
++}
++
++static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If data has been acknowledged on the meta-level, fully_established
++ * will have been set before and thus we will not fall back to infinite
++ * mapping.
++ */
++ if (likely(tp->mptcp->fully_established))
++ return false;
++
++ if (!(flag & MPTCP_FLAG_DATA_ACKED))
++ return false;
++
++ /* Don't fallback twice ;) */
++ if (tp->mpcb->infinite_mapping_snd)
++ return false;
++
++ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
++ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
++ __builtin_return_address(0));
++ if (!is_master_tp(tp))
++ return true;
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++
++ return false;
++}
++
++/* Find the first index whose bit in the bit-field == 0 */
++static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
++{
++ u8 base = mpcb->next_path_index;
++ int i;
++
++ /* Start at 1, because 0 is reserved for the meta-sk */
++ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
++ if (i + base < 1)
++ continue;
++ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ i += base;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
++ if (i >= sizeof(mpcb->path_index_bits) * 8)
++ break;
++ if (i < 1)
++ continue;
++ mpcb->path_index_bits |= (1 << i);
++ mpcb->next_path_index = i + 1;
++ return i;
++ }
++
++ return 0;
++}
++
++static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
++{
++ return sk->sk_family == AF_INET6 &&
++ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
++}
++
++/* TCP and MPTCP mpc flag-depending functions */
++u16 mptcp_select_window(struct sock *sk);
++void mptcp_init_buffer_space(struct sock *sk);
++void mptcp_tcp_set_rto(struct sock *sk);
++
++/* TCP and MPTCP flag-depending functions */
++bool mptcp_prune_ofo_queue(struct sock *sk);
++
++#else /* CONFIG_MPTCP */
++#define mptcp_debug(fmt, args...) \
++ do { \
++ } while (0)
++
++/* Without MPTCP, we just do one iteration
++ * over the only socket available. This assumes that
++ * the sk/tp arg is the socket in that case.
++ */
++#define mptcp_for_each_sk(mpcb, sk)
++#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
++
++static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
++{
++ return false;
++}
++static inline struct sock *mptcp_meta_sk(const struct sock *sk)
++{
++ return NULL;
++}
++static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
++{
++ return NULL;
++}
++static inline int is_meta_sk(const struct sock *sk)
++{
++ return 0;
++}
++static inline int is_master_tp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
++static inline void mptcp_del_sock(const struct sock *sk) {}
++static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
++static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
++static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
++static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
++ const struct sock *sk) {}
++static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
++static inline void mptcp_set_rto(const struct sock *sk) {}
++static inline void mptcp_send_fin(const struct sock *meta_sk) {}
++static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_syn_options(const struct sock *sk,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++static inline void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts,
++ unsigned *remaining) {}
++
++static inline void mptcp_established_options(struct sock *sk,
++ struct sk_buff *skb,
++ struct tcp_out_options *opts,
++ unsigned *size) {}
++static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb) {}
++static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
++static inline int mptcp_doit(struct sock *sk)
++{
++ return 0;
++}
++static inline int mptcp_check_req_fastopen(struct sock *child,
++ struct request_sock *req)
++{
++ return 1;
++}
++static inline int mptcp_check_req_master(const struct sock *sk,
++ const struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ return 1;
++}
++static inline struct sock *mptcp_check_req_child(struct sock *sk,
++ struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ return NULL;
++}
++static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ return 0;
++}
++static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ return 0;
++}
++static inline void mptcp_sub_close_passive(struct sock *sk) {}
++static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
++{
++ return false;
++}
++static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
++static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ return 0;
++}
++static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ return 0;
++}
++static inline int mptcp_sysctl_syn_retries(void)
++{
++ return 0;
++}
++static inline void mptcp_send_reset(const struct sock *sk) {}
++static inline int mptcp_handle_options(struct sock *sk,
++ const struct tcphdr *th,
++ struct sk_buff *skb)
++{
++ return 0;
++}
++static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
++static inline void __init mptcp_init(void) {}
++static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ return 0;
++}
++static inline bool mptcp_sk_can_gso(const struct sock *sk)
++{
++ return false;
++}
++static inline bool mptcp_can_sg(const struct sock *meta_sk)
++{
++ return false;
++}
++static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
++ u32 mss_now, int large_allowed)
++{
++ return 0;
++}
++static inline void mptcp_destroy_sock(struct sock *sk) {}
++static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
++ struct sock **skptr,
++ struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ return 0;
++}
++static inline bool mptcp_can_sendpage(struct sock *sk)
++{
++ return false;
++}
++static inline int mptcp_init_tw_sock(struct sock *sk,
++ struct tcp_timewait_sock *tw)
++{
++ return 0;
++}
++static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
++static inline void mptcp_disconnect(struct sock *sk) {}
++static inline void mptcp_tsq_flags(struct sock *sk) {}
++static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
++static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
++static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
++static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct tcp_options_received *rx_opt,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb) {}
++static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb) {}
++static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_H */
+diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
+new file mode 100644
+index 000000000000..93ad97c77c5a
+--- /dev/null
++++ b/include/net/mptcp_v4.h
+@@ -0,0 +1,67 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef MPTCP_V4_H_
++#define MPTCP_V4_H_
++
++
++#include <linux/in.h>
++#include <linux/skbuff.h>
++#include <net/mptcp.h>
++#include <net/request_sock.h>
++#include <net/sock.h>
++
++extern struct request_sock_ops mptcp_request_sock_ops;
++extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++#ifdef CONFIG_MPTCP
++
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net);
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem);
++int mptcp_pm_v4_init(void);
++void mptcp_pm_v4_undo(void);
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
++
++#else
++
++static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
++ const struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* MPTCP_V4_H_ */
+diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
+new file mode 100644
+index 000000000000..49a4f30ccd4d
+--- /dev/null
++++ b/include/net/mptcp_v6.h
+@@ -0,0 +1,69 @@
++/*
++ * MPTCP implementation
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef _MPTCP_V6_H
++#define _MPTCP_V6_H
++
++#include <linux/in6.h>
++#include <net/if_inet6.h>
++
++#include <net/mptcp.h>
++
++
++#ifdef CONFIG_MPTCP
++extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
++extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
++extern struct request_sock_ops mptcp6_request_sock_ops;
++extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net);
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem);
++int mptcp_pm_v6_init(void);
++void mptcp_pm_v6_undo(void);
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport);
++
++#else /* CONFIG_MPTCP */
++
++#define mptcp_v6_mapped ipv6_mapped
++
++static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return 0;
++}
++
++#endif /* CONFIG_MPTCP */
++
++#endif /* _MPTCP_V6_H */
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..bae95a11c531 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -16,6 +16,7 @@
+ #include <net/netns/packet.h>
+ #include <net/netns/ipv4.h>
+ #include <net/netns/ipv6.h>
++#include <net/netns/mptcp.h>
+ #include <net/netns/ieee802154_6lowpan.h>
+ #include <net/netns/sctp.h>
+ #include <net/netns/dccp.h>
+@@ -92,6 +93,9 @@ struct net {
+ #if IS_ENABLED(CONFIG_IPV6)
+ struct netns_ipv6 ipv6;
+ #endif
++#if IS_ENABLED(CONFIG_MPTCP)
++ struct netns_mptcp mptcp;
++#endif
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ struct netns_ieee802154_lowpan ieee802154_lowpan;
+ #endif
+diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
+new file mode 100644
+index 000000000000..bad418b04cc8
+--- /dev/null
++++ b/include/net/netns/mptcp.h
+@@ -0,0 +1,44 @@
++/*
++ * MPTCP implementation - MPTCP namespace
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#ifndef __NETNS_MPTCP_H__
++#define __NETNS_MPTCP_H__
++
++#include <linux/compiler.h>
++
++enum {
++ MPTCP_PM_FULLMESH = 0,
++ MPTCP_PM_MAX
++};
++
++struct netns_mptcp {
++ void *path_managers[MPTCP_PM_MAX];
++};
++
++#endif /* __NETNS_MPTCP_H__ */
+diff --git a/include/net/request_sock.h b/include/net/request_sock.h
+index 7f830ff67f08..e79e87a8e1a6 100644
+--- a/include/net/request_sock.h
++++ b/include/net/request_sock.h
+@@ -164,7 +164,7 @@ struct request_sock_queue {
+ };
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries);
++ unsigned int nr_table_entries, gfp_t flags);
+
+ void __reqsk_queue_destroy(struct request_sock_queue *queue);
+ void reqsk_queue_destroy(struct request_sock_queue *queue);
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..0e23cae8861f 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
+
+ int sk_wait_data(struct sock *sk, long *timeo);
+
++/* START - needed for MPTCP */
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
++void sock_lock_init(struct sock *sk);
++
++extern struct lock_class_key af_callback_keys[AF_MAX];
++extern char *const af_family_clock_key_strings[AF_MAX+1];
++
++#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
++/* END - needed for MPTCP */
++
+ struct request_sock_ops;
+ struct timewait_sock_ops;
+ struct inet_hashinfo;
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..ff92e74cd684 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TCPOPT_SACK 5 /* SACK Block */
+ #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
+ #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
++#define TCPOPT_MPTCP 30
+ #define TCPOPT_EXP 254 /* Experimental */
+ /* Magic number to be after the option value for sharing TCP
+ * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
+@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
+ #define TFO_SERVER_WO_SOCKOPT1 0x400
+ #define TFO_SERVER_WO_SOCKOPT2 0x800
+
++/* Flags from tcp_input.c for tcp_ack */
++#define FLAG_DATA 0x01 /* Incoming frame contained data. */
++#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
++#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
++#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
++#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
++#define FLAG_DATA_SACKED 0x20 /* New SACK. */
++#define FLAG_ECE 0x40 /* ECE in this ACK */
++#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
++#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
++#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
++#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
++#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
++#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
++#define MPTCP_FLAG_DATA_ACKED 0x8000
++
++#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
++#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
++#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
++#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
++
+ extern struct inet_timewait_death_row tcp_death_row;
+
+ /* sysctl variables for tcp */
+@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+ #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
+ #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
++/**** START - Exports needed for MPTCP ****/
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
++extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
++
++struct mptcp_options_received;
++
++void tcp_enter_quickack_mode(struct sock *sk);
++int tcp_close_state(struct sock *sk);
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb);
++int tcp_xmit_probe_skb(struct sock *sk, int urgent);
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask);
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle);
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle);
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss);
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++void __pskb_trim_head(struct sk_buff *skb, int len);
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
++void tcp_reset(struct sock *sk);
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin);
++bool tcp_urg_mode(const struct tcp_sock *tp);
++void tcp_ack_probe(struct sock *sk);
++void tcp_rearm_rto(struct sock *sk);
++int tcp_write_timeout(struct sock *sk);
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set);
++void tcp_write_err(struct sock *sk);
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now);
++
++int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc);
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_reqsk_destructor(struct request_sock *req);
++
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req);
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
++void tcp_v6_destroy_sock(struct sock *sk);
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
++void tcp_v6_hash(struct sock *sk);
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb);
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst);
++void tcp_v6_reqsk_destructor(struct request_sock *req);
++
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
++ int large_allowed);
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
++
++void skb_clone_fraglist(struct sk_buff *skb);
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
++
++void inet_twsk_free(struct inet_timewait_sock *tw);
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
++/* These states need RST on ABORT according to RFC793 */
++static inline bool tcp_need_reset(int state)
++{
++ return (1 << state) &
++ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
++ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
++}
++
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
++ int hlen);
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen);
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
++ struct sk_buff *from, bool *fragstolen);
++/**** END - Exports needed for MPTCP ****/
++
+ void tcp_tasklet_init(void);
+
+ void tcp_v4_err(struct sk_buff *skb, u32);
+@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ size_t len, int nonblock, int flags, int *addr_len);
+ void tcp_parse_options(const struct sk_buff *skb,
+ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt_rx,
+ int estab, struct tcp_fastopen_cookie *foc);
+ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+
+@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+
+ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ u16 *mssp);
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
+-#else
+-static inline __u32 cookie_v4_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
+ #endif
+
+ __u32 cookie_init_timestamp(struct request_sock *req);
+@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+ const struct tcphdr *th, u16 *mssp);
+ __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
+ __u16 *mss);
+-#else
+-static inline __u32 cookie_v6_init_sequence(struct sock *sk,
+- struct sk_buff *skb,
+- __u16 *mss)
+-{
+- return 0;
+-}
+ #endif
+ /* tcp_output.c */
+
+@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+ void tcp_send_loss_probe(struct sock *sk);
+ bool tcp_schedule_loss_probe(struct sock *sk);
+
++u16 tcp_select_window(struct sock *sk);
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++
+ /* tcp_input.c */
+ void tcp_resume_early_retransmit(struct sock *sk);
+ void tcp_rearm_rto(struct sock *sk);
+ void tcp_reset(struct sock *sk);
++void tcp_set_rto(struct sock *sk);
++bool tcp_should_expand_sndbuf(const struct sock *sk);
++bool tcp_prune_ofo_queue(struct sock *sk);
+
+ /* tcp_timer.c */
+ void tcp_init_xmit_timers(struct sock *);
+@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+ */
+ struct tcp_skb_cb {
+ union {
+- struct inet_skb_parm h4;
++ union {
++ struct inet_skb_parm h4;
+ #if IS_ENABLED(CONFIG_IPV6)
+- struct inet6_skb_parm h6;
++ struct inet6_skb_parm h6;
+ #endif
+- } header; /* For incoming frames */
++ } header; /* For incoming frames */
++#ifdef CONFIG_MPTCP
++ union { /* For MPTCP outgoing frames */
++ __u32 path_mask; /* paths that tried to send this skb */
++ __u32 dss[6]; /* DSS options */
++ };
++#endif
++ };
+ __u32 seq; /* Starting sequence number */
+ __u32 end_seq; /* SEQ + FIN + SYN + datalen */
+ __u32 when; /* used to compute rtt's */
++#ifdef CONFIG_MPTCP
++ __u8 mptcp_flags; /* flags for the MPTCP layer */
++ __u8 dss_off; /* Number of 4-byte words until
++ * seq-number */
++#endif
+ __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
+
+ __u8 sacked; /* State flags for SACK/FACK. */
+@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+ /* Determine a window scaling and initial window to offer. */
+ void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+ __u32 *window_clamp, int wscale_ok,
+- __u8 *rcv_wscale, __u32 init_rcv_wnd);
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
+
+ static inline int tcp_win_from_space(int space)
+ {
+@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+ space - (space>>sysctl_tcp_adv_win_scale);
+ }
+
++#ifdef CONFIG_MPTCP
++extern struct static_key mptcp_static_key;
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return static_key_false(&mptcp_static_key) && tp->mpc;
++}
++#else
++static inline bool mptcp(const struct tcp_sock *tp)
++{
++ return 0;
++}
++#endif
++
+ /* Note: caller must be prepared to deal with negative returns */
+ static inline int tcp_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf -
+ atomic_read(&sk->sk_rmem_alloc));
+ }
+
+ static inline int tcp_full_space(const struct sock *sk)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = tcp_sk(sk)->meta_sk;
++
+ return tcp_win_from_space(sk->sk_rcvbuf);
+ }
+
+@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+ ireq->wscale_ok = rx_opt->wscale_ok;
+ ireq->acked = 0;
+ ireq->ecn_ok = 0;
++ ireq->mptcp_rqsk = 0;
++ ireq->saw_mpc = 0;
+ ireq->ir_rmt_port = tcp_hdr(skb)->source;
+ ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
+ }
+@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+ void tcp4_proc_exit(void);
+ #endif
+
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb);
++
+ /* TCP af-specific functions */
+ struct tcp_sock_af_ops {
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+ #endif
+ };
+
++/* TCP/MPTCP-specific functions */
++struct tcp_sock_ops {
++ u32 (*__select_window)(struct sock *sk);
++ u16 (*select_window)(struct sock *sk);
++ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk);
++ void (*init_buffer_space)(struct sock *sk);
++ void (*set_rto)(struct sock *sk);
++ bool (*should_expand_sndbuf)(const struct sock *sk);
++ void (*send_fin)(struct sock *sk);
++ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp);
++ void (*send_active_reset)(struct sock *sk, gfp_t priority);
++ int (*write_wakeup)(struct sock *sk);
++ bool (*prune_ofo_queue)(struct sock *sk);
++ void (*retransmit_timer)(struct sock *sk);
++ void (*time_wait)(struct sock *sk, int state, int timeo);
++ void (*cleanup_rbuf)(struct sock *sk, int copied);
++ void (*init_congestion_control)(struct sock *sk);
++};
++extern const struct tcp_sock_ops tcp_specific;
++
+ struct tcp_request_sock_ops {
++ u16 mss_clamp;
+ #ifdef CONFIG_TCP_MD5SIG
+ struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
+ struct request_sock *req);
+@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+ const struct request_sock *req,
+ const struct sk_buff *skb);
+ #endif
++ int (*init_req)(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb);
++#ifdef CONFIG_SYN_COOKIES
++ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mss);
++#endif
++ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict);
++ __u32 (*init_seq)(const struct sk_buff *skb);
++ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl, struct request_sock *req,
++ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
++ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
++ const unsigned long timeout);
+ };
+
++#ifdef CONFIG_SYN_COOKIES
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return ops->cookie_init_seq(sk, skb, mss);
++}
++#else
++static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
++ struct sock *sk, struct sk_buff *skb,
++ __u16 *mss)
++{
++ return 0;
++}
++#endif
++
+ int tcpv4_offload_init(void);
+
+ void tcp_v4_init(void);
+diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
+index 9cf2394f0bcf..c2634b6ed854 100644
+--- a/include/uapi/linux/if.h
++++ b/include/uapi/linux/if.h
+@@ -109,6 +109,9 @@ enum net_device_flags {
+ #define IFF_DORMANT IFF_DORMANT
+ #define IFF_ECHO IFF_ECHO
+
++#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
++#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
++
+ #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
+ IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
+
+diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
+index 3b9718328d8b..487475681d84 100644
+--- a/include/uapi/linux/tcp.h
++++ b/include/uapi/linux/tcp.h
+@@ -112,6 +112,7 @@ enum {
+ #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
+ #define TCP_TIMESTAMP 24
+ #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
++#define MPTCP_ENABLED 26
+
+ struct tcp_repair_opt {
+ __u32 opt_code;
+diff --git a/net/Kconfig b/net/Kconfig
+index d92afe4204d9..96b58593ad5e 100644
+--- a/net/Kconfig
++++ b/net/Kconfig
+@@ -79,6 +79,7 @@ if INET
+ source "net/ipv4/Kconfig"
+ source "net/ipv6/Kconfig"
+ source "net/netlabel/Kconfig"
++source "net/mptcp/Kconfig"
+
+ endif # if INET
+
+diff --git a/net/Makefile b/net/Makefile
+index cbbbe6d657ca..244bac1435b1 100644
+--- a/net/Makefile
++++ b/net/Makefile
+@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
+ obj-$(CONFIG_XFRM) += xfrm/
+ obj-$(CONFIG_UNIX) += unix/
+ obj-$(CONFIG_NET) += ipv6/
++obj-$(CONFIG_MPTCP) += mptcp/
+ obj-$(CONFIG_PACKET) += packet/
+ obj-$(CONFIG_NET_KEY) += key/
+ obj-$(CONFIG_BRIDGE) += bridge/
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..215d2757fbf6 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
+
+ dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
+ IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
+- IFF_AUTOMEDIA)) |
++ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
+ (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
+ IFF_ALLMULTI));
+
+diff --git a/net/core/request_sock.c b/net/core/request_sock.c
+index 467f326126e0..909dfa13f499 100644
+--- a/net/core/request_sock.c
++++ b/net/core/request_sock.c
+@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
+ EXPORT_SYMBOL(sysctl_max_syn_backlog);
+
+ int reqsk_queue_alloc(struct request_sock_queue *queue,
+- unsigned int nr_table_entries)
++ unsigned int nr_table_entries,
++ gfp_t flags)
+ {
+ size_t lopt_size = sizeof(struct listen_sock);
+ struct listen_sock *lopt;
+@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
+ nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
+ lopt_size += nr_table_entries * sizeof(struct request_sock *);
+ if (lopt_size > PAGE_SIZE)
+- lopt = vzalloc(lopt_size);
++ lopt = __vmalloc(lopt_size,
++ flags | __GFP_HIGHMEM | __GFP_ZERO,
++ PAGE_KERNEL);
+ else
+- lopt = kzalloc(lopt_size, GFP_KERNEL);
++ lopt = kzalloc(lopt_size, flags);
+ if (lopt == NULL)
+ return -ENOMEM;
+
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index c1a33033cbe2..8abc5d60fbe3 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
+ skb_drop_list(&skb_shinfo(skb)->frag_list);
+ }
+
+-static void skb_clone_fraglist(struct sk_buff *skb)
++void skb_clone_fraglist(struct sk_buff *skb)
+ {
+ struct sk_buff *list;
+
+@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
+ skb->inner_mac_header += off;
+ }
+
+-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
++void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+ {
+ __copy_skb_header(new, old);
+
+diff --git a/net/core/sock.c b/net/core/sock.c
+index 026e01f70274..359295523177 100644
+--- a/net/core/sock.c
++++ b/net/core/sock.c
+@@ -136,6 +136,11 @@
+
+ #include <trace/events/sock.h>
+
++#ifdef CONFIG_MPTCP
++#include <net/mptcp.h>
++#include <net/inet_common.h>
++#endif
++
+ #ifdef CONFIG_INET
+ #include <net/tcp.h>
+ #endif
+@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
+ "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
+ "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
+ };
+-static const char *const af_family_clock_key_strings[AF_MAX+1] = {
++char *const af_family_clock_key_strings[AF_MAX+1] = {
+ "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
+ "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
+ "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
+@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
+ * sk_callback_lock locking rules are per-address-family,
+ * so split the lock classes by using a per-AF key:
+ */
+-static struct lock_class_key af_callback_keys[AF_MAX];
++struct lock_class_key af_callback_keys[AF_MAX];
+
+ /* Take into consideration the size of the struct sk_buff overhead in the
+ * determination of these values, since that is non-constant across
+@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
+ }
+ }
+
+-#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
+-
+ static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
+ {
+ if (sk->sk_flags & flags) {
+@@ -1253,8 +1256,25 @@ lenout:
+ *
+ * (We also register the sk_lock with the lock validator.)
+ */
+-static inline void sock_lock_init(struct sock *sk)
+-{
++void sock_lock_init(struct sock *sk)
++{
++#ifdef CONFIG_MPTCP
++ /* Reclassify the lock-class for subflows */
++ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
++ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
++ &meta_slock_key,
++ "sk_lock-AF_INET-MPTCP",
++ &meta_key);
++
++ /* We don't yet have the mptcp-point.
++ * Thus we still need inet_sock_destruct
++ */
++ sk->sk_destruct = inet_sock_destruct;
++ return;
++ }
++#endif
++
+ sock_lock_init_class_and_name(sk,
+ af_family_slock_key_strings[sk->sk_family],
+ af_family_slock_keys + sk->sk_family,
+@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
+ }
+ EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
+
+-static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
++struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
+ int family)
+ {
+ struct sock *sk;
+diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
+index 4db3c2a1679c..04cb17d4b0ce 100644
+--- a/net/dccp/ipv6.c
++++ b/net/dccp/ipv6.c
+@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
+ goto drop;
+
+- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
++ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
+ if (req == NULL)
+ goto drop;
+
+diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
+index 05c57f0fcabe..630434db0085 100644
+--- a/net/ipv4/Kconfig
++++ b/net/ipv4/Kconfig
+@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+ For further details see:
+ http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
+
++config TCP_CONG_COUPLED
++ tristate "MPTCP COUPLED CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Coupled Congestion Control
++ To enable it, just put 'coupled' in tcp_congestion_control
++
++config TCP_CONG_OLIA
++ tristate "MPTCP Opportunistic Linked Increase"
++ depends on MPTCP
++ default n
++ ---help---
++ MultiPath TCP Opportunistic Linked Increase Congestion Control
++ To enable it, just put 'olia' in tcp_congestion_control
++
++config TCP_CONG_WVEGAS
++ tristate "MPTCP WVEGAS CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ wVegas congestion control for MPTCP
++ To enable it, just put 'wvegas' in tcp_congestion_control
++
+ choice
+ prompt "Default TCP congestion control"
+ default DEFAULT_CUBIC
+@@ -584,6 +608,15 @@ choice
+ config DEFAULT_WESTWOOD
+ bool "Westwood" if TCP_CONG_WESTWOOD=y
+
++ config DEFAULT_COUPLED
++ bool "Coupled" if TCP_CONG_COUPLED=y
++
++ config DEFAULT_OLIA
++ bool "Olia" if TCP_CONG_OLIA=y
++
++ config DEFAULT_WVEGAS
++ bool "Wvegas" if TCP_CONG_WVEGAS=y
++
+ config DEFAULT_RENO
+ bool "Reno"
+
+@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+ default "vegas" if DEFAULT_VEGAS
+ default "westwood" if DEFAULT_WESTWOOD
+ default "veno" if DEFAULT_VENO
++ default "coupled" if DEFAULT_COUPLED
++ default "wvegas" if DEFAULT_WVEGAS
+ default "reno" if DEFAULT_RENO
+ default "cubic"
+
+diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
+index d156b3c5f363..4afd6d8d9028 100644
+--- a/net/ipv4/af_inet.c
++++ b/net/ipv4/af_inet.c
+@@ -104,6 +104,7 @@
+ #include <net/ip_fib.h>
+ #include <net/inet_connection_sock.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/ping.h>
+@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
+ * Create an inet socket.
+ */
+
+-static int inet_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct sock *sk;
+ struct inet_protosw *answer;
+@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
+ lock_sock(sk2);
+
+ sock_rps_record_flow(sk2);
++
++ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
++ struct sock *sk_it = sk2;
++
++ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++
++ if (tcp_sk(sk2)->mpcb->master_sk) {
++ sk_it = tcp_sk(sk2)->mpcb->master_sk;
++
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_it->sk_wq = newsock->wq;
++ sk_it->sk_socket = newsock;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++ }
++
+ WARN_ON(!((1 << sk2->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_SYN_RECV |
+ TCPF_CLOSE_WAIT | TCPF_CLOSE)));
+@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
+
+ ip_init();
+
++ /* We must initialize MPTCP before TCP. */
++ mptcp_init();
++
+ tcp_v4_init();
+
+ /* Setup TCP slab cache for open requests. */
+diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
+index 14d02ea905b6..7d734d8af19b 100644
+--- a/net/ipv4/inet_connection_sock.c
++++ b/net/ipv4/inet_connection_sock.c
+@@ -23,6 +23,7 @@
+ #include <net/route.h>
+ #include <net/tcp_states.h>
+ #include <net/xfrm.h>
++#include <net/mptcp.h>
+
+ #ifdef INET_CSK_DEBUG
+ const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
+@@ -465,8 +466,8 @@ no_route:
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
+
+-static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
++ const u32 synq_hsize)
+ {
+ return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
+ }
+@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
+
+ lopt->clock_hand = i;
+
+- if (lopt->qlen)
++ if (lopt->qlen && !is_meta_sk(parent))
+ inet_csk_reset_keepalive_timer(parent, interval);
+ }
+ EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
+@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority)
+ {
+- struct sock *newsk = sk_clone_lock(sk, priority);
++ struct sock *newsk;
++
++ newsk = sk_clone_lock(sk, priority);
+
+ if (newsk != NULL) {
+ struct inet_connection_sock *newicsk = inet_csk(newsk);
+@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
++ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
++ GFP_KERNEL);
+
+ if (rc != 0)
+ return rc;
+@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ while ((req = acc_req) != NULL) {
+ struct sock *child = req->sk;
++ bool mutex_taken = false;
+
+ acc_req = req->dl_next;
+
++ if (is_meta_sk(child)) {
++ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
++ mutex_taken = true;
++ }
+ local_bh_disable();
+ bh_lock_sock(child);
+ WARN_ON(sock_owned_by_user(child));
+@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
+
+ bh_unlock_sock(child);
+ local_bh_enable();
++ if (mutex_taken)
++ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
+ sock_put(child);
+
+ sk_acceptq_removed(sk);
+diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
+index c86624b36a62..0ff3fe004d62 100644
+--- a/net/ipv4/syncookies.c
++++ b/net/ipv4/syncookies.c
+@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
+ }
+ EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
+
+-__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
++__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
++ __u16 *mssp)
+ {
+ const struct iphdr *iph = ip_hdr(skb);
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
+ /* Try to redo what tcp_v4_send_synack did. */
+ req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
+
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(&rt->dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(&rt->dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..2cb89f886d45 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -271,6 +271,7 @@
+
+ #include <net/icmp.h>
+ #include <net/inet_common.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/xfrm.h>
+ #include <net/ip.h>
+@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+ return period;
+ }
+
++const struct tcp_sock_ops tcp_specific = {
++ .__select_window = __tcp_select_window,
++ .select_window = tcp_select_window,
++ .select_initial_window = tcp_select_initial_window,
++ .init_buffer_space = tcp_init_buffer_space,
++ .set_rto = tcp_set_rto,
++ .should_expand_sndbuf = tcp_should_expand_sndbuf,
++ .init_congestion_control = tcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
+ /* Address-family independent initialization for a tcp_sock.
+ *
+ * NOTE: A lot of things set to zero explicitly by call to
+@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+ sk->sk_sndbuf = sysctl_tcp_wmem[1];
+ sk->sk_rcvbuf = sysctl_tcp_rmem[1];
+
++ tp->ops = &tcp_specific;
++
+ local_bh_disable();
+ sock_update_memcg(sk);
+ sk_sockets_allocated_inc(sk);
+@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ int ret;
+
+ sock_rps_record_flow(sk);
++
++#ifdef CONFIG_MPTCP
++ if (mptcp(tcp_sk(sk))) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
+ /*
+ * We can't seek on a socket input
+ */
+@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+ return NULL;
+ }
+
+-static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
+- int large_allowed)
++unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 xmit_size_goal, old_size_goal;
+@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+ {
+ int mss_now;
+
+- mss_now = tcp_current_mss(sk);
+- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ if (mptcp(tcp_sk(sk))) {
++ mss_now = mptcp_current_mss(sk);
++ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ } else {
++ mss_now = tcp_current_mss(sk);
++ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
++ }
+
+ return mss_now;
+ }
+@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto out_err;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++
++ /* We must check this with the socket lock held because we iterate
++ * over the subflows.
++ */
++ if (!mptcp_can_sendpage(sk)) {
++ ssize_t ret;
++
++ release_sock(sk);
++ ret = sock_no_sendpage(sk->sk_socket, page, offset,
++ size, flags);
++ lock_sock(sk);
++ return ret;
++ }
++
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ mss_now = tcp_send_mss(sk, &size_goal, flags);
+@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+ {
+ ssize_t res;
+
+- if (!(sk->sk_route_caps & NETIF_F_SG) ||
+- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
++ /* If MPTCP is enabled, we check it later after establishment */
++ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
++ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
+ return sock_no_sendpage(sk->sk_socket, page, offset, size,
+ flags);
+
+@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+ const struct tcp_sock *tp = tcp_sk(sk);
+ int tmp = tp->mss_cache;
+
++ if (mptcp(tp))
++ return mptcp_select_size(sk, sg);
++
+ if (sg) {
+ if (sk_can_gso(sk)) {
+ /* Small frames wont use a full page:
+@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ * is fully established.
+ */
+ if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
+- !tcp_passive_fastopen(sk)) {
++ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
++ tp->mpcb->master_sk : sk)) {
+ if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
+ goto do_error;
+ }
+
++ if (mptcp(tp)) {
++ struct sock *sk_it = sk;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++
+ if (unlikely(tp->repair)) {
+ if (tp->repair_queue == TCP_RECV_QUEUE) {
+ copied = tcp_send_rcvq(sk, msg, size);
+@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
+ goto out_err;
+
+- sg = !!(sk->sk_route_caps & NETIF_F_SG);
++ if (mptcp(tp))
++ sg = mptcp_can_sg(sk);
++ else
++ sg = !!(sk->sk_route_caps & NETIF_F_SG);
+
+ while (--iovlen >= 0) {
+ size_t seglen = iov->iov_len;
+@@ -1183,8 +1251,15 @@ new_segment:
+
+ /*
+ * Check whether we can use HW checksum.
++ *
++ * If dss-csum is enabled, we do not do hw-csum.
++ * In case of non-mptcp we check the
++ * device-capabilities.
++ * In case of mptcp, hw-csum's will be handled
++ * later in mptcp_write_xmit.
+ */
+- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
++ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
++ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ skb_entail(sk, skb);
+@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+
+ /* Optimize, __tcp_select_window() is not cheap. */
+ if (2*rcv_window_now <= tp->window_clamp) {
+- __u32 new_window = __tcp_select_window(sk);
++ __u32 new_window = tp->ops->__select_window(sk);
+
+ /* Send ACK now, if this read freed lots of space
+ * in our buffer. Certainly, new_window is new window.
+@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ /* Clean up data we have read: This will do ACK frames. */
+ if (copied > 0) {
+ tcp_recv_skb(sk, seq, &offset);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ }
+ return copied;
+ }
+@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+
+ lock_sock(sk);
+
++#ifdef CONFIG_MPTCP
++ if (mptcp(tp)) {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it)
++ sock_rps_record_flow(sk_it);
++ }
++#endif
++
+ err = -ENOTCONN;
+ if (sk->sk_state == TCP_LISTEN)
+ goto out;
+@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ }
+ }
+
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+ /* Install new reader */
+@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+ if (tp->rcv_wnd == 0 &&
+ !skb_queue_empty(&sk->sk_async_wait_queue)) {
+ tcp_service_net_dma(sk, true);
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+ } else
+ dma_async_issue_pending(tp->ucopy.dma_chan);
+ }
+@@ -1993,7 +2076,7 @@ skip_copy:
+ */
+
+ /* Clean up data we have read: This will do ACK frames. */
+- tcp_cleanup_rbuf(sk, copied);
++ tp->ops->cleanup_rbuf(sk, copied);
+
+ release_sock(sk);
+ return copied;
+@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+ /* TCP_CLOSING */ TCP_CLOSING,
+ };
+
+-static int tcp_close_state(struct sock *sk)
++int tcp_close_state(struct sock *sk)
+ {
+ int next = (int)new_state[sk->sk_state];
+ int ns = next & TCP_STATE_MASK;
+@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+ TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
+ /* Clear out any half completed packets. FIN if needed. */
+ if (tcp_close_state(sk))
+- tcp_send_fin(sk);
++ tcp_sk(sk)->ops->send_fin(sk);
+ }
+ }
+ EXPORT_SYMBOL(tcp_shutdown);
+@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+ int data_was_unread = 0;
+ int state;
+
++ if (is_meta_sk(sk)) {
++ mptcp_close(sk, timeout);
++ return;
++ }
++
+ lock_sock(sk);
+ sk->sk_shutdown = SHUTDOWN_MASK;
+
+@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+ /* Unread data was tossed, zap the connection. */
+ NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, sk->sk_allocation);
++ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
+ } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
+ /* Check zero linger _after_ checking for unread data. */
+ sk->sk_prot->disconnect(sk, 0);
+@@ -2247,7 +2335,7 @@ adjudge_to_death:
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (tp->linger2 < 0) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONLINGER);
+ } else {
+@@ -2257,7 +2345,8 @@ adjudge_to_death:
+ inet_csk_reset_keepalive_timer(sk,
+ tmo - TCP_TIMEWAIT_LEN);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
++ tmo);
+ goto out;
+ }
+ }
+@@ -2266,7 +2355,7 @@ adjudge_to_death:
+ sk_mem_reclaim(sk);
+ if (tcp_check_oom(sk, 0)) {
+ tcp_set_state(sk, TCP_CLOSE);
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
+ NET_INC_STATS_BH(sock_net(sk),
+ LINUX_MIB_TCPABORTONMEMORY);
+ }
+@@ -2291,15 +2380,6 @@ out:
+ }
+ EXPORT_SYMBOL(tcp_close);
+
+-/* These states need RST on ABORT according to RFC793 */
+-
+-static inline bool tcp_need_reset(int state)
+-{
+- return (1 << state) &
+- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
+- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
+-}
+-
+ int tcp_disconnect(struct sock *sk, int flags)
+ {
+ struct inet_sock *inet = inet_sk(sk);
+@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+ /* The last check adjusts for discrepancy of Linux wrt. RFC
+ * states
+ */
+- tcp_send_active_reset(sk, gfp_any());
++ tp->ops->send_active_reset(sk, gfp_any());
+ sk->sk_err = ECONNRESET;
+ } else if (old_state == TCP_SYN_SENT)
+ sk->sk_err = ECONNRESET;
+@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+ if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
+ inet_reset_saddr(sk);
+
++ if (is_meta_sk(sk)) {
++ mptcp_disconnect(sk);
++ } else {
++ if (tp->inside_tk_table)
++ mptcp_hash_remove_bh(tp);
++ }
++
+ sk->sk_shutdown = 0;
+ sock_reset_flag(sk, SOCK_DONE);
+ tp->srtt_us = 0;
+@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ break;
+
+ case TCP_DEFER_ACCEPT:
++ /* An established MPTCP-connection (mptcp(tp) only returns true
++ * if the socket is established) should not use DEFER on new
++ * subflows.
++ */
++ if (mptcp(tp))
++ break;
+ /* Translate value in seconds to number of retransmits */
+ icsk->icsk_accept_queue.rskq_defer_accept =
+ secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
+ inet_csk_ack_scheduled(sk)) {
+ icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
+- tcp_cleanup_rbuf(sk, 1);
++ tp->ops->cleanup_rbuf(sk, 1);
+ if (!(val & 1))
+ icsk->icsk_ack.pingpong = 1;
+ }
+@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+ tp->notsent_lowat = val;
+ sk->sk_write_space(sk);
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
++ if (val)
++ tp->mptcp_enabled = 1;
++ else
++ tp->mptcp_enabled = 0;
++ } else {
++ err = -EPERM;
++ }
++ break;
++#endif
+ default:
+ err = -ENOPROTOOPT;
+ break;
+@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+ case TCP_NOTSENT_LOWAT:
+ val = tp->notsent_lowat;
+ break;
++#ifdef CONFIG_MPTCP
++ case MPTCP_ENABLED:
++ val = tp->mptcp_enabled;
++ break;
++#endif
+ default:
+ return -ENOPROTOOPT;
+ }
+@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+ if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
+
++ WARN_ON(sk->sk_state == TCP_CLOSE);
+ tcp_set_state(sk, TCP_CLOSE);
++
+ tcp_clear_xmit_timers(sk);
++
+ if (req != NULL)
+ reqsk_fastopen_remove(sk, req, false);
+
+diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
+index 9771563ab564..5c230d96c4c1 100644
+--- a/net/ipv4/tcp_fastopen.c
++++ b/net/ipv4/tcp_fastopen.c
+@@ -7,6 +7,7 @@
+ #include <linux/rculist.h>
+ #include <net/inetpeer.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
+
+ int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
+
+@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ {
+ struct tcp_sock *tp;
+ struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
+- struct sock *child;
++ struct sock *child, *meta_sk;
+
+ req->num_retrans = 0;
+ req->num_timeout = 0;
+@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ /* Add the child socket directly into the accept queue */
+ inet_csk_reqsk_queue_add(sk, req, child);
+
+- /* Now finish processing the fastopen child socket. */
+- inet_csk(child)->icsk_af_ops->rebuild_header(child);
+- tcp_init_congestion_control(child);
+- tcp_mtup_init(child);
+- tcp_init_metrics(child);
+- tcp_init_buffer_space(child);
+-
+ /* Queue the data carried in the SYN packet. We need to first
+ * bump skb's refcnt because the caller will attempt to free it.
+ *
+@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
+ tp->syn_data_acked = 1;
+ }
+ tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++
++ meta_sk = child;
++ if (!mptcp_check_req_fastopen(meta_sk, req)) {
++ child = tcp_sk(meta_sk)->mpcb->master_sk;
++ tp = tcp_sk(child);
++ }
++
++ /* Now finish processing the fastopen child socket. */
++ inet_csk(child)->icsk_af_ops->rebuild_header(child);
++ tp->ops->init_congestion_control(child);
++ tcp_mtup_init(child);
++ tcp_init_metrics(child);
++ tp->ops->init_buffer_space(child);
++
+ sk->sk_data_ready(sk);
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ WARN_ON(req->sk == NULL);
+ return true;
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..3273bb69f387 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -74,6 +74,9 @@
+ #include <linux/ipsec.h>
+ #include <asm/unaligned.h>
+ #include <net/netdma.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
+
+ int sysctl_tcp_timestamps __read_mostly = 1;
+ int sysctl_tcp_window_scaling __read_mostly = 1;
+@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
+ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
+ int sysctl_tcp_early_retrans __read_mostly = 3;
+
+-#define FLAG_DATA 0x01 /* Incoming frame contained data. */
+-#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
+-#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
+-#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
+-#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
+-#define FLAG_DATA_SACKED 0x20 /* New SACK. */
+-#define FLAG_ECE 0x40 /* ECE in this ACK */
+-#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
+-#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
+-#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
+-#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
+-#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
+-#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
+-
+-#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
+-#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
+-#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
+-#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
+-
+ #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
+ #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
+
+@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
+ icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+ }
+
+-static void tcp_enter_quickack_mode(struct sock *sk)
++void tcp_enter_quickack_mode(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ tcp_incr_quickack(sk);
+@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ per_mss = roundup_pow_of_two(per_mss) +
+ SKB_DATA_ALIGN(sizeof(struct sk_buff));
+
+- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
+- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ if (mptcp(tp)) {
++ nr_segs = mptcp_check_snd_buf(tp);
++ } else {
++ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
++ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
++ }
+
+ /* Fast Recovery (RFC 5681 3.2) :
+ * Cubic needs 1.7 factor, rounded to 2 to include
+@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
+ */
+ sndmem = 2 * nr_segs * per_mss;
+
+- if (sk->sk_sndbuf < sndmem)
++ /* MPTCP: after this sndmem is the new contribution of the
++ * current subflow to the aggregated sndbuf */
++ if (sk->sk_sndbuf < sndmem) {
++ int old_sndbuf = sk->sk_sndbuf;
+ sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
++ /* MPTCP: ok, the subflow sndbuf has grown, reflect
++ * this in the aggregate buffer.*/
++ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
++ mptcp_update_sndbuf(tp);
++ }
+ }
+
+ /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
+@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
+ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
+
+ /* Check #1 */
+- if (tp->rcv_ssthresh < tp->window_clamp &&
+- (int)tp->rcv_ssthresh < tcp_space(sk) &&
++ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
+ !sk_under_memory_pressure(sk)) {
+ int incr;
+
+@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
+ * will fit to rcvbuf in future.
+ */
+ if (tcp_win_from_space(skb->truesize) <= skb->len)
+- incr = 2 * tp->advmss;
++ incr = 2 * meta_tp->advmss;
+ else
+- incr = __tcp_grow_window(sk, skb);
++ incr = __tcp_grow_window(meta_sk, skb);
+
+ if (incr) {
+ incr = max_t(int, incr, 2 * skb->len);
+- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
+- tp->window_clamp);
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
++ meta_tp->window_clamp);
+ inet_csk(sk)->icsk_ack.quick |= 1;
+ }
+ }
+@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
+ int copied;
+
+ time = tcp_time_stamp - tp->rcvq_space.time;
+- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
++ if (mptcp(tp)) {
++ if (mptcp_check_rtt(tp, time))
++ return;
++ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+ return;
+
+ /* Number of bytes copied to user in last RTT */
+@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
+ /* Calculate rto without backoff. This is the second half of Van Jacobson's
+ * routine referred to above.
+ */
+-static void tcp_set_rto(struct sock *sk)
++void tcp_set_rto(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ /* Old crap is replaced with new one. 8)
+@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
+ int len;
+ int in_sack;
+
+- if (!sk_can_gso(sk))
++ /* For MPTCP we cannot shift skb-data and remove one skb from the
++ * send-queue, because this will make us lose the DSS-option (which
++ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
++ */
++ if (!sk_can_gso(sk) || mptcp(tp))
+ goto fallback;
+
+ /* Normally R but no L won't result in plain S */
+@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
+ return false;
+
+ tcp_rtt_estimator(sk, seq_rtt_us);
+- tcp_set_rto(sk);
++ tp->ops->set_rto(sk);
+
+ /* RFC6298: only reset backoff on valid RTT measurement. */
+ inet_csk(sk)->icsk_backoff = 0;
+@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
+ }
+
+ /* If we get here, the whole TSO packet has not been acked. */
+-static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
++u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 packets_acked;
+@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ */
+ if (!(scb->tcp_flags & TCPHDR_SYN)) {
+ flag |= FLAG_DATA_ACKED;
++ if (mptcp(tp) && mptcp_is_data_seq(skb))
++ flag |= MPTCP_FLAG_DATA_ACKED;
+ } else {
+ flag |= FLAG_SYN_ACKED;
+ tp->retrans_stamp = 0;
+@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
+ return flag;
+ }
+
+-static void tcp_ack_probe(struct sock *sk)
++void tcp_ack_probe(struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
+ /* Check that window update is acceptable.
+ * The function assumes that snd_una<=ack<=snd_next.
+ */
+-static inline bool tcp_may_update_window(const struct tcp_sock *tp,
+- const u32 ack, const u32 ack_seq,
+- const u32 nwin)
++bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
++ const u32 ack_seq, const u32 nwin)
+ {
+ return after(ack, tp->snd_una) ||
+ after(ack_seq, tp->snd_wl1) ||
+@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
+ }
+
+ /* This routine deals with incoming acks, but not outgoing ones. */
+-static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
++static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ sack_rtt_us);
+ acked -= tp->packets_out;
+
++ if (mptcp(tp)) {
++ if (mptcp_fallback_infinite(sk, flag)) {
++ pr_err("%s resetting flow\n", __func__);
++ mptcp_send_reset(sk);
++ goto invalid_ack;
++ }
++
++ mptcp_clean_rtx_infinite(skb, sk);
++ }
++
+ /* Advance cwnd if state allows */
+ if (tcp_may_raise_cwnd(sk, flag))
+ tcp_cong_avoid(sk, ack, acked);
+@@ -3512,8 +3528,9 @@ old_ack:
+ * the fast version below fails.
+ */
+ void tcp_parse_options(const struct sk_buff *skb,
+- struct tcp_options_received *opt_rx, int estab,
+- struct tcp_fastopen_cookie *foc)
++ struct tcp_options_received *opt_rx,
++ struct mptcp_options_received *mopt,
++ int estab, struct tcp_fastopen_cookie *foc)
+ {
+ const unsigned char *ptr;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
+ */
+ break;
+ #endif
++ case TCPOPT_MPTCP:
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ break;
+ case TCPOPT_EXP:
+ /* Fast Open option shares code 254 using a
+ * 16 bits magic number. It's valid only in
+@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
+ if (tcp_parse_aligned_timestamp(tp, th))
+ return true;
+ }
+-
+- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
++ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
++ 1, NULL);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
+ dst = __sk_dst_get(sk);
+ if (!dst || !dst_metric(dst, RTAX_QUICKACK))
+ inet_csk(sk)->icsk_ack.pingpong = 1;
++ if (mptcp(tp))
++ mptcp_sub_close_passive(sk);
+ break;
+
+ case TCP_CLOSE_WAIT:
+@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
+ tcp_set_state(sk, TCP_CLOSING);
+ break;
+ case TCP_FIN_WAIT2:
++ if (mptcp(tp)) {
++ /* The socket will get closed by mptcp_data_ready.
++ * We first have to process all data-sequences.
++ */
++ tp->close_it = 1;
++ break;
++ }
+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
+ tcp_send_ack(sk);
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ break;
+ default:
+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
+@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
+ if (!sock_flag(sk, SOCK_DEAD)) {
+ sk->sk_state_change(sk);
+
++ /* Don't wake up MPTCP-subflows */
++ if (mptcp(tp))
++ return;
++
+ /* Do not send POLL_HUP for half duplex close. */
+ if (sk->sk_shutdown == SHUTDOWN_MASK ||
+ sk->sk_state == TCP_CLOSE)
+@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
+ }
+
+- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
++ /* In case of MPTCP, the segment may be empty if it's a
++ * non-data DATA_FIN. (see beginning of tcp_data_queue)
++ */
++ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
++ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
+ SOCK_DEBUG(sk, "ofo packet was already received\n");
+ __skb_unlink(skb, &tp->out_of_order_queue);
+ __kfree_skb(skb);
+@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
+ }
+ }
+
+-static bool tcp_prune_ofo_queue(struct sock *sk);
+ static int tcp_prune_queue(struct sock *sk);
+
+ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ unsigned int size)
+ {
++ if (mptcp(tcp_sk(sk)))
++ sk = mptcp_meta_sk(sk);
++
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+ !sk_rmem_schedule(sk, skb, size)) {
+
+@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size)) {
+- if (!tcp_prune_ofo_queue(sk))
++ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
+ return -1;
+
+ if (!sk_rmem_schedule(sk, skb, size))
+@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+ * Better try to coalesce them right now to avoid future collapses.
+ * Returns true if caller should free @from instead of queueing it
+ */
+-static bool tcp_try_coalesce(struct sock *sk,
+- struct sk_buff *to,
+- struct sk_buff *from,
+- bool *fragstolen)
++bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
++ bool *fragstolen)
+ {
+ int delta;
+
+ *fragstolen = false;
+
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ return false;
++
+ if (tcp_hdr(from)->fin)
+ return false;
+
+@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+
+ /* Do skb overlap to previous one? */
+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
++ !(mptcp(tp) && end_seq == seq)) {
+ /* All the bits are present. Drop. */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
+ __kfree_skb(skb);
+@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+ end_seq);
+ break;
+ }
++ /* MPTCP allows non-data data-fin to be in the ofo-queue */
++ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
++ continue;
+ __skb_unlink(skb1, &tp->out_of_order_queue);
+ tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+ TCP_SKB_CB(skb1)->end_seq);
+@@ -4280,8 +4325,8 @@ end:
+ }
+ }
+
+-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
+- bool *fragstolen)
++int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
++ bool *fragstolen)
+ {
+ int eaten;
+ struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
+ int eaten = -1;
+ bool fragstolen = false;
+
+- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
++ /* If no data is present, but a data_fin is in the options, we still
++ * have to call mptcp_queue_skb later on. */
++ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
++ !(mptcp(tp) && mptcp_is_data_fin(skb)))
+ goto drop;
+
+ skb_dst_drop(skb);
+@@ -4389,7 +4437,7 @@ queue_and_out:
+ eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+ }
+ tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+- if (skb->len)
++ if (skb->len || mptcp_is_data_fin(skb))
+ tcp_event_data_recv(sk, skb);
+ if (th->fin)
+ tcp_fin(sk);
+@@ -4411,7 +4459,11 @@ queue_and_out:
+
+ if (eaten > 0)
+ kfree_skb_partial(skb, fragstolen);
+- if (!sock_flag(sk, SOCK_DEAD))
++ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
++ /* MPTCP: we always have to call data_ready, because
++ * we may be about to receive a data-fin, which still
++ * must get queued.
++ */
+ sk->sk_data_ready(sk);
+ return;
+ }
+@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
+ next = skb_queue_next(list, skb);
+
+ __skb_unlink(skb, list);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
+ __kfree_skb(skb);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
+
+@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
+ * Purge the out-of-order queue.
+ * Return true if queue was pruned.
+ */
+-static bool tcp_prune_ofo_queue(struct sock *sk)
++bool tcp_prune_ofo_queue(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool res = false;
+@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
+ /* Collapsing did not help, destructive actions follow.
+ * This must not ever occur. */
+
+- tcp_prune_ofo_queue(sk);
++ tp->ops->prune_ofo_queue(sk);
+
+ if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+ return 0;
+@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
+ return -1;
+ }
+
+-static bool tcp_should_expand_sndbuf(const struct sock *sk)
++/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
++ * As additional protections, we do not touch cwnd in retransmission phases,
++ * and if application hit its sndbuf limit recently.
++ */
++void tcp_cwnd_application_limited(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
++ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
++ /* Limited by application or receiver window. */
++ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
++ u32 win_used = max(tp->snd_cwnd_used, init_win);
++ if (win_used < tp->snd_cwnd) {
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
++ }
++ tp->snd_cwnd_used = 0;
++ }
++ tp->snd_cwnd_stamp = tcp_time_stamp;
++}
++
++bool tcp_should_expand_sndbuf(const struct sock *sk)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+- if (tcp_should_expand_sndbuf(sk)) {
++ if (tp->ops->should_expand_sndbuf(sk)) {
+ tcp_sndbuf_expand(sk);
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
+ {
+ if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
+ sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
+- if (sk->sk_socket &&
+- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
++ if (mptcp(tcp_sk(sk)) ||
++ (sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
+ tcp_new_space(sk);
+ }
+ }
+@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+ /* ... and right edge of window advances far enough.
+ * (tcp_recvmsg() will send ACK otherwise). Or...
+ */
+- __tcp_select_window(sk) >= tp->rcv_wnd) ||
++ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
+ /* We ACK each frame or... */
+ tcp_in_quickack_mode(sk) ||
+ /* We have out of order data. */
+@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
++ /* MPTCP urgent data is not yet supported */
++ if (mptcp(tp))
++ return;
++
+ /* Check if we get a new urgent pointer - normally not. */
+ if (th->urg)
+ tcp_check_urg(sk, th);
+@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
+ }
+
+ #ifdef CONFIG_NET_DMA
+-static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
+- int hlen)
++bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ int chunk = skb->len - hlen;
+@@ -5052,9 +5132,15 @@ syn_challenge:
+ goto discard;
+ }
+
++ /* If valid: post process the received MPTCP options. */
++ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
++ goto discard;
++
+ return true;
+
+ discard:
++ if (mptcp(tp))
++ mptcp_reset_mopt(tp);
+ __kfree_skb(skb);
+ return false;
+ }
+@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+
+ tp->rx_opt.saw_tstamp = 0;
+
++ /* MPTCP: force slowpath. */
++ if (mptcp(tp))
++ goto slow_path;
++
+ /* pred_flags is 0xS?10 << 16 + snd_wnd
+ * if header_prediction is to be made
+ * 'S' will always be tp->tcp_header_len >> 2
+@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
+ }
+ if (copied_early)
+- tcp_cleanup_rbuf(sk, skb->len);
++ tp->ops->cleanup_rbuf(sk, skb->len);
+ }
+ if (!eaten) {
+ if (tcp_checksum_complete_user(sk, skb))
+@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
+
+ tcp_init_metrics(sk);
+
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ /* Prevent spurious tcp_cwnd_restart() on first data
+ * packet.
+ */
+ tp->lsndtime = tcp_time_stamp;
+
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+
+ if (sock_flag(sk, SOCK_KEEPOPEN))
+ inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
+@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+ /* Get original SYNACK MSS value if user MSS sets mss_clamp */
+ tcp_clear_options(&opt);
+ opt.user_mss = opt.mss_clamp = 0;
+- tcp_parse_options(synack, &opt, 0, NULL);
++ tcp_parse_options(synack, &opt, NULL, 0, NULL);
+ mss = opt.mss_clamp;
+ }
+
+@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
+
+ tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
+
+- if (data) { /* Retransmit unacked data in SYN */
++ /* In mptcp case, we do not rely on "retransmit", but instead on
++ * "transmit", because if fastopen data is not acked, the retransmission
++ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
++ */
++ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
+ tcp_for_write_queue_from(data, sk) {
+ if (data == tcp_send_head(sk) ||
+ __tcp_retransmit_skb(sk, data))
+@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct tcp_fastopen_cookie foc = { .len = -1 };
+ int saved_clamp = tp->rx_opt.mss_clamp;
++ struct mptcp_options_received mopt;
++ mptcp_init_mp_opt(&mopt);
+
+- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
++ tcp_parse_options(skb, &tp->rx_opt,
++ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
+ if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
+ tp->rx_opt.rcv_tsecr -= tp->tsoffset;
+
+@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
+ tcp_ack(sk, skb, FLAG_SLOWPATH);
+
++ if (tp->request_mptcp || mptcp(tp)) {
++ int ret;
++ ret = mptcp_rcv_synsent_state_process(sk, &sk,
++ skb, &mopt);
++
++ /* May have changed if we support MPTCP */
++ tp = tcp_sk(sk);
++ icsk = inet_csk(sk);
++
++ if (ret == 1)
++ goto reset_and_undo;
++ if (ret == 2)
++ goto discard;
++ }
++
++ if (mptcp(tp) && !is_master_tp(tp)) {
++ /* Timer for repeating the ACK until an answer
++ * arrives. Used only when establishing an additional
++ * subflow inside of an MPTCP connection.
++ */
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ }
++
+ /* Ok.. it's good. Set up sequence numbers and
+ * move to established.
+ */
+@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ if (tcp_is_sack(tp) && sysctl_tcp_fack)
+ tcp_enable_fack(tp);
+
+@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_rcv_fastopen_synack(sk, skb, &foc))
+ return -1;
+
+- if (sk->sk_write_pending ||
++ /* With MPTCP we cannot send data on the third ack due to the
++ * lack of option-space to combine with an MP_CAPABLE.
++ */
++ if (!mptcp(tp) && (sk->sk_write_pending ||
+ icsk->icsk_accept_queue.rskq_defer_accept ||
+- icsk->icsk_ack.pingpong) {
++ icsk->icsk_ack.pingpong)) {
+ /* Save one ACK. Data will be ready after
+ * several ticks, if write_pending is set.
+ *
+@@ -5536,6 +5665,7 @@ discard:
+ tcp_paws_reject(&tp->rx_opt, 0))
+ goto discard_and_undo;
+
++ /* TODO - check this here for MPTCP */
+ if (th->syn) {
+ /* We see SYN without ACK. It is attempt of
+ * simultaneous connect with crossed SYNs.
+@@ -5552,6 +5682,11 @@ discard:
+ tp->tcp_header_len = sizeof(struct tcphdr);
+ }
+
++ if (mptcp(tp)) {
++ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
++ }
++
+ tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
+ tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
+
+@@ -5610,6 +5745,7 @@ reset_and_undo:
+
+ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ const struct tcphdr *th, unsigned int len)
++ __releases(&sk->sk_lock.slock)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct inet_connection_sock *icsk = inet_csk(sk);
+@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_SYN_SENT:
+ queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
++ if (is_meta_sk(sk)) {
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ tp = tcp_sk(sk);
++
++ /* Need to call it here, because it will announce new
++ * addresses, which can only be done after the third ack
++ * of the 3-way handshake.
++ */
++ mptcp_update_metasocket(sk, tp->meta_sk);
++ }
+ if (queued >= 0)
+ return queued;
+
+@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tcp_urg(sk, skb, th);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
++ if (mptcp(tp) && is_master_tp(tp))
++ bh_unlock_sock(sk);
+ return 0;
+ }
+
+@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ synack_stamp = tp->lsndtime;
+ /* Make sure socket is routed, for correct metrics. */
+ icsk->icsk_af_ops->rebuild_header(sk);
+- tcp_init_congestion_control(sk);
++ tp->ops->init_congestion_control(sk);
+
+ tcp_mtup_init(sk);
+ tp->copied_seq = tp->rcv_nxt;
+- tcp_init_buffer_space(sk);
++ tp->ops->init_buffer_space(sk);
+ }
+ smp_mb();
+ tcp_set_state(sk, TCP_ESTABLISHED);
+@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ if (tp->rx_opt.tstamp_ok)
+ tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
++ if (mptcp(tp))
++ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
+
+ if (req) {
+ /* Re-arm the timer because data may have been sent out.
+@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ tcp_initialize_rcv_mss(sk);
+ tcp_fast_path_on(tp);
++ /* Send an ACK when establishing a new
++ * MPTCP subflow, i.e. using an MP_JOIN
++ * subtype.
++ */
++ if (mptcp(tp) && !is_master_tp(tp))
++ tcp_send_ack(sk);
+ break;
+
+ case TCP_FIN_WAIT1: {
+@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ tmo = tcp_fin_time(sk);
+ if (tmo > TCP_TIMEWAIT_LEN) {
+ inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
+- } else if (th->fin || sock_owned_by_user(sk)) {
++ } else if (th->fin || mptcp_is_data_fin(skb) ||
++ sock_owned_by_user(sk)) {
+ /* Bad case. We could lose such FIN otherwise.
+ * It is not a big problem, but it looks confusing
+ * and not so rare event. We still can lose it now,
+@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ inet_csk_reset_keepalive_timer(sk, tmo);
+ } else {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto discard;
+ }
+ break;
+@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+
+ case TCP_CLOSING:
+ if (tp->snd_una == tp->write_seq) {
+- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
++ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
+ goto discard;
+ }
+ break;
+@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ goto discard;
+ }
+ break;
++ case TCP_CLOSE:
++ if (tp->mp_killed)
++ goto discard;
+ }
+
+ /* step 6: check the URG bit */
+@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
+ */
+ if (sk->sk_shutdown & RCV_SHUTDOWN) {
+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
+- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp(tp)) {
++ /* In case of mptcp, the reset is handled by
++ * mptcp_rcv_state_process
++ */
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+ tcp_reset(sk);
+ return 1;
+@@ -5877,3 +6041,154 @@ discard:
+ return 0;
+ }
+ EXPORT_SYMBOL(tcp_rcv_state_process);
++
++static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ if (family == AF_INET)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
++ &ireq->ir_rmt_addr, port);
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (family == AF_INET6)
++ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
++ &ireq->ir_v6_rmt_addr, port);
++#endif
++}
++
++int tcp_conn_request(struct request_sock_ops *rsk_ops,
++ const struct tcp_request_sock_ops *af_ops,
++ struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_options_received tmp_opt;
++ struct request_sock *req;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct dst_entry *dst = NULL;
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false, fastopen;
++ struct flowi fl;
++ struct tcp_fastopen_cookie foc = { .len = -1 };
++ int err;
++
++
++ /* TW buckets are converted to open requests without
++ * limitations, they conserve resources and peer is
++ * evidently real one.
++ */
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++
++ /* Accept backlog is full. If we have already queued enough
++ * of warm entries in syn queue, drop request. It is better than
++ * clogging syn queue with openreqs with exponentially increasing
++ * timeout.
++ */
++ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
++ goto drop;
++ }
++
++ req = inet_reqsk_alloc(rsk_ops);
++ if (!req)
++ goto drop;
++
++ tcp_rsk(req)->af_specific = af_ops;
++
++ tcp_clear_options(&tmp_opt);
++ tmp_opt.mss_clamp = af_ops->mss_clamp;
++ tmp_opt.user_mss = tp->rx_opt.user_mss;
++ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
++
++ if (want_cookie && !tmp_opt.saw_tstamp)
++ tcp_clear_options(&tmp_opt);
++
++ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
++ tcp_openreq_init(req, &tmp_opt, skb);
++
++ if (af_ops->init_req(req, sk, skb))
++ goto drop_and_free;
++
++ if (security_inet_conn_request(sk, skb, req))
++ goto drop_and_free;
++
++ if (!want_cookie || tmp_opt.tstamp_ok)
++ TCP_ECN_create_request(req, skb, sock_net(sk));
++
++ if (want_cookie) {
++ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
++ req->cookie_ts = tmp_opt.tstamp_ok;
++ } else if (!isn) {
++ /* VJ's idea. We save last timestamp seen
++ * from the destination in peer table, when entering
++ * state TIME-WAIT, and check against it before
++ * accepting new connection request.
++ *
++ * If "isn" is not zero, this request hit alive
++ * timewait bucket, so that all the necessary checks
++ * are made in the function processing timewait state.
++ */
++ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
++ bool strict;
++
++ dst = af_ops->route_req(sk, &fl, req, &strict);
++ if (dst && strict &&
++ !tcp_peer_is_proven(req, dst, true)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
++ goto drop_and_release;
++ }
++ }
++ /* Kill the following clause, if you dislike this way. */
++ else if (!sysctl_tcp_syncookies &&
++ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
++ (sysctl_max_syn_backlog >> 2)) &&
++ !tcp_peer_is_proven(req, dst, false)) {
++ /* Without syncookies last quarter of
++ * backlog is filled with destinations,
++ * proven to be alive.
++ * It means that we continue to communicate
++ * to destinations, already remembered
++ * to the moment of synflood.
++ */
++ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
++ rsk_ops->family);
++ goto drop_and_release;
++ }
++
++ isn = af_ops->init_seq(skb);
++ }
++ if (!dst) {
++ dst = af_ops->route_req(sk, &fl, req, NULL);
++ if (!dst)
++ goto drop_and_free;
++ }
++
++ tcp_rsk(req)->snt_isn = isn;
++ tcp_openreq_init_rwin(req, sk, dst);
++ fastopen = !want_cookie &&
++ tcp_try_fastopen(sk, skb, req, &foc, dst);
++ err = af_ops->send_synack(sk, dst, &fl, req,
++ skb_get_queue_mapping(skb), &foc);
++ if (!fastopen) {
++ if (err || want_cookie)
++ goto drop_and_free;
++
++ tcp_rsk(req)->listener = NULL;
++ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
++ }
++
++ return 0;
++
++drop_and_release:
++ dst_release(dst);
++drop_and_free:
++ reqsk_free(req);
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++EXPORT_SYMBOL(tcp_conn_request);
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..c77017f600f1 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -67,6 +67,8 @@
+ #include <net/icmp.h>
+ #include <net/inet_hashtables.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/transp_v6.h>
+ #include <net/ipv6.h>
+ #include <net/inet_common.h>
+@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
+ struct inet_hashinfo tcp_hashinfo;
+ EXPORT_SYMBOL(tcp_hashinfo);
+
+-static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr,
+@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ struct inet_sock *inet;
+ const int type = icmp_hdr(icmp_skb)->type;
+ const int code = icmp_hdr(icmp_skb)->code;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ struct sk_buff *skb;
+ struct request_sock *fastopen;
+ __u32 seq, snd_una;
+@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ return;
+ }
+
+- bh_lock_sock(sk);
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
+ /* If too many ICMPs get dropped on busy
+ * servers this needs to be solved differently.
+ * We do take care of PMTU discovery (RFC1191) special case :
+ * we can receive locally generated ICMP messages while socket is held.
+ */
+- if (sock_owned_by_user(sk)) {
++ if (sock_owned_by_user(meta_sk)) {
+ if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+ }
+@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ icsk = inet_csk(sk);
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ goto out;
+
+ tp->mtu_info = info;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_v4_mtu_reduced(sk);
+ } else {
+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+ goto out;
+ }
+@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ !icsk->icsk_backoff || fastopen)
+ break;
+
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ break;
+
+ icsk->icsk_backoff--;
+@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet_csk_search_req(sk, &prev, th->dest,
+@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+
+ sk->sk_error_report(sk);
+@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ */
+
+ inet = inet_sk(sk);
+- if (!sock_owned_by_user(sk) && inet->recverr) {
++ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else { /* Only an error on timeout */
+@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
+ }
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
+ * Exception: precedence violation. We do not implement it in any case.
+ */
+
+-static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -702,10 +711,10 @@ release_sk1:
+ outside socket context is ugly, certainly. What can I do?
+ */
+
+-static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key,
+- int reply_flags, u8 tos)
++ int reply_flags, u8 tos, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct {
+@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ #ifdef CONFIG_TCP_MD5SIG
+ + (TCPOLEN_MD5SIG_ALIGNED >> 2)
+ #endif
++#ifdef CONFIG_MPTCP
++ + ((MPTCP_SUB_LEN_DSS >> 2) +
++ (MPTCP_SUB_LEN_ACK >> 2))
++#endif
+ ];
+ } rep;
+ struct ip_reply_arg arg;
+@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
+ ip_hdr(skb)->daddr, &rep.th);
+ }
+ #endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ int offset = (tsecr) ? 3 : 0;
++ /* Construction of 32-bit data_ack */
++ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ rep.opt[offset] = htonl(data_ack);
++
++ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++ rep.th.doff = arg.iov[0].iov_len / 4;
++ }
++#endif /* CONFIG_MPTCP */
++
+ arg.flags = reply_flags;
+ arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
+ ip_hdr(skb)->saddr, /* XXX */
+@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
++
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+
+ tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent,
+ tw->tw_bound_dev_if,
+ tcp_twsk_md5_key(tcptw),
+ tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- tw->tw_tos
++ tw->tw_tos, mptcp
+ );
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
++ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
+ tcp_time_stamp,
+ req->ts_recent,
+ 0,
+ tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
+ AF_INET),
+ inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
+- ip_hdr(skb)->tos);
++ ip_hdr(skb)->tos, 0);
+ }
+
+ /*
+@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+ * This still operates on a request_sock only, not on a big
+ * socket.
+ */
+-static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ const struct inet_request_sock *ireq = inet_rsk(req);
+ struct flowi4 fl4;
+@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
+ return err;
+ }
+
+-static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
+-{
+- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
+-
+- if (!res) {
+- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+- }
+- return res;
+-}
+-
+ /*
+ * IPv4 request_sock destructor.
+ */
+-static void tcp_v4_reqsk_destructor(struct request_sock *req)
++void tcp_v4_reqsk_destructor(struct request_sock *req)
+ {
+ kfree(inet_rsk(req)->opt);
+ }
+@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
+ /*
+ * Save and compile IPv4 options into the request_sock if needed.
+ */
+-static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
++struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
+ {
+ const struct ip_options *opt = &(IPCB(skb)->opt);
+ struct ip_options_rcu *dopt = NULL;
+@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+
+ #endif
+
++static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++
++ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
++ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
++ ireq->no_srccheck = inet_sk(sk)->transparent;
++ ireq->opt = tcp_v4_save_options(skb);
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
++
++ if (strict) {
++ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
++ *strict = true;
++ else
++ *strict = false;
++ }
++
++ return dst;
++}
++
+ struct request_sock_ops tcp_request_sock_ops __read_mostly = {
+ .family = PF_INET,
+ .obj_size = sizeof(struct tcp_request_sock),
+- .rtx_syn_ack = tcp_v4_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v4_reqsk_send_ack,
+ .destructor = tcp_v4_reqsk_destructor,
+ .send_reset = tcp_v4_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
++ .mss_clamp = TCP_MSS_DEFAULT,
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
+ .md5_lookup = tcp_v4_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v4_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v4_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v4_init_sequence,
++#endif
++ .route_req = tcp_v4_route_req,
++ .init_seq = tcp_v4_init_sequence,
++ .send_synack = tcp_v4_send_synack,
++ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
++};
+
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct tcp_sock *tp = tcp_sk(sk);
+- struct dst_entry *dst = NULL;
+- __be32 saddr = ip_hdr(skb)->saddr;
+- __be32 daddr = ip_hdr(skb)->daddr;
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- bool want_cookie = false, fastopen;
+- struct flowi4 fl4;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- int err;
+-
+ /* Never answer to SYNs send to broadcast or multicast */
+ if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
+ goto drop;
+
+- /* TW buckets are converted to open requests without
+- * limitations, they conserve resources and peer is
+- * evidently real one.
+- */
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- /* Accept backlog is full. If we have already queued enough
+- * of warm entries in syn queue, drop request. It is better than
+- * clogging syn queue with openreqs with exponentially increasing
+- * timeout.
+- */
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet_reqsk_alloc(&tcp_request_sock_ops);
+- if (!req)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
+-
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
++ return tcp_conn_request(&tcp_request_sock_ops,
++ &tcp_request_sock_ipv4_ops, sk, skb);
+
+- ireq = inet_rsk(req);
+- ireq->ir_loc_addr = daddr;
+- ireq->ir_rmt_addr = saddr;
+- ireq->no_srccheck = inet_sk(sk)->transparent;
+- ireq->opt = tcp_v4_save_options(skb);
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_free;
+-
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- if (want_cookie) {
+- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- } else if (!isn) {
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
+- fl4.daddr == saddr) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
+- &saddr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v4_init_sequence(skb);
+- }
+- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v4_send_synack(sk, dst, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_rsk(req)->listener = NULL;
+- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+-
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0;
+@@ -1497,7 +1433,7 @@ put_and_exit:
+ }
+ EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
+
+-static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcphdr *th = tcp_hdr(skb);
+ const struct iphdr *iph = ip_hdr(skb);
+@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock the meta-sk again. It has been locked
++ * before mptcp_v4_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
++
+ }
+ inet_twsk_put(inet_twsk(nsk));
+ return NULL;
+@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v4_do_rcv(sk, skb);
++
+ if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
+ struct dst_entry *dst = sk->sk_rx_dst;
+
+@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
+ } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
+ wake_up_interruptible_sync_poll(sk_sleep(sk),
+ POLLIN | POLLRDNORM | POLLRDBAND);
+- if (!inet_csk_ack_scheduled(sk))
++ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
+ (3 * tcp_rto_min(sk)) / 4,
+ TCP_RTO_MAX);
+@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ {
+ const struct iphdr *iph;
+ const struct tcphdr *th;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff * 4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1759,11 +1729,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1771,16 +1751,16 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v4_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+
+@@ -1835,6 +1815,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v4_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
+@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
+
+ tcp_cleanup_congestion_control(sk);
+
++ if (mptcp(tp))
++ mptcp_destroy_sock(sk);
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++
+ /* Cleanup up the write buffer. */
+ tcp_write_queue_purge(sk);
+
+@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
+ }
+ #endif /* CONFIG_PROC_FS */
+
++#ifdef CONFIG_MPTCP
++static void tcp_v4_clear_sk(struct sock *sk, int size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* we do not want to clear tk_table field, because of RCU lookups */
++ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
++
++ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
++}
++#endif
++
+ struct proto tcp_prot = {
+ .name = "TCP",
+ .owner = THIS_MODULE,
+@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
+ .destroy_cgroup = tcp_destroy_cgroup,
+ .proto_cgroup = tcp_proto_cgroup,
+ #endif
++#ifdef CONFIG_MPTCP
++ .clear_sk = tcp_v4_clear_sk,
++#endif
+ };
+ EXPORT_SYMBOL(tcp_prot);
+
+diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
+index e68e0d4af6c9..ae6946857dff 100644
+--- a/net/ipv4/tcp_minisocks.c
++++ b/net/ipv4/tcp_minisocks.c
+@@ -18,11 +18,13 @@
+ * Jorge Cwik, <jorge@laser.satlink.net>
+ */
+
++#include <linux/kconfig.h>
+ #include <linux/mm.h>
+ #include <linux/module.h>
+ #include <linux/slab.h>
+ #include <linux/sysctl.h>
+ #include <linux/workqueue.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+ #include <net/inet_common.h>
+ #include <net/xfrm.h>
+@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ struct tcp_options_received tmp_opt;
+ struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
+ bool paws_reject = false;
++ struct mptcp_options_received mopt;
+
+ tmp_opt.saw_tstamp = 0;
+ if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ mptcp_init_mp_opt(&mopt);
++
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
+@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
+ paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
+ }
++
++ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
++ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
++ goto kill_with_rst;
++ }
+ }
+
+ if (tw->tw_substate == TCP_FIN_WAIT2) {
+@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
+ if (!th->ack ||
+ !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
+ TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
++ /* If mptcp_is_data_fin() returns true, we are sure that
++ * mopt has been initialized - otherwise it would not
++ * be a DATA_FIN.
++ */
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
++ mptcp_is_data_fin(skb) &&
++ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
++ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
++ return TCP_TW_ACK;
++
+ inet_twsk_put(tw);
+ return TCP_TW_SUCCESS;
+ }
+@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
+ tcptw->tw_ts_offset = tp->tsoffset;
+
++ if (mptcp(tp)) {
++ if (mptcp_init_tw_sock(sk, tcptw)) {
++ inet_twsk_free(tw);
++ goto exit;
++ }
++ } else {
++ tcptw->mptcp_tw = NULL;
++ }
++
+ #if IS_ENABLED(CONFIG_IPV6)
+ if (tw->tw_family == PF_INET6) {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
+ }
+
++exit:
+ tcp_update_metrics(sk);
+ tcp_done(sk);
+ }
+
+ void tcp_twsk_destructor(struct sock *sk)
+ {
+-#ifdef CONFIG_TCP_MD5SIG
+ struct tcp_timewait_sock *twsk = tcp_twsk(sk);
+
++ if (twsk->mptcp_tw)
++ mptcp_twsk_destructor(twsk);
++#ifdef CONFIG_TCP_MD5SIG
+ if (twsk->tw_md5_key)
+ kfree_rcu(twsk->tw_md5_key, rcu);
+ #endif
+@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
+ req->window_clamp = tcp_full_space(sk);
+
+ /* tcp_full_space because it is guaranteed to be the first packet */
+- tcp_select_initial_window(tcp_full_space(sk),
+- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
++ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
+ &req->rcv_wnd,
+ &req->window_clamp,
+ ireq->wscale_ok,
+ &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ dst_metric(dst, RTAX_INITRWND), sk);
+ ireq->rcv_wscale = rcv_wscale;
+ }
+ EXPORT_SYMBOL(tcp_openreq_init_rwin);
+@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
+ newtp->rx_opt.ts_recent_stamp = 0;
+ newtp->tcp_header_len = sizeof(struct tcphdr);
+ }
++ if (ireq->saw_mpc)
++ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
+ newtp->tsoffset = 0;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->md5sig_info = NULL; /*XXX*/
+@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ bool fastopen)
+ {
+ struct tcp_options_received tmp_opt;
++ struct mptcp_options_received mopt;
+ struct sock *child;
+ const struct tcphdr *th = tcp_hdr(skb);
+ __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
+ bool paws_reject = false;
+
+- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
++ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
+
+ tmp_opt.saw_tstamp = 0;
++
++ mptcp_init_mp_opt(&mopt);
++
+ if (th->doff > (sizeof(struct tcphdr)>>2)) {
+- tcp_parse_options(skb, &tmp_opt, 0, NULL);
++ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
+
+ if (tmp_opt.saw_tstamp) {
+ tmp_opt.ts_recent = req->ts_recent;
+@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ *
+ * Reset timer after retransmitting SYNACK, similar to
+ * the idea of fast retransmit in recovery.
++ *
++ * Fall back to TCP if MP_CAPABLE is not set.
+ */
++
++ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
++ inet_rsk(req)->saw_mpc = false;
++
++
+ if (!inet_rtx_syn_ack(sk, req))
+ req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
+ TCP_RTO_MAX) + jiffies;
+@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
+ * socket is created, wait for troubles.
+ */
+ child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
++
+ if (child == NULL)
+ goto listen_overflow;
+
++ if (!is_meta_sk(sk)) {
++ int ret = mptcp_check_req_master(sk, child, req, prev);
++ if (ret < 0)
++ goto listen_overflow;
++
++ /* MPTCP-supported */
++ if (!ret)
++ return tcp_sk(child)->mpcb->master_sk;
++ } else {
++ return mptcp_check_req_child(sk, child, req, prev, &mopt);
++ }
+ inet_csk_reqsk_queue_unlink(sk, req, prev);
+ inet_csk_reqsk_queue_removed(sk, req);
+
+@@ -746,7 +804,17 @@ embryonic_reset:
+ tcp_reset(sk);
+ }
+ if (!fastopen) {
+- inet_csk_reqsk_queue_drop(sk, req, prev);
++ if (is_meta_sk(sk)) {
++ /* We want to avoid stopping the keepalive-timer and so
++ * avoid ending up in inet_csk_reqsk_queue_removed ...
++ */
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
++ mptcp_delete_synack_timer(sk);
++ reqsk_free(req);
++ } else {
++ inet_csk_reqsk_queue_drop(sk, req, prev);
++ }
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
+ }
+ return NULL;
+@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ {
+ int ret = 0;
+ int state = child->sk_state;
++ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
+
+- if (!sock_owned_by_user(child)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
+ skb->len);
+ /* Wakeup parent, send SIGIO */
+@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
+ * in main socket hash table and lock on listening
+ * socket does not protect us more.
+ */
+- __sk_add_backlog(child, skb);
++ if (mptcp(tcp_sk(child)))
++ skb->sk = child;
++ __sk_add_backlog(meta_sk, skb);
+ }
+
+- bh_unlock_sock(child);
++ if (mptcp(tcp_sk(child)))
++ bh_unlock_sock(child);
++ bh_unlock_sock(meta_sk);
+ sock_put(child);
+ return ret;
+ }
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..efd31b6c5784 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -36,6 +36,12 @@
+
+ #define pr_fmt(fmt) "TCP: " fmt
+
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++#include <net/ipv6.h>
+ #include <net/tcp.h>
+
+ #include <linux/compiler.h>
+@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
+ unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+ EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+- int push_one, gfp_t gfp);
+-
+ /* Account for new data that has been sent to the network. */
+-static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
++void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
+ void tcp_select_initial_window(int __space, __u32 mss,
+ __u32 *rcv_wnd, __u32 *window_clamp,
+ int wscale_ok, __u8 *rcv_wscale,
+- __u32 init_rcv_wnd)
++ __u32 init_rcv_wnd, const struct sock *sk)
+ {
+ unsigned int space = (__space < 0 ? 0 : __space);
+
+@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
+ * value can be stuffed directly into th->window for an outgoing
+ * frame.
+ */
+-static u16 tcp_select_window(struct sock *sk)
++u16 tcp_select_window(struct sock *sk)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ u32 old_win = tp->rcv_wnd;
+- u32 cur_win = tcp_receive_window(tp);
+- u32 new_win = __tcp_select_window(sk);
++ /* The window must never shrink at the meta-level. At the subflow we
++ * have to allow this. Otherwise we may announce a window too large
++ * for the current meta-level sk_rcvbuf.
++ */
++ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
++ u32 new_win = tp->ops->__select_window(sk);
+
+ /* Never shrink the offered window */
+ if (new_win < cur_win) {
+@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
+ LINUX_MIB_TCPWANTZEROWINDOWADV);
+ new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+ }
++
+ tp->rcv_wnd = new_win;
+ tp->rcv_wup = tp->rcv_nxt;
+
+@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
+ /* Constructs common control bits of non-data skb. If SYN/FIN is present,
+ * auto increment end seqno.
+ */
+-static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
++void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
+ TCP_SKB_CB(skb)->end_seq = seq;
+ }
+
+-static inline bool tcp_urg_mode(const struct tcp_sock *tp)
++bool tcp_urg_mode(const struct tcp_sock *tp)
+ {
+ return tp->snd_una != tp->snd_up;
+ }
+@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
+ #define OPTION_MD5 (1 << 2)
+ #define OPTION_WSCALE (1 << 3)
+ #define OPTION_FAST_OPEN_COOKIE (1 << 8)
+-
+-struct tcp_out_options {
+- u16 options; /* bit field of OPTION_* */
+- u16 mss; /* 0 to disable */
+- u8 ws; /* window scale, 0 to disable */
+- u8 num_sack_blocks; /* number of SACK blocks to include */
+- u8 hash_size; /* bytes in hash_location */
+- __u8 *hash_location; /* temporary pointer, overloaded */
+- __u32 tsval, tsecr; /* need to include OPTION_TS */
+- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
+-};
++/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
+
+ /* Write previously computed TCP options to the packet.
+ *
+@@ -430,7 +428,7 @@ struct tcp_out_options {
+ * (but it may well be that other scenarios fail similarly).
+ */
+ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+- struct tcp_out_options *opts)
++ struct tcp_out_options *opts, struct sk_buff *skb)
+ {
+ u16 options = opts->options; /* mungable copy */
+
+@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
+ }
+ ptr += (foc->len + 3) >> 2;
+ }
++
++ if (unlikely(OPTION_MPTCP & opts->options))
++ mptcp_options_write(ptr, tp, opts, skb);
+ }
+
+ /* Compute TCP options for SYN packets. This is not the final
+@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
+ if (unlikely(!(OPTION_TS & opts->options)))
+ remaining -= TCPOLEN_SACKPERM_ALIGNED;
+ }
++ if (tp->request_mptcp || mptcp(tp))
++ mptcp_syn_options(sk, opts, &remaining);
+
+ if (fastopen && fastopen->cookie.len >= 0) {
+ u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
+@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
+ }
+ }
+
++ if (ireq->saw_mpc)
++ mptcp_synack_options(req, opts, &remaining);
++
+ return MAX_TCP_OPTION_SPACE - remaining;
+ }
+
+@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
+ opts->tsecr = tp->rx_opt.ts_recent;
+ size += TCPOLEN_TSTAMP_ALIGNED;
+ }
++ if (mptcp(tp))
++ mptcp_established_options(sk, skb, opts, &size);
+
+ eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
+ if (unlikely(eff_sacks)) {
+- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
+- opts->num_sack_blocks =
+- min_t(unsigned int, eff_sacks,
+- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
+- TCPOLEN_SACK_PERBLOCK);
+- size += TCPOLEN_SACK_BASE_ALIGNED +
+- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
++ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
++ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
++ opts->num_sack_blocks = 0;
++ else
++ opts->num_sack_blocks =
++ min_t(unsigned int, eff_sacks,
++ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
++ TCPOLEN_SACK_PERBLOCK);
++ if (opts->num_sack_blocks)
++ size += TCPOLEN_SACK_BASE_ALIGNED +
++ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ }
+
+ return size;
+@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
+ if ((1 << sk->sk_state) &
+ (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
+ TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
+- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
+- 0, GFP_ATOMIC);
++ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
++ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
+ }
+ /*
+ * One tasklet per cpu tries to send more skbs.
+@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
+ unsigned long flags;
+ struct list_head *q, *n;
+ struct tcp_sock *tp;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+
+ local_irq_save(flags);
+ list_splice_init(&tsq->head, &list);
+@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
+ list_del(&tp->tsq_node);
+
+ sk = (struct sock *)tp;
+- bh_lock_sock(sk);
++ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
++ bh_lock_sock(meta_sk);
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_tsq_handler(sk);
++ if (mptcp(tp))
++ tcp_tsq_handler(meta_sk);
+ } else {
++ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
++ goto exit;
++
+ /* defer the work to tcp_release_cb() */
+ set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
++
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++exit:
++ bh_unlock_sock(meta_sk);
+
+ clear_bit(TSQ_QUEUED, &tp->tsq_flags);
+ sk_free(sk);
+@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
+ #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
+ (1UL << TCP_WRITE_TIMER_DEFERRED) | \
+ (1UL << TCP_DELACK_TIMER_DEFERRED) | \
+- (1UL << TCP_MTU_REDUCED_DEFERRED))
++ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
++ (1UL << MPTCP_PATH_MANAGER) | \
++ (1UL << MPTCP_SUB_DEFERRED))
++
+ /**
+ * tcp_release_cb - tcp release_sock() callback
+ * @sk: socket
+@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
+ sk->sk_prot->mtu_reduced(sk);
+ __sock_put(sk);
+ }
++ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
++ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
++ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
++ __sock_put(sk);
++ }
++ if (flags & (1UL << MPTCP_SUB_DEFERRED))
++ mptcp_tsq_sub_deferred(sk);
+ }
+ EXPORT_SYMBOL(tcp_release_cb);
+
+@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
+ * We are working here with either a clone of the original
+ * SKB, or a fresh unique copy made by the retransmit engine.
+ */
+-static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+- gfp_t gfp_mask)
++int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
++ gfp_t gfp_mask)
+ {
+ const struct inet_connection_sock *icsk = inet_csk(sk);
+ struct inet_sock *inet;
+@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ */
+ th->window = htons(min(tp->rcv_wnd, 65535U));
+ } else {
+- th->window = htons(tcp_select_window(sk));
++ th->window = htons(tp->ops->select_window(sk));
+ }
+ th->check = 0;
+ th->urg_ptr = 0;
+@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ }
+ }
+
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
+ TCP_ECN_send(sk, skb, tcp_header_size);
+
+@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
+ * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
+ * otherwise socket can stall.
+ */
+-static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
++void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
+ }
+
+ /* Initialize TSO segments for a packet. */
+-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ /* Make sure we own this skb before messing gso_size/gso_segs */
+ WARN_ON_ONCE(skb_cloned(skb));
+
+- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
++ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
++ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
+ /* Avoid the costly divide in the normal
+ * non-TSO case.
+ */
+@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
+ /* Pcount in the middle of the write queue got changed, we need to do various
+ * tweaks to fix counters
+ */
+-static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
++void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
+@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
+ * eventually). The difference is that pulled data not copied, but
+ * immediately discarded.
+ */
+-static void __pskb_trim_head(struct sk_buff *skb, int len)
++void __pskb_trim_head(struct sk_buff *skb, int len)
+ {
+ struct skb_shared_info *shinfo;
+ int i, k, eat;
+@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
+ /* Remove acked data from a packet in the transmit queue. */
+ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ {
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
++ return mptcp_trim_head(sk, skb, len);
++
+ if (skb_unclone(skb, GFP_ATOMIC))
+ return -ENOMEM;
+
+@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+ if (tcp_skb_pcount(skb) > 1)
+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
+
++#ifdef CONFIG_MPTCP
++ /* Some data got acked - we assume that the seq-number reached the dest.
++ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
++ * Only remove the SEQ if the call does not come from a meta retransmit.
++ */
++ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
++ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
++#endif
++
+ return 0;
+ }
+
+@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
+
+ return mss_now;
+ }
++EXPORT_SYMBOL(tcp_current_mss);
+
+ /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
+ * As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
+ * But we can avoid doing the divide again given we already have
+ * skb_pcount = skb->len / mss_now
+ */
+-static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+- const struct sk_buff *skb)
++void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
++ const struct sk_buff *skb)
+ {
+ if (skb->len < tcp_skb_pcount(skb) * mss_now)
+ tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
+ (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
+ }
+ /* Returns the portion of skb which can be sent right away */
+-static unsigned int tcp_mss_split_point(const struct sock *sk,
+- const struct sk_buff *skb,
+- unsigned int mss_now,
+- unsigned int max_segs,
+- int nonagle)
++unsigned int tcp_mss_split_point(const struct sock *sk,
++ const struct sk_buff *skb,
++ unsigned int mss_now,
++ unsigned int max_segs,
++ int nonagle)
+ {
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u32 partial, needed, window, max_len;
+@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
+ /* Can at least one segment of SKB be sent right now, according to the
+ * congestion window rules? If so, return how many segments are allowed.
+ */
+-static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb)
++unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
++ const struct sk_buff *skb)
+ {
+ u32 in_flight, cwnd;
+
+ /* Don't be strict about the congestion window for the final FIN. */
+- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
++ if (skb &&
++ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
+ tcp_skb_pcount(skb) == 1)
+ return 1;
+
+@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
+ * This must be invoked the first time we consider transmitting
+ * SKB onto the wire.
+ */
+-static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+- unsigned int mss_now)
++int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
++ unsigned int mss_now)
+ {
+ int tso_segs = tcp_skb_pcount(skb);
+
+@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ /* Return true if the Nagle test allows this packet to be
+ * sent now.
+ */
+-static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
+- unsigned int cur_mss, int nonagle)
++bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss, int nonagle)
+ {
+ /* Nagle rule does not apply to frames, which sit in the middle of the
+ * write_queue (they have no chances to get new data).
+@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ return true;
+
+ /* Don't use the nagle rule for urgent data (or for the final FIN). */
+- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
++ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
++ mptcp_is_data_fin(skb))
+ return true;
+
+ if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
+@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
+ }
+
+ /* Does at least the first segment of SKB fit into the send window? */
+-static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
+- const struct sk_buff *skb,
+- unsigned int cur_mss)
++bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
++ unsigned int cur_mss)
+ {
+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
+
+@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
+ u32 send_win, cong_win, limit, in_flight;
+ int win_divisor;
+
+- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
++ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
+ goto send_now;
+
+ if (icsk->icsk_ca_state != TCP_CA_Open)
+@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
+ * Returns true, if no segments are in flight and we have queued segments,
+ * but cannot send anything now because of SWS or another problem.
+ */
+-static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
++bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ int push_one, gfp_t gfp)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+
+ sent_pkts = 0;
+
+- if (!push_one) {
++ /* pmtu not yet supported with MPTCP. Should be possible, by early
++ * exiting the loop inside tcp_mtu_probe, making sure that only one
++ * single DSS-mapping gets probed.
++ */
++ if (!push_one && !mptcp(tp)) {
+ /* Do MTU probing. */
+ result = tcp_mtu_probe(sk);
+ if (!result) {
+@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+ int err = -1;
+
+ if (tcp_send_head(sk) != NULL) {
+- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
++ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
++ GFP_ATOMIC);
+ goto rearm_timer;
+ }
+
+@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+ if (unlikely(sk->sk_state == TCP_CLOSE))
+ return;
+
+- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
+- sk_gfp_atomic(sk, GFP_ATOMIC)))
++ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
++ sk_gfp_atomic(sk, GFP_ATOMIC)))
+ tcp_check_probe_timer(sk);
+ }
+
+@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+
+ BUG_ON(!skb || skb->len < mss_now);
+
+- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
++ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
++ sk->sk_allocation);
+ }
+
+ /* This function returns the amount that we can raise the
+@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
+ return;
+
++ /* Currently not supported for MPTCP - but it should be possible */
++ if (mptcp(tp))
++ return;
++
+ tcp_for_write_queue_from_safe(skb, tmp, sk) {
+ if (!tcp_can_collapse(sk, skb))
+ break;
+@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+
+ /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
+ th->window = htons(min(req->rcv_wnd, 65535U));
+- tcp_options_write((__be32 *)(th + 1), tp, &opts);
++ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
+ th->doff = (tcp_header_size >> 2);
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
+
+@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+ (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
+ tp->window_clamp = tcp_full_space(sk);
+
+- tcp_select_initial_window(tcp_full_space(sk),
+- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
+- &tp->rcv_wnd,
+- &tp->window_clamp,
+- sysctl_tcp_window_scaling,
+- &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk),
++ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
++ &tp->rcv_wnd,
++ &tp->window_clamp,
++ sysctl_tcp_window_scaling,
++ &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ tp->rx_opt.rcv_wscale = rcv_wscale;
+ tp->rcv_ssthresh = tp->rcv_wnd;
+@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+ inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_retransmits = 0;
+ tcp_clear_retrans(tp);
++
++#ifdef CONFIG_MPTCP
++ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
++ if (is_master_tp(tp)) {
++ tp->request_mptcp = 1;
++ mptcp_connect_init(sk);
++ } else if (tp->mptcp) {
++ struct inet_sock *inet = inet_sk(sk);
++
++ tp->mptcp->snt_isn = tp->write_seq;
++ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
++
++ /* Set nonce for new subflows */
++ if (sk->sk_family == AF_INET)
++ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
++ inet->inet_saddr,
++ inet->inet_daddr,
++ inet->inet_sport,
++ inet->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
++ inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ inet->inet_sport,
++ inet->inet_dport);
++#endif
++ }
++ }
++#endif
+ }
+
+ static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
+@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+ TCP_SKB_CB(buff)->when = tcp_time_stamp;
+ tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
+ }
++EXPORT_SYMBOL(tcp_send_ack);
+
+ /* This routine sends a packet with an out of date sequence
+ * number. It assumes the other end will try to ack it.
+@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+ * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
+ * out-of-date with SND.UNA-1 to probe window.
+ */
+-static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
++int tcp_xmit_probe_skb(struct sock *sk, int urgent)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
+@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+ struct tcp_sock *tp = tcp_sk(sk);
+ int err;
+
+- err = tcp_write_wakeup(sk);
++ err = tp->ops->write_wakeup(sk);
+
+ if (tp->packets_out || !tcp_send_head(sk)) {
+ /* Cancel probe timer, if it is not required. */
+@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+ TCP_RTO_MAX);
+ }
+ }
++
++int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
++{
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
++ int res;
++
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
++ if (!res) {
++ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
++ }
++ return res;
++}
++EXPORT_SYMBOL(tcp_rtx_synack);
+diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
+index 286227abed10..966b873cbf3e 100644
+--- a/net/ipv4/tcp_timer.c
++++ b/net/ipv4/tcp_timer.c
+@@ -20,6 +20,7 @@
+
+ #include <linux/module.h>
+ #include <linux/gfp.h>
++#include <net/mptcp.h>
+ #include <net/tcp.h>
+
+ int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
+@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+ int sysctl_tcp_orphan_retries __read_mostly;
+ int sysctl_tcp_thin_linear_timeouts __read_mostly;
+
+-static void tcp_write_err(struct sock *sk)
++void tcp_write_err(struct sock *sk)
+ {
+ sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
+ sk->sk_error_report(sk);
+@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
+ (!tp->snd_wnd && !tp->packets_out))
+ do_reset = 1;
+ if (do_reset)
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_done(sk);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
+ return 1;
+@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
+ * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
+ * syn_set flag is set.
+ */
+-static bool retransmits_timed_out(struct sock *sk,
+- unsigned int boundary,
+- unsigned int timeout,
+- bool syn_set)
++bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
++ unsigned int timeout, bool syn_set)
+ {
+ unsigned int linear_backoff_thresh, start_ts;
+ unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
+ }
+
+ /* A write timeout has occurred. Process the after effects. */
+-static int tcp_write_timeout(struct sock *sk)
++int tcp_write_timeout(struct sock *sk)
+ {
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
+ }
+ retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
+ syn_set = true;
++ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
++ if (tcp_sk(sk)->request_mptcp &&
++ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
++ tcp_sk(sk)->request_mptcp = 0;
+ } else {
+ if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
+ /* Black hole detection */
+@@ -251,18 +254,22 @@ out:
+ static void tcp_delack_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_delack_timer_handler(sk);
+ } else {
+ inet_csk(sk)->icsk_ack.blocked = 1;
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -479,6 +486,10 @@ out_reset_timer:
+ __sk_dst_reset(sk);
+
+ out:;
++ if (mptcp(tp)) {
++ mptcp_reinject_data(sk, 1);
++ mptcp_set_rto(sk);
++ }
+ }
+
+ void tcp_write_timer_handler(struct sock *sk)
+@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
+ break;
+ case ICSK_TIME_RETRANS:
+ icsk->icsk_pending = 0;
+- tcp_retransmit_timer(sk);
++ tcp_sk(sk)->ops->retransmit_timer(sk);
+ break;
+ case ICSK_TIME_PROBE0:
+ icsk->icsk_pending = 0;
+@@ -520,16 +531,19 @@ out:
+ static void tcp_write_timer(unsigned long data)
+ {
+ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
+
+- bh_lock_sock(sk);
+- if (!sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (!sock_owned_by_user(meta_sk)) {
+ tcp_write_timer_handler(sk);
+ } else {
+ /* deleguate our work to tcp_release_cb() */
+ if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
+ sock_hold(sk);
++ if (mptcp(tcp_sk(sk)))
++ mptcp_tsq_flags(sk);
+ }
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
+ struct sock *sk = (struct sock *) data;
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
+ u32 elapsed;
+
+ /* Only process if socket is not in use. */
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk)) {
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
+ /* Try again later. */
+ inet_csk_reset_keepalive_timer (sk, HZ/20);
+ goto out;
+@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
+ goto out;
+ }
+
++ if (tp->send_mp_fclose) {
++ /* MUST do this before tcp_write_timeout, because retrans_stamp
++ * may have been set to 0 in another part while we are
++ * retransmitting MP_FASTCLOSE. Then, we would crash, because
++ * retransmits_timed_out accesses the meta-write-queue.
++ *
++ * We make sure that the timestamp is != 0.
++ */
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk))
++ goto out;
++
++ tcp_send_ack(sk);
++ icsk->icsk_retransmits++;
++
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ elapsed = icsk->icsk_rto;
++ goto resched;
++ }
++
+ if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
+ if (tp->linger2 >= 0) {
+ const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
+
+ if (tmo > 0) {
+- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
++ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
+ goto out;
+ }
+ }
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ goto death;
+ }
+
+@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
+ icsk->icsk_probes_out > 0) ||
+ (icsk->icsk_user_timeout == 0 &&
+ icsk->icsk_probes_out >= keepalive_probes(tp))) {
+- tcp_send_active_reset(sk, GFP_ATOMIC);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
+ tcp_write_err(sk);
+ goto out;
+ }
+- if (tcp_write_wakeup(sk) <= 0) {
++ if (tp->ops->write_wakeup(sk) <= 0) {
+ icsk->icsk_probes_out++;
+ elapsed = keepalive_intvl_when(tp);
+ } else {
+@@ -642,7 +679,7 @@ death:
+ tcp_done(sk);
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..7139c2973fd2 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
+
+ kfree_rcu(ifp, rcu);
+ }
++EXPORT_SYMBOL(inet6_ifa_finish_destroy);
+
+ static void
+ ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
+index 7cb4392690dd..7057afbca4df 100644
+--- a/net/ipv6/af_inet6.c
++++ b/net/ipv6/af_inet6.c
+@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
+ return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
+ }
+
+-static int inet6_create(struct net *net, struct socket *sock, int protocol,
+- int kern)
++int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
+ {
+ struct inet_sock *inet;
+ struct ipv6_pinfo *np;
+diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
+index a245e5ddffbd..99c892b8992d 100644
+--- a/net/ipv6/inet6_connection_sock.c
++++ b/net/ipv6/inet6_connection_sock.c
+@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
+ /*
+ * request_sock (formerly open request) hash tables.
+ */
+-static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
+- const u32 rnd, const u32 synq_hsize)
++u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
++ const u32 rnd, const u32 synq_hsize)
+ {
+ u32 c;
+
+diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
+index edb58aff4ae7..ea4d9fda0927 100644
+--- a/net/ipv6/ipv6_sockglue.c
++++ b/net/ipv6/ipv6_sockglue.c
+@@ -48,6 +48,8 @@
+ #include <net/addrconf.h>
+ #include <net/inet_common.h>
+ #include <net/tcp.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
+ #include <net/udp.h>
+ #include <net/udplite.h>
+ #include <net/xfrm.h>
+@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
+ sock_prot_inuse_add(net, &tcp_prot, 1);
+ local_bh_enable();
+ sk->sk_prot = &tcp_prot;
+- icsk->icsk_af_ops = &ipv4_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v4_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv4_specific;
+ sk->sk_socket->ops = &inet_stream_ops;
+ sk->sk_family = PF_INET;
+ tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
+diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
+index a822b880689b..b2b38869d795 100644
+--- a/net/ipv6/syncookies.c
++++ b/net/ipv6/syncookies.c
+@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+
+ /* check for timestamp cookie support */
+ memset(&tcp_opt, 0, sizeof(tcp_opt));
+- tcp_parse_options(skb, &tcp_opt, 0, NULL);
++ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
+
+ if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
+ goto out;
+
+ ret = NULL;
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
++ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
+ if (!req)
+ goto out;
+
+@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
+ }
+
+ req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
+- tcp_select_initial_window(tcp_full_space(sk), req->mss,
+- &req->rcv_wnd, &req->window_clamp,
+- ireq->wscale_ok, &rcv_wscale,
+- dst_metric(dst, RTAX_INITRWND));
++ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
++ &req->rcv_wnd, &req->window_clamp,
++ ireq->wscale_ok, &rcv_wscale,
++ dst_metric(dst, RTAX_INITRWND), sk);
+
+ ireq->rcv_wscale = rcv_wscale;
+
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..fda94d71666e 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -63,6 +63,8 @@
+ #include <net/inet_common.h>
+ #include <net/secure_seq.h>
+ #include <net/tcp_memcontrol.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
+ #include <net/busy_poll.h>
+
+ #include <linux/proc_fs.h>
+@@ -71,12 +73,6 @@
+ #include <linux/crypto.h>
+ #include <linux/scatterlist.h>
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req);
+-
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
+-
+ static const struct inet_connection_sock_af_ops ipv6_mapped;
+ static const struct inet_connection_sock_af_ops ipv6_specific;
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
+ }
+ #endif
+
+-static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
++void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ {
+ struct dst_entry *dst = skb_dst(skb);
+ const struct rt6_info *rt = (const struct rt6_info *)dst;
+@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
+ inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+ }
+
+-static void tcp_v6_hash(struct sock *sk)
++void tcp_v6_hash(struct sock *sk)
+ {
+ if (sk->sk_state != TCP_CLOSE) {
+- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
++ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
++ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
+ tcp_prot.hash(sk);
+ return;
+ }
+@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
+ }
+ }
+
+-static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
++__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ {
+ return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
+ ipv6_hdr(skb)->saddr.s6_addr32,
+@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
+ tcp_hdr(skb)->source);
+ }
+
+-static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
++int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ int addr_len)
+ {
+ struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
+@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+ sin.sin_port = usin->sin6_port;
+ sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
+
+- icsk->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_mapped;
+ sk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
+
+ if (err) {
+ icsk->icsk_ext_hdr_len = exthdrlen;
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+ sk->sk_backlog_rcv = tcp_v6_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ tp->af_specific = &tcp_sock_ipv6_specific;
+@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
+ const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
+ struct ipv6_pinfo *np;
+- struct sock *sk;
++ struct sock *sk, *meta_sk;
+ int err;
+ struct tcp_sock *tp;
+ struct request_sock *fastopen;
+@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ return;
+ }
+
+- bh_lock_sock(sk);
+- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
++ tp = tcp_sk(sk);
++ if (mptcp(tp))
++ meta_sk = mptcp_meta_sk(sk);
++ else
++ meta_sk = sk;
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
+ NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
+
+ if (sk->sk_state == TCP_CLOSE)
+@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- tp = tcp_sk(sk);
+ seq = ntohl(th->seq);
+ /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
+ fastopen = tp->fastopen_rsk;
+@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+
+ tp->mtu_info = ntohl(info);
+- if (!sock_owned_by_user(sk))
++ if (!sock_owned_by_user(meta_sk))
+ tcp_v6_mtu_reduced(sk);
+- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
++ else {
++ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
+ &tp->tsq_flags))
+- sock_hold(sk);
++ sock_hold(sk);
++ if (mptcp(tp))
++ mptcp_tsq_flags(sk);
++ }
+ goto out;
+ }
+
+@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ switch (sk->sk_state) {
+ struct request_sock *req, **prev;
+ case TCP_LISTEN:
+- if (sock_owned_by_user(sk))
++ if (sock_owned_by_user(meta_sk))
+ goto out;
+
+ req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
+@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ if (fastopen && fastopen->sk == NULL)
+ break;
+
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
+
+@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ goto out;
+ }
+
+- if (!sock_owned_by_user(sk) && np->recverr) {
++ if (!sock_owned_by_user(meta_sk) && np->recverr) {
+ sk->sk_err = err;
+ sk->sk_error_report(sk);
+ } else
+ sk->sk_err_soft = err;
+
+ out:
+- bh_unlock_sock(sk);
++ bh_unlock_sock(meta_sk);
+ sock_put(sk);
+ }
+
+
+-static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+- struct flowi6 *fl6,
+- struct request_sock *req,
+- u16 queue_mapping,
+- struct tcp_fastopen_cookie *foc)
++int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
++ struct flowi *fl,
++ struct request_sock *req,
++ u16 queue_mapping,
++ struct tcp_fastopen_cookie *foc)
+ {
+ struct inet_request_sock *ireq = inet_rsk(req);
+ struct ipv6_pinfo *np = inet6_sk(sk);
++ struct flowi6 *fl6 = &fl->u.ip6;
+ struct sk_buff *skb;
+ int err = -ENOMEM;
+
+@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
+ skb_set_queue_mapping(skb, queue_mapping);
+ err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
+ err = net_xmit_eval(err);
++ if (!tcp_rsk(req)->snt_synack && !err)
++ tcp_rsk(req)->snt_synack = tcp_time_stamp;
+ }
+
+ done:
+ return err;
+ }
+
+-static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
++int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ {
+- struct flowi6 fl6;
++ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
++ struct flowi fl;
+ int res;
+
+- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
++ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
+ if (!res) {
+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
+@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
+ return res;
+ }
+
+-static void tcp_v6_reqsk_destructor(struct request_sock *req)
++void tcp_v6_reqsk_destructor(struct request_sock *req)
+ {
+ kfree_skb(inet_rsk(req)->pktopts);
+ }
+@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
+ }
+ #endif
+
++static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct inet_request_sock *ireq = inet_rsk(req);
++ struct ipv6_pinfo *np = inet6_sk(sk);
++
++ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
++ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
++
++ ireq->ir_iif = sk->sk_bound_dev_if;
++ ireq->ir_mark = inet_request_mark(sk, skb);
++
++ /* So that link locals have meaning */
++ if (!sk->sk_bound_dev_if &&
++ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
++ ireq->ir_iif = inet6_iif(skb);
++
++ if (!TCP_SKB_CB(skb)->when &&
++ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
++ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
++ np->rxopt.bits.rxohlim || np->repflow)) {
++ atomic_inc(&skb->users);
++ ireq->pktopts = skb;
++ }
++
++ return 0;
++}
++
++static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
++ const struct request_sock *req,
++ bool *strict)
++{
++ if (strict)
++ *strict = true;
++ return inet6_csk_route_req(sk, &fl->u.ip6, req);
++}
++
+ struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
+ .family = AF_INET6,
+ .obj_size = sizeof(struct tcp6_request_sock),
+- .rtx_syn_ack = tcp_v6_rtx_synack,
++ .rtx_syn_ack = tcp_rtx_synack,
+ .send_ack = tcp_v6_reqsk_send_ack,
+ .destructor = tcp_v6_reqsk_destructor,
+ .send_reset = tcp_v6_send_reset,
+ .syn_ack_timeout = tcp_syn_ack_timeout,
+ };
+
++const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
++ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
++ sizeof(struct ipv6hdr),
+ #ifdef CONFIG_TCP_MD5SIG
+-static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
+ .md5_lookup = tcp_v6_reqsk_md5_lookup,
+ .calc_md5_hash = tcp_v6_md5_hash_skb,
+-};
+ #endif
++ .init_req = tcp_v6_init_req,
++#ifdef CONFIG_SYN_COOKIES
++ .cookie_init_seq = cookie_v6_init_sequence,
++#endif
++ .route_req = tcp_v6_route_req,
++ .init_seq = tcp_v6_init_sequence,
++ .send_synack = tcp_v6_send_synack,
++ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
++};
+
+-static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+- u32 tsval, u32 tsecr, int oif,
+- struct tcp_md5sig_key *key, int rst, u8 tclass,
+- u32 label)
++static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
++ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
++ int oif, struct tcp_md5sig_key *key, int rst,
++ u8 tclass, u32 label, int mptcp)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ struct tcphdr *t1;
+@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ if (key)
+ tot_len += TCPOLEN_MD5SIG_ALIGNED;
+ #endif
+-
++#ifdef CONFIG_MPTCP
++ if (mptcp)
++ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
++#endif
+ buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
+ GFP_ATOMIC);
+ if (buff == NULL)
+@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ tcp_v6_md5_hash_hdr((__u8 *)topt, key,
+ &ipv6_hdr(skb)->saddr,
+ &ipv6_hdr(skb)->daddr, t1);
++ topt += 4;
++ }
++#endif
++#ifdef CONFIG_MPTCP
++ if (mptcp) {
++ /* Construction of 32-bit data_ack */
++ *topt++ = htonl((TCPOPT_MPTCP << 24) |
++ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
++ (0x20 << 8) |
++ (0x01));
++ *topt++ = htonl(data_ack);
+ }
+ #endif
+
+@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
+ kfree_skb(buff);
+ }
+
+-static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
++void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ {
+ const struct tcphdr *th = tcp_hdr(skb);
+ u32 seq = 0, ack_seq = 0;
+@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
+ (th->doff << 2);
+
+ oif = sk ? sk->sk_bound_dev_if : 0;
+- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
++ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
+
+ #ifdef CONFIG_TCP_MD5SIG
+ release_sk1:
+@@ -902,45 +983,52 @@ release_sk1:
+ #endif
+ }
+
+-static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
++static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
+ u32 win, u32 tsval, u32 tsecr, int oif,
+ struct tcp_md5sig_key *key, u8 tclass,
+- u32 label)
++ u32 label, int mptcp)
+ {
+- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
+- label);
++ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
++ key, 0, tclass, label, mptcp);
+ }
+
+ static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
+ {
+ struct inet_timewait_sock *tw = inet_twsk(sk);
+ struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
++ u32 data_ack = 0;
++ int mptcp = 0;
+
++ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
++ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
++ mptcp = 1;
++ }
+ tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
++ data_ack,
+ tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
+ tcp_time_stamp + tcptw->tw_ts_offset,
+ tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
+- tw->tw_tclass, (tw->tw_flowlabel << 12));
++ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
+
+ inet_twsk_put(tw);
+ }
+
+-static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req)
++void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req)
+ {
+ /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
+ * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
+ */
+ tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
+ tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
+- tcp_rsk(req)->rcv_nxt,
++ tcp_rsk(req)->rcv_nxt, 0,
+ req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
+ tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
+- 0, 0);
++ 0, 0, 0);
+ }
+
+
+-static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
++struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ {
+ struct request_sock *req, **prev;
+ const struct tcphdr *th = tcp_hdr(skb);
+@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+
+ if (nsk) {
+ if (nsk->sk_state != TCP_TIME_WAIT) {
++ /* Don't lock again the meta-sk. It has been locked
++ * before mptcp_v6_do_rcv.
++ */
++ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
++ bh_lock_sock(mptcp_meta_sk(nsk));
+ bh_lock_sock(nsk);
++
+ return nsk;
+ }
+ inet_twsk_put(inet_twsk(nsk));
+@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
+ return sk;
+ }
+
+-/* FIXME: this is substantially similar to the ipv4 code.
+- * Can some kind of merge be done? -- erics
+- */
+-static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
+ {
+- struct tcp_options_received tmp_opt;
+- struct request_sock *req;
+- struct inet_request_sock *ireq;
+- struct ipv6_pinfo *np = inet6_sk(sk);
+- struct tcp_sock *tp = tcp_sk(sk);
+- __u32 isn = TCP_SKB_CB(skb)->when;
+- struct dst_entry *dst = NULL;
+- struct tcp_fastopen_cookie foc = { .len = -1 };
+- bool want_cookie = false, fastopen;
+- struct flowi6 fl6;
+- int err;
+-
+ if (skb->protocol == htons(ETH_P_IP))
+ return tcp_v4_conn_request(sk, skb);
+
+ if (!ipv6_unicast_destination(skb))
+ goto drop;
+
+- if ((sysctl_tcp_syncookies == 2 ||
+- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
+- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
+- if (!want_cookie)
+- goto drop;
+- }
+-
+- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+- goto drop;
+- }
+-
+- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
+- if (req == NULL)
+- goto drop;
+-
+-#ifdef CONFIG_TCP_MD5SIG
+- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
+-#endif
+-
+- tcp_clear_options(&tmp_opt);
+- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
+- tmp_opt.user_mss = tp->rx_opt.user_mss;
+- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
+-
+- if (want_cookie && !tmp_opt.saw_tstamp)
+- tcp_clear_options(&tmp_opt);
++ return tcp_conn_request(&tcp6_request_sock_ops,
++ &tcp_request_sock_ipv6_ops, sk, skb);
+
+- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
+- tcp_openreq_init(req, &tmp_opt, skb);
+-
+- ireq = inet_rsk(req);
+- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
+- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
+- if (!want_cookie || tmp_opt.tstamp_ok)
+- TCP_ECN_create_request(req, skb, sock_net(sk));
+-
+- ireq->ir_iif = sk->sk_bound_dev_if;
+- ireq->ir_mark = inet_request_mark(sk, skb);
+-
+- /* So that link locals have meaning */
+- if (!sk->sk_bound_dev_if &&
+- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
+- ireq->ir_iif = inet6_iif(skb);
+-
+- if (!isn) {
+- if (ipv6_opt_accepted(sk, skb) ||
+- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
+- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
+- np->repflow) {
+- atomic_inc(&skb->users);
+- ireq->pktopts = skb;
+- }
+-
+- if (want_cookie) {
+- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
+- req->cookie_ts = tmp_opt.tstamp_ok;
+- goto have_isn;
+- }
+-
+- /* VJ's idea. We save last timestamp seen
+- * from the destination in peer table, when entering
+- * state TIME-WAIT, and check against it before
+- * accepting new connection request.
+- *
+- * If "isn" is not zero, this request hit alive
+- * timewait bucket, so that all the necessary checks
+- * are made in the function processing timewait state.
+- */
+- if (tmp_opt.saw_tstamp &&
+- tcp_death_row.sysctl_tw_recycle &&
+- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
+- if (!tcp_peer_is_proven(req, dst, true)) {
+- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
+- goto drop_and_release;
+- }
+- }
+- /* Kill the following clause, if you dislike this way. */
+- else if (!sysctl_tcp_syncookies &&
+- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
+- (sysctl_max_syn_backlog >> 2)) &&
+- !tcp_peer_is_proven(req, dst, false)) {
+- /* Without syncookies last quarter of
+- * backlog is filled with destinations,
+- * proven to be alive.
+- * It means that we continue to communicate
+- * to destinations, already remembered
+- * to the moment of synflood.
+- */
+- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
+- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
+- goto drop_and_release;
+- }
+-
+- isn = tcp_v6_init_sequence(skb);
+- }
+-have_isn:
+-
+- if (security_inet_conn_request(sk, skb, req))
+- goto drop_and_release;
+-
+- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->snt_isn = isn;
+- tcp_rsk(req)->snt_synack = tcp_time_stamp;
+- tcp_openreq_init_rwin(req, sk, dst);
+- fastopen = !want_cookie &&
+- tcp_try_fastopen(sk, skb, req, &foc, dst);
+- err = tcp_v6_send_synack(sk, dst, &fl6, req,
+- skb_get_queue_mapping(skb), &foc);
+- if (!fastopen) {
+- if (err || want_cookie)
+- goto drop_and_free;
+-
+- tcp_rsk(req)->listener = NULL;
+- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+- }
+- return 0;
+-
+-drop_and_release:
+- dst_release(dst);
+-drop_and_free:
+- reqsk_free(req);
+ drop:
+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
+ return 0; /* don't send reset */
+ }
+
+-static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+- struct request_sock *req,
+- struct dst_entry *dst)
++struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
++ struct request_sock *req,
++ struct dst_entry *dst)
+ {
+ struct inet_request_sock *ireq;
+ struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
+@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
+
+ newsk->sk_v6_rcv_saddr = newnp->saddr;
+
+- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(newsk))
++ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
++ else
++#endif
++ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
+ newsk->sk_backlog_rcv = tcp_v4_do_rcv;
+ #ifdef CONFIG_TCP_MD5SIG
+ newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
+@@ -1329,7 +1292,7 @@ out:
+ * This is because we cannot sleep with the original spinlock
+ * held.
+ */
+-static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
++int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ {
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct tcp_sock *tp;
+@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
+ goto discard;
+ #endif
+
++ if (is_meta_sk(sk))
++ return mptcp_v6_do_rcv(sk, skb);
++
+ if (sk_filter(sk, skb))
+ goto discard;
+
+@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ {
+ const struct tcphdr *th;
+ const struct ipv6hdr *hdr;
+- struct sock *sk;
++ struct sock *sk, *meta_sk = NULL;
+ int ret;
+ struct net *net = dev_net(skb->dev);
+
+@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff*4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
++#ifdef CONFIG_MPTCP
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++ TCP_SKB_CB(skb)->dss_off = 0;
++#endif
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
+ TCP_SKB_CB(skb)->sacked = 0;
+
+ sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
+- if (!sk)
+- goto no_tcp_socket;
+
+ process:
+- if (sk->sk_state == TCP_TIME_WAIT)
++ if (sk && sk->sk_state == TCP_TIME_WAIT)
+ goto do_time_wait;
+
++#ifdef CONFIG_MPTCP
++ if (!sk && th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, NULL);
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++
++ /* Is there a pending request sock for this segment ? */
++ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
++ if (sk)
++ sock_put(sk);
++ return 0;
++ }
++#endif
++
++ if (!sk)
++ goto no_tcp_socket;
++
+ if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+ goto discard_and_relse;
+@@ -1529,11 +1520,21 @@ process:
+ sk_mark_napi_id(sk, skb);
+ skb->dev = NULL;
+
+- bh_lock_sock_nested(sk);
++ if (mptcp(tcp_sk(sk))) {
++ meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk))
++ skb->sk = sk;
++ } else {
++ meta_sk = sk;
++ bh_lock_sock_nested(sk);
++ }
++
+ ret = 0;
+- if (!sock_owned_by_user(sk)) {
++ if (!sock_owned_by_user(meta_sk)) {
+ #ifdef CONFIG_NET_DMA
+- struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp = tcp_sk(meta_sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = net_dma_find_channel();
+ if (tp->ucopy.dma_chan)
+@@ -1541,16 +1542,17 @@ process:
+ else
+ #endif
+ {
+- if (!tcp_prequeue(sk, skb))
++ if (!tcp_prequeue(meta_sk, skb))
+ ret = tcp_v6_do_rcv(sk, skb);
+ }
+- } else if (unlikely(sk_add_backlog(sk, skb,
+- sk->sk_rcvbuf + sk->sk_sndbuf))) {
+- bh_unlock_sock(sk);
++ } else if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
+ goto discard_and_relse;
+ }
+- bh_unlock_sock(sk);
++
++ bh_unlock_sock(meta_sk);
+
+ sock_put(sk);
+ return ret ? -1 : 0;
+@@ -1607,6 +1609,18 @@ do_time_wait:
+ sk = sk2;
+ goto process;
+ }
++#ifdef CONFIG_MPTCP
++ if (th->syn && !th->ack) {
++ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
++
++ if (ret < 0) {
++ tcp_v6_send_reset(NULL, skb);
++ goto discard_it;
++ } else if (ret > 0) {
++ return 0;
++ }
++ }
++#endif
+ /* Fall through to ACK */
+ }
+ case TCP_TW_ACK:
+@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
+ }
+ }
+
+-static struct timewait_sock_ops tcp6_timewait_sock_ops = {
++struct timewait_sock_ops tcp6_timewait_sock_ops = {
+ .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
+ .twsk_unique = tcp_twsk_unique,
+ .twsk_destructor = tcp_twsk_destructor,
+@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
+
+ tcp_init_sock(sk);
+
+- icsk->icsk_af_ops = &ipv6_specific;
++#ifdef CONFIG_MPTCP
++ if (is_mptcp_enabled(sk))
++ icsk->icsk_af_ops = &mptcp_v6_specific;
++ else
++#endif
++ icsk->icsk_af_ops = &ipv6_specific;
+
+ #ifdef CONFIG_TCP_MD5SIG
+ tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
+@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
+ return 0;
+ }
+
+-static void tcp_v6_destroy_sock(struct sock *sk)
++void tcp_v6_destroy_sock(struct sock *sk)
+ {
+ tcp_v4_destroy_sock(sk);
+ inet6_destroy_sock(sk);
+@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
+ static void tcp_v6_clear_sk(struct sock *sk, int size)
+ {
+ struct inet_sock *inet = inet_sk(sk);
++#ifdef CONFIG_MPTCP
++ struct tcp_sock *tp = tcp_sk(sk);
++ /* size_tk_table goes from the end of tk_table to the end of sk */
++ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
++ sizeof(tp->tk_table);
++#endif
+
+ /* we do not want to clear pinet6 field, because of RCU lookups */
+ sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
+
+ size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
++
++#ifdef CONFIG_MPTCP
++ /* We zero out only from pinet6 to tk_table */
++ size -= size_tk_table + sizeof(tp->tk_table);
++#endif
+ memset(&inet->pinet6 + 1, 0, size);
++
++#ifdef CONFIG_MPTCP
++ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
++#endif
++
+ }
+
+ struct proto tcpv6_prot = {
+diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
+new file mode 100644
+index 000000000000..cdfc03adabf8
+--- /dev/null
++++ b/net/mptcp/Kconfig
+@@ -0,0 +1,115 @@
++#
++# MPTCP configuration
++#
++config MPTCP
++ bool "MPTCP protocol"
++ depends on (IPV6=y || IPV6=n)
++ ---help---
++ This replaces the normal TCP stack with a Multipath TCP stack,
++ able to use several paths at once.
++
++menuconfig MPTCP_PM_ADVANCED
++ bool "MPTCP: advanced path-manager control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different path-managers. You should choose 'Y' here,
++ because otherwise you will not actively create new MPTCP-subflows.
++
++if MPTCP_PM_ADVANCED
++
++config MPTCP_FULLMESH
++ tristate "MPTCP Full-Mesh Path-Manager"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create a full-mesh among all IP-addresses.
++
++config MPTCP_NDIFFPORTS
++ tristate "MPTCP ndiff-ports"
++ depends on MPTCP=y
++ ---help---
++ This path-management module will create multiple subflows between the same
++ pair of IP-addresses, modifying the source-port. You can set the number
++ of subflows via the mptcp_ndiffports-sysctl.
++
++config MPTCP_BINDER
++ tristate "MPTCP Binder"
++ depends on (MPTCP=y)
++ ---help---
++ This path-management module works like ndiffports, and adds the sysctl
++ option to set the gateway (and/or path to) per each additional subflow
++ via Loose Source Routing (IPv4 only).
++
++choice
++ prompt "Default MPTCP Path-Manager"
++ default DEFAULT
++ help
++ Select the Path-Manager of your choice
++
++ config DEFAULT_FULLMESH
++ bool "Full mesh" if MPTCP_FULLMESH=y
++
++ config DEFAULT_NDIFFPORTS
++ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
++
++ config DEFAULT_BINDER
++ bool "binder" if MPTCP_BINDER=y
++
++ config DEFAULT_DUMMY
++ bool "Default"
++
++endchoice
++
++endif
++
++config DEFAULT_MPTCP_PM
++ string
++ default "default" if DEFAULT_DUMMY
++ default "fullmesh" if DEFAULT_FULLMESH
++ default "ndiffports" if DEFAULT_NDIFFPORTS
++ default "binder" if DEFAULT_BINDER
++ default "default"
++
++menuconfig MPTCP_SCHED_ADVANCED
++ bool "MPTCP: advanced scheduler control"
++ depends on MPTCP=y
++ ---help---
++ Support for selection of different schedulers. You should choose 'Y' here,
++ if you want to choose a different scheduler than the default one.
++
++if MPTCP_SCHED_ADVANCED
++
++config MPTCP_ROUNDROBIN
++ tristate "MPTCP Round-Robin"
++ depends on (MPTCP=y)
++ ---help---
++ This is a very simple round-robin scheduler. Probably has bad performance
++ but might be interesting for researchers.
++
++choice
++ prompt "Default MPTCP Scheduler"
++ default DEFAULT
++ help
++ Select the Scheduler of your choice
++
++ config DEFAULT_SCHEDULER
++ bool "Default"
++ ---help---
++ This is the default scheduler, sending first on the subflow
++ with the lowest RTT.
++
++ config DEFAULT_ROUNDROBIN
++ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
++ ---help---
++ This is the round-robin scheduler, sending in a round-robin
++ fashion.
++
++endchoice
++endif
++
++config DEFAULT_MPTCP_SCHED
++ string
++ depends on (MPTCP=y)
++ default "default" if DEFAULT_SCHEDULER
++ default "roundrobin" if DEFAULT_ROUNDROBIN
++ default "default"
++
+diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
+new file mode 100644
+index 000000000000..35561a7012e3
+--- /dev/null
++++ b/net/mptcp/Makefile
+@@ -0,0 +1,20 @@
++#
++## Makefile for MultiPath TCP support code.
++#
++#
++
++obj-$(CONFIG_MPTCP) += mptcp.o
++
++mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
++ mptcp_output.o mptcp_input.o mptcp_sched.o
++
++obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
++obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
++obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
++obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
++obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
++
++mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
++
+diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
+new file mode 100644
+index 000000000000..95d8da560715
+--- /dev/null
++++ b/net/mptcp/mptcp_binder.c
+@@ -0,0 +1,487 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#include <linux/route.h>
++#include <linux/inet.h>
++#include <linux/mroute.h>
++#include <linux/spinlock_types.h>
++#include <net/inet_ecn.h>
++#include <net/route.h>
++#include <net/xfrm.h>
++#include <net/compat.h>
++#include <linux/slab.h>
++
++#define MPTCP_GW_MAX_LISTS 10
++#define MPTCP_GW_LIST_MAX_LEN 6
++#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
++ MPTCP_GW_MAX_LISTS)
++
++struct mptcp_gw_list {
++ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
++ u8 len[MPTCP_GW_MAX_LISTS];
++};
++
++struct binder_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++
++ /* Prevent multiple sub-sockets concurrently iterating over sockets */
++ spinlock_t *flow_lock;
++};
++
++static struct mptcp_gw_list *mptcp_gws;
++static rwlock_t mptcp_gws_lock;
++
++static int mptcp_binder_ndiffports __read_mostly = 1;
++
++static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
++
++static int mptcp_get_avail_list_ipv4(struct sock *sk)
++{
++ int i, j, list_taken, opt_ret, opt_len;
++ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
++
++ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
++ if (mptcp_gws->len[i] == 0)
++ goto error;
++
++ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
++ list_taken = 0;
++
++ /* Loop through all sub-sockets in this connection */
++ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
++
++ /* Reset length and options buffer, then retrieve
++ * from socket
++ */
++ opt_len = MAX_IPOPTLEN;
++ memset(opt, 0, MAX_IPOPTLEN);
++ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
++ IP_OPTIONS, opt, &opt_len);
++ if (opt_ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, opt_ret);
++ goto error;
++ }
++
++ /* If socket has no options, it has no stake in this list */
++ if (opt_len <= 0)
++ continue;
++
++ /* Iterate options buffer */
++ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
++ if (*opt_ptr == IPOPT_LSRR) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
++ goto sock_lsrr;
++ }
++ }
++ continue;
++
++sock_lsrr:
++ /* Pointer to the 2nd to last address */
++ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
++
++ /* Addresses start 3 bytes after type offset */
++ opt_ptr += 3;
++ j = 0;
++
++ /* Different length lists cannot be the same */
++ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
++ continue;
++
++ /* Iterate if we are still inside options list
++ * and sysctl list
++ */
++ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
++ /* If there is a different address, this list must
++ * not be set on this socket
++ */
++ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
++ break;
++
++ /* Jump 4 bytes to next address */
++ opt_ptr += 4;
++ j++;
++ }
++
++ /* Reached the end without a differing address, lists
++ * are therefore identical.
++ */
++ if (j == mptcp_gws->len[i]) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
++ list_taken = 1;
++ break;
++ }
++ }
++
++ /* Free list found if not taken by a socket */
++ if (!list_taken) {
++ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
++ break;
++ }
++ }
++
++ if (i >= MPTCP_GW_MAX_LISTS)
++ goto error;
++
++ return i;
++error:
++ return -1;
++}
++
++/* The list of addresses is parsed each time a new connection is opened,
++ * to make sure it's up to date. In case of error, all the lists are
++ * marked as unavailable and the subflow's fingerprint is set to 0.
++ */
++static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
++{
++ int i, j, ret;
++ unsigned char opt[MAX_IPOPTLEN] = {0};
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
++
++ /* Read lock: multiple sockets can read LSRR addresses at the same
++ * time, but writes are done in mutual exclusion.
++ * Spin lock: must search for free list for one socket at a time, or
++ * multiple sockets could take the same list.
++ */
++ read_lock(&mptcp_gws_lock);
++ spin_lock(fmp->flow_lock);
++
++ i = mptcp_get_avail_list_ipv4(sk);
++
++ /* Execution enters here only if a free path is found.
++ */
++ if (i >= 0) {
++ opt[0] = IPOPT_NOP;
++ opt[1] = IPOPT_LSRR;
++ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
++ (mptcp_gws->len[i] + 1) + 3;
++ opt[3] = IPOPT_MINOFF;
++ for (j = 0; j < mptcp_gws->len[i]; ++j)
++ memcpy(opt + 4 +
++ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
++ &mptcp_gws->list[i][j].s_addr,
++ sizeof(mptcp_gws->list[i][0].s_addr));
++ /* Final destination must be part of IP_OPTIONS parameter. */
++ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
++ sizeof(addr.s_addr));
++
++ /* setsockopt must be inside the lock, otherwise another
++ * subflow could fail to see that we have taken a list.
++ */
++ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
++ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
++ * (mptcp_gws->len[i] + 1));
++
++ if (ret < 0) {
++ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
++ __func__, ret);
++ }
++ }
++
++ spin_unlock(fmp->flow_lock);
++ read_unlock(&mptcp_gws_lock);
++
++ return;
++}
++
++/* Parses gateways string for a list of paths to different
++ * gateways, and stores them for use with the Loose Source Routing (LSRR)
++ * socket option. Each list must have "," separated addresses, and the lists
++ * themselves must be separated by "-". Returns -1 in case one or more of the
++ * addresses is not a valid ipv4/6 address.
++ */
++static int mptcp_parse_gateway_ipv4(char *gateways)
++{
++ int i, j, k, ret;
++ char *tmp_string = NULL;
++ struct in_addr tmp_addr;
++
++ tmp_string = kzalloc(16, GFP_KERNEL);
++ if (tmp_string == NULL)
++ return -ENOMEM;
++
++ write_lock(&mptcp_gws_lock);
++
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++
++ /* A TMP string is used since in4_pton needs a null-terminated string
++ * but we do not want to modify the sysctl for obvious reasons.
++ * i will iterate over the SYSCTL string, j will iterate over the
++ * temporary string where each IP is copied into, k will iterate over
++ * the IPs in each list.
++ */
++ for (i = j = k = 0;
++ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
++ ++i) {
++ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
++ /* If the temp IP is empty and the current list is
++ * empty, we are done.
++ */
++ if (j == 0 && mptcp_gws->len[k] == 0)
++ break;
++
++ /* Terminate the temp IP string, then if it is
++ * non-empty parse the IP and copy it.
++ */
++ tmp_string[j] = '\0';
++ if (j > 0) {
++ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
++
++ ret = in4_pton(tmp_string, strlen(tmp_string),
++ (u8 *)&tmp_addr.s_addr, '\0',
++ NULL);
++
++ if (ret) {
++ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
++ ret,
++ &tmp_addr.s_addr);
++ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
++ &tmp_addr.s_addr,
++ sizeof(tmp_addr.s_addr));
++ mptcp_gws->len[k]++;
++ j = 0;
++ tmp_string[j] = '\0';
++ /* Since we can't impose a limit to
++ * what the user can input, make sure
++ * there are not too many IPs in the
++ * SYSCTL string.
++ */
++ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
++ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
++ k,
++ MPTCP_GW_LIST_MAX_LEN);
++ goto error;
++ }
++ } else {
++ goto error;
++ }
++ }
++
++ if (gateways[i] == '-' || gateways[i] == '\0')
++ ++k;
++ } else {
++ tmp_string[j] = gateways[i];
++ ++j;
++ }
++ }
++
++ /* Number of flows is number of gateway lists plus master flow */
++ mptcp_binder_ndiffports = k+1;
++
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++
++ return 0;
++
++error:
++ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
++ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
++ write_unlock(&mptcp_gws_lock);
++ kfree(tmp_string);
++ return -1;
++}
++
++/**
++ * Create all new subflows, by doing calls to mptcp_initX_subsockets
++ *
++ * This function uses a goto next_subflow, to allow releasing the lock between
++ * new subflows and giving other processes a chance to do some work on the
++ * socket and potentially finishing the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct binder_priv *pm_priv = container_of(work,
++ struct binder_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (mptcp_binder_ndiffports > iter &&
++ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void binder_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
++ static DEFINE_SPINLOCK(flow_lock);
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(meta_sk)) {
++ mptcp_fallback_default(mpcb);
++ return;
++ }
++#endif
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ fmp->flow_lock = &flow_lock;
++}
++
++static void binder_create_subflows(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++/* Callback function, executed when the sysctl mptcp.mptcp_gateways is updated.
++ * Inspired from proc_tcp_congestion_control().
++ */
++static int proc_mptcp_gateways(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ int ret;
++ ctl_table tbl = {
++ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
++ };
++
++ if (write) {
++ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
++ if (tbl.data == NULL)
++ return -ENOMEM;
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (ret == 0) {
++ ret = mptcp_parse_gateway_ipv4(tbl.data);
++ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
++ }
++ kfree(tbl.data);
++ } else {
++ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
++ }
++
++
++ return ret;
++}
++
++static struct mptcp_pm_ops binder __read_mostly = {
++ .new_session = binder_new_session,
++ .fully_established = binder_create_subflows,
++ .get_local_id = binder_get_local_id,
++ .init_subsocket_v4 = mptcp_v4_add_lsrr,
++ .name = "binder",
++ .owner = THIS_MODULE,
++};
++
++static struct ctl_table binder_table[] = {
++ {
++ .procname = "mptcp_binder_gateways",
++ .data = &sysctl_mptcp_binder_gateways,
++ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
++ .mode = 0644,
++ .proc_handler = &proc_mptcp_gateways
++ },
++ { }
++};
++
++struct ctl_table_header *mptcp_sysctl_binder;
++
++/* General initialization of MPTCP_PM */
++static int __init binder_register(void)
++{
++ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
++ if (!mptcp_gws)
++ return -ENOMEM;
++
++ rwlock_init(&mptcp_gws_lock);
++
++ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
++
++ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
++ binder_table);
++ if (!mptcp_sysctl_binder)
++ goto sysctl_fail;
++
++ if (mptcp_register_path_manager(&binder))
++ goto pm_failed;
++
++ return 0;
++
++pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++sysctl_fail:
++ kfree(mptcp_gws);
++
++ return -1;
++}
++
++static void binder_unregister(void)
++{
++ mptcp_unregister_path_manager(&binder);
++ unregister_net_sysctl_table(mptcp_sysctl_binder);
++ kfree(mptcp_gws);
++}
++
++module_init(binder_register);
++module_exit(binder_unregister);
++
++MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("BINDER MPTCP");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
+new file mode 100644
+index 000000000000..5d761164eb85
+--- /dev/null
++++ b/net/mptcp/mptcp_coupled.c
+@@ -0,0 +1,270 @@
++/*
++ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* Scaling is done in the numerator with alpha_scale_num and in the denominator
++ * with alpha_scale_den.
++ *
++ * To downscale, we just need to use alpha_scale.
++ *
++ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
++ */
++static int alpha_scale_den = 10;
++static int alpha_scale_num = 32;
++static int alpha_scale = 12;
++
++struct mptcp_ccc {
++ u64 alpha;
++ bool forced_update;
++};
++
++static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
++}
++
++static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
++}
++
++static inline u64 mptcp_ccc_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_ccc_recalc_alpha(const struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ const struct sock *sub_sk;
++ int best_cwnd = 0, best_rtt = 0, can_send = 0;
++ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior
++ * (set alpha to 1)
++ */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Do regular alpha-calculation for multiple subflows */
++
++ /* Find the max numerator of the alpha-calculation */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++ /* We need to look for the path that provides the max value.
++ * Integer overflow is not possible here, because
++ * tmp is a u64.
++ */
++ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
++
++ if (tmp >= max_numerator) {
++ max_numerator = tmp;
++ best_cwnd = sub_tp->snd_cwnd;
++ best_rtt = sub_tp->srtt_us;
++ }
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ /* Calculate the denominator */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++
++ if (!mptcp_ccc_sk_can_send(sub_sk))
++ continue;
++
++ sum_denominator += div_u64(
++ mptcp_ccc_scale(sub_tp->snd_cwnd,
++ alpha_scale_den) * best_rtt,
++ sub_tp->srtt_us);
++ }
++ sum_denominator *= sum_denominator;
++ if (unlikely(!sum_denominator)) {
++ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
++ __func__, mpcb->cnt_established);
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
++ __func__, sub_tp->mptcp->path_index,
++ sub_sk->sk_state, sub_tp->srtt_us,
++ sub_tp->snd_cwnd);
++ }
++ }
++
++ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
++
++ if (unlikely(!alpha))
++ alpha = 1;
++
++exit:
++ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
++}
++
++static void mptcp_ccc_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
++ }
++ /* If we are not using mptcp, behave like reno: return */
++}
++
++static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_LOSS)
++ mptcp_ccc_recalc_alpha(sk);
++}
++
++static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(mptcp_meta_sk(sk), 1);
++}
++
++static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_ccc_recalc_alpha(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_ccc_recalc_alpha(sk);
++ mptcp_set_forced(mptcp_meta_sk(sk), 0);
++ }
++
++ if (mpcb->cnt_established > 1) {
++ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
++
++ /* This may happen if, at initialization, the mpcb
++ * was not yet attached to the sock, and thus
++ * initializing alpha failed.
++ */
++ if (unlikely(!alpha))
++ alpha = 1;
++
++ snd_cwnd = (int)div_u64((u64)mptcp_ccc_scale(1, alpha_scale),
++ alpha);
++
++ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
++ * Thus, we select here the max value.
++ */
++ if (snd_cwnd < tp->snd_cwnd)
++ snd_cwnd = tp->snd_cwnd;
++ } else {
++ snd_cwnd = tp->snd_cwnd;
++ }
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_ccc_recalc_alpha(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_ccc = {
++ .init = mptcp_ccc_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_ccc_cong_avoid,
++ .cwnd_event = mptcp_ccc_cwnd_event,
++ .set_state = mptcp_ccc_set_state,
++ .owner = THIS_MODULE,
++ .name = "lia",
++};
++
++static int __init mptcp_ccc_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_ccc);
++}
++
++static void __exit mptcp_ccc_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_ccc);
++}
++
++module_init(mptcp_ccc_register);
++module_exit(mptcp_ccc_unregister);
++
++MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
+new file mode 100644
+index 000000000000..28dfa0479f5e
+--- /dev/null
++++ b/net/mptcp/mptcp_ctrl.c
+@@ -0,0 +1,2401 @@
++/*
++ * MPTCP implementation - MPTCP-control
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/inet_common.h>
++#include <net/inet6_hashtables.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/ip6_route.h>
++#include <net/mptcp_v6.h>
++#endif
++#include <net/sock.h>
++#include <net/tcp.h>
++#include <net/tcp_states.h>
++#include <net/transp_v6.h>
++#include <net/xfrm.h>
++
++#include <linux/cryptohash.h>
++#include <linux/kconfig.h>
++#include <linux/module.h>
++#include <linux/netpoll.h>
++#include <linux/list.h>
++#include <linux/jhash.h>
++#include <linux/tcp.h>
++#include <linux/net.h>
++#include <linux/in.h>
++#include <linux/random.h>
++#include <linux/inetdevice.h>
++#include <linux/workqueue.h>
++#include <linux/atomic.h>
++#include <linux/sysctl.h>
++
++static struct kmem_cache *mptcp_sock_cache __read_mostly;
++static struct kmem_cache *mptcp_cb_cache __read_mostly;
++static struct kmem_cache *mptcp_tw_cache __read_mostly;
++
++int sysctl_mptcp_enabled __read_mostly = 1;
++int sysctl_mptcp_checksum __read_mostly = 1;
++int sysctl_mptcp_debug __read_mostly;
++EXPORT_SYMBOL(sysctl_mptcp_debug);
++int sysctl_mptcp_syn_retries __read_mostly = 3;
++
++bool mptcp_init_failed __read_mostly;
++
++struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
++EXPORT_SYMBOL(mptcp_static_key);
++
++static int proc_mptcp_path_manager(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_PM_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_path_manager(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_path_manager(val);
++ return ret;
++}
++
++static int proc_mptcp_scheduler(ctl_table *ctl, int write,
++ void __user *buffer, size_t *lenp,
++ loff_t *ppos)
++{
++ char val[MPTCP_SCHED_NAME_MAX];
++ ctl_table tbl = {
++ .data = val,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ };
++ int ret;
++
++ mptcp_get_default_scheduler(val);
++
++ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
++ if (write && ret == 0)
++ ret = mptcp_set_default_scheduler(val);
++ return ret;
++}
++
++static struct ctl_table mptcp_table[] = {
++ {
++ .procname = "mptcp_enabled",
++ .data = &sysctl_mptcp_enabled,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_checksum",
++ .data = &sysctl_mptcp_checksum,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_debug",
++ .data = &sysctl_mptcp_debug,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_syn_retries",
++ .data = &sysctl_mptcp_syn_retries,
++ .maxlen = sizeof(int),
++ .mode = 0644,
++ .proc_handler = &proc_dointvec
++ },
++ {
++ .procname = "mptcp_path_manager",
++ .mode = 0644,
++ .maxlen = MPTCP_PM_NAME_MAX,
++ .proc_handler = proc_mptcp_path_manager,
++ },
++ {
++ .procname = "mptcp_scheduler",
++ .mode = 0644,
++ .maxlen = MPTCP_SCHED_NAME_MAX,
++ .proc_handler = proc_mptcp_scheduler,
++ },
++ { }
++};
++
++static inline u32 mptcp_hash_tk(u32 token)
++{
++ return token % MPTCP_HASH_SIZE;
++}
++
++struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
++EXPORT_SYMBOL(tk_hashtable);
++
++/* This second hashtable is needed to retrieve request socks
++ * created as a result of a join request. While the SYN contains
++ * the token, the final ack does not, so we need a separate hashtable
++ * to retrieve the mpcb.
++ */
++struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
++
++/* The following hash table is used to avoid collision of token */
++static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
++spinlock_t mptcp_tk_hashlock; /* hashtable protection */
++
++static bool mptcp_reqsk_find_tk(const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct mptcp_request_sock *mtreqsk;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
++ &mptcp_reqsk_tk_htb[hash], hash_entry) {
++ if (token == mtreqsk->mptcp_loc_token)
++ return true;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++
++ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
++ &mptcp_reqsk_tk_htb[hash]);
++}
++
++static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++void mptcp_reqsk_destructor(struct request_sock *req)
++{
++ if (!mptcp_rsk(req)->is_sub) {
++ if (in_softirq()) {
++ mptcp_reqsk_remove_tk(req);
++ } else {
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++ }
++ } else {
++ mptcp_hash_request_remove(req);
++ }
++}
++
++static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
++{
++ u32 hash = mptcp_hash_tk(token);
++ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
++ meta_tp->inside_tk_table = 1;
++}
++
++static bool mptcp_find_token(u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ const struct hlist_nulls_node *node;
++
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
++ if (token == meta_tp->mptcp_loc_token)
++ return true;
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++ return false;
++}
++
++static void mptcp_set_key_reqsk(struct request_sock *req,
++ const struct sk_buff *skb)
++{
++ const struct inet_request_sock *ireq = inet_rsk(req);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ htons(ireq->ir_num),
++ ireq->ir_rmt_port);
++#endif
++ }
++
++ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
++}
++
++/* New MPTCP-connection request, prepare a new token for the meta-socket that
++ * will be created in mptcp_check_req_master(), and store the received token.
++ */
++void mptcp_reqsk_new_mptcp(struct request_sock *req,
++ const struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++
++ inet_rsk(req)->saw_mpc = 1;
++
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_reqsk(req, skb);
++ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
++ mptcp_find_token(mtreq->mptcp_loc_token));
++
++ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ mtreq->mptcp_rem_key = mopt->mptcp_key;
++}
++
++static void mptcp_set_key_sk(const struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_sock *isk = inet_sk(sk);
++
++ if (sk->sk_family == AF_INET)
++ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
++ isk->inet_daddr,
++ isk->inet_sport,
++ isk->inet_dport);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
++ sk->sk_v6_daddr.s6_addr32,
++ isk->inet_sport,
++ isk->inet_dport);
++#endif
++
++ mptcp_key_sha1(tp->mptcp_loc_key,
++ &tp->mptcp_loc_token, NULL);
++}
++
++void mptcp_connect_init(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ do {
++ mptcp_set_key_sk(sk);
++ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
++ mptcp_find_token(tp->mptcp_loc_token));
++
++ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++/**
++ * This function increments the refcount of the mpcb struct.
++ * It is the responsibility of the caller to decrement when releasing
++ * the structure.
++ */
++struct sock *mptcp_hash_find(const struct net *net, const u32 token)
++{
++ const u32 hash = mptcp_hash_tk(token);
++ const struct tcp_sock *meta_tp;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
++ tk_table) {
++ meta_sk = (struct sock *)meta_tp;
++ if (token == meta_tp->mptcp_loc_token &&
++ net_eq(net, sock_net(meta_sk))) {
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ goto out;
++ if (unlikely(token != meta_tp->mptcp_loc_token ||
++ !net_eq(net, sock_net(meta_sk)))) {
++ sock_gen_put(meta_sk);
++ goto begin;
++ }
++ goto found;
++ }
++ }
++ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash)
++ goto begin;
++out:
++ meta_sk = NULL;
++found:
++ rcu_read_unlock();
++ return meta_sk;
++}
++
++void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
++{
++ /* remove from the token hashtable */
++ rcu_read_lock_bh();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock_bh();
++}
++
++void mptcp_hash_remove(struct tcp_sock *meta_tp)
++{
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
++ meta_tp->inside_tk_table = 0;
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++}
++
++struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
++ u32 min_time = 0, last_active = 0;
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u32 elapsed;
++
++ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
++ continue;
++
++ elapsed = keepalive_time_elapsed(tp);
++
++ /* We take the one with the lowest RTT within a reasonable
++ * (meta-RTO)-timeframe
++ */
++ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
++ if (!min_time || tp->srtt_us < min_time) {
++ min_time = tp->srtt_us;
++ rttsk = sk;
++ }
++ continue;
++ }
++
++ /* Otherwise, we just take the most recent active */
++ if (!rttsk && (!last_active || elapsed < last_active)) {
++ last_active = elapsed;
++ lastsk = sk;
++ }
++ }
++
++ if (rttsk)
++ return rttsk;
++
++ return lastsk;
++}
++EXPORT_SYMBOL(mptcp_select_ack_sock);
++
++static void mptcp_sock_def_error_report(struct sock *sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (!sock_flag(sk, SOCK_DEAD))
++ mptcp_sub_close(sk, 0);
++
++ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping) {
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ meta_sk->sk_err = sk->sk_err;
++ meta_sk->sk_err_soft = sk->sk_err_soft;
++
++ if (!sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_error_report(meta_sk);
++
++ tcp_done(meta_sk);
++ }
++
++ sk->sk_err = 0;
++ return;
++}
++
++static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
++{
++ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
++ mptcp_cleanup_path_manager(mpcb);
++ mptcp_cleanup_scheduler(mpcb);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ }
++}
++
++static void mptcp_sock_destruct(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ inet_sock_destruct(sk);
++
++ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
++ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
++
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ tp->mptcp = NULL;
++
++ /* Taken when mpcb pointer was set */
++ sock_put(mptcp_meta_sk(sk));
++ mptcp_mpcb_put(tp->mpcb);
++ } else {
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct mptcp_tw *mptw;
++
++ /* The mpcb is disappearing - we can make the final
++ * update to the rcv_nxt of the time-wait-sock and remove
++ * its reference to the mpcb.
++ */
++ spin_lock_bh(&mpcb->tw_lock);
++ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
++ list_del_rcu(&mptw->list);
++ mptw->in_list = 0;
++ mptcp_mpcb_put(mpcb);
++ rcu_assign_pointer(mptw->mpcb, NULL);
++ }
++ spin_unlock_bh(&mpcb->tw_lock);
++
++ mptcp_mpcb_put(mpcb);
++
++ mptcp_debug("%s destroying meta-sk\n", __func__);
++ }
++
++ WARN_ON(!static_key_false(&mptcp_static_key));
++ /* Must be the last call, because is_meta_sk() above still needs the
++ * static key
++ */
++ static_key_slow_dec(&mptcp_static_key);
++}
++
++void mptcp_destroy_sock(struct sock *sk)
++{
++ if (is_meta_sk(sk)) {
++ struct sock *sk_it, *tmpsk;
++
++ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
++ mptcp_purge_ofo_queue(tcp_sk(sk));
++
++ /* We have to close all remaining subflows. Normally, they
++ * should all be about to get closed. But, if the kernel is
++ * forcing a closure (e.g., tcp_write_err), the subflows might
++ * not have been closed properly (as we are waiting for the
++ * DATA_ACK of the DATA_FIN).
++ */
++ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
++ /* Already did call tcp_close - waiting for graceful
++ * closure, or if we are retransmitting fast-close on
++ * the subflow. The reset (or timeout) will kill the
++ * subflow..
++ */
++ if (tcp_sk(sk_it)->closing ||
++ tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ /* Allow the delayed work first to prevent time-wait state */
++ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
++ continue;
++
++ mptcp_sub_close(sk_it, 0);
++ }
++
++ mptcp_delete_synack_timer(sk);
++ } else {
++ mptcp_del_sock(sk);
++ }
++}
++
++static void mptcp_set_state(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* Meta is not yet established - wake up the application */
++ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
++ sk->sk_state == TCP_ESTABLISHED) {
++ tcp_set_state(meta_sk, TCP_ESTABLISHED);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
++ }
++ }
++
++ if (sk->sk_state == TCP_ESTABLISHED) {
++ tcp_sk(sk)->mptcp->establish_increased = 1;
++ tcp_sk(sk)->mpcb->cnt_established++;
++ }
++}
++
++void mptcp_init_congestion_control(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
++ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
++
++ /* The application didn't set the congestion control to use;
++ * fall back to the default one.
++ */
++ if (ca == &tcp_init_congestion_ops)
++ goto use_default;
++
++ /* Use the same congestion control as set by the user. If the
++ * module is not available, fall back to the default one.
++ */
++ if (!try_module_get(ca->owner)) {
++ pr_warn("%s: fallback to the system default CC\n", __func__);
++ goto use_default;
++ }
++
++ icsk->icsk_ca_ops = ca;
++ if (icsk->icsk_ca_ops->init)
++ icsk->icsk_ca_ops->init(sk);
++
++ return;
++
++use_default:
++ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
++ tcp_init_congestion_control(sk);
++}
++
++u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
++u32 mptcp_seed = 0;
++
++void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
++ u8 input[64];
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Initialize input with appropriate padding */
++ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
++ * is explicitly set too
++ */
++ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
++ input[8] = 0x80; /* Padding: First bit after message = 1 */
++ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
++
++ sha_init(mptcp_hashed_key);
++ sha_transform(mptcp_hashed_key, input, workspace);
++
++ for (i = 0; i < 5; i++)
++ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
++
++ if (token)
++ *token = mptcp_hashed_key[0];
++ if (idsn)
++ *idsn = *((u64 *)&mptcp_hashed_key[3]);
++}
++
++void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
++ u32 *hash_out)
++{
++ u32 workspace[SHA_WORKSPACE_WORDS];
++ u8 input[128]; /* 2 512-bit blocks */
++ int i;
++
++ memset(workspace, 0, sizeof(workspace));
++
++ /* Generate key xored with ipad */
++ memset(input, 0x36, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], rand_1, 4);
++ memcpy(&input[68], rand_2, 4);
++ input[72] = 0x80; /* Padding: First bit after message = 1 */
++ memset(&input[73], 0, 53);
++
++ /* Padding: Length of the message = 512 + 64 bits */
++ input[126] = 0x02;
++ input[127] = 0x40;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++
++ /* Prepare second part of hmac */
++ memset(input, 0x5C, 64);
++ for (i = 0; i < 8; i++)
++ input[i] ^= key_1[i];
++ for (i = 0; i < 8; i++)
++ input[i + 8] ^= key_2[i];
++
++ memcpy(&input[64], hash_out, 20);
++ input[84] = 0x80;
++ memset(&input[85], 0, 41);
++
++ /* Padding: Length of the message = 512 + 160 bits */
++ input[126] = 0x02;
++ input[127] = 0xA0;
++
++ sha_init(hash_out);
++ sha_transform(hash_out, input, workspace);
++ memset(workspace, 0, sizeof(workspace));
++
++ sha_transform(hash_out, &input[64], workspace);
++
++ for (i = 0; i < 5; i++)
++ hash_out[i] = cpu_to_be32(hash_out[i]);
++}
++
++static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
++{
++ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
++ * ======
++ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
++ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
++ * TCP_NODELAY, TCP_CORK
++ *
++ * Socket-options handled in this function here
++ * ======
++ * TCP_DEFER_ACCEPT
++ * SO_KEEPALIVE
++ *
++ * Socket-options on the todo-list
++ * ======
++ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
++ * across other devices. - what about the api-draft?
++ * SO_DEBUG
++ * SO_REUSEADDR - probably we don't care about this
++ * SO_DONTROUTE, SO_BROADCAST
++ * SO_OOBINLINE
++ * SO_LINGER
++ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
++ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
++ * SO_RXQ_OVFL
++ * TCP_COOKIE_TRANSACTIONS
++ * TCP_MAXSEG
++ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
++ * in mptcp_retransmit_timer. AND we need to check what is
++ * about the subsockets.
++ * TCP_LINGER2
++ * TCP_WINDOW_CLAMP
++ * TCP_USER_TIMEOUT
++ * TCP_MD5SIG
++ *
++ * Socket-options of no concern for the meta-socket (but for the subsocket)
++ * ======
++ * SO_PRIORITY
++ * SO_MARK
++ * TCP_CONGESTION
++ * TCP_SYNCNT
++ * TCP_QUICKACK
++ */
++
++ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
++ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ keepalive_time_when(tcp_sk(meta_sk)));
++ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(master_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(master_sk)->recverr = 0;
++}
++
++static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
++{
++ /* IP_TOS also goes to the subflow. */
++ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
++ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
++ sub_sk->sk_priority = meta_sk->sk_priority;
++ sk_dst_reset(sub_sk);
++ }
++
++ /* Inherit SO_REUSEADDR */
++ sub_sk->sk_reuse = meta_sk->sk_reuse;
++
++ /* Inherit snd/rcv-buffer locks */
++ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
++
++ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
++ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
++
++ /* Keepalives are handled entirely at the MPTCP-layer */
++ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
++ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
++ inet_csk_delete_keepalive_timer(sub_sk);
++ }
++
++ /* Do not propagate subflow-errors up to the MPTCP-layer */
++ inet_sk(sub_sk)->recverr = 0;
++}
++
++int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ /* skb-sk may be NULL if we receive a packet immediately after the
++ * SYN/ACK + MP_CAPABLE.
++ */
++ struct sock *sk = skb->sk ? skb->sk : meta_sk;
++ int ret = 0;
++
++ skb->sk = NULL;
++
++ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
++ kfree_skb(skb);
++ return 0;
++ }
++
++ if (sk->sk_family == AF_INET)
++ ret = tcp_v4_do_rcv(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ ret = tcp_v6_do_rcv(sk, skb);
++#endif
++
++ sock_put(sk);
++ return ret;
++}
++
++struct lock_class_key meta_key;
++struct lock_class_key meta_slock_key;
++
++static void mptcp_synack_timer_handler(unsigned long data)
++{
++ struct sock *meta_sk = (struct sock *) data;
++ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
++
++ /* Only process if socket is not in use. */
++ bh_lock_sock(meta_sk);
++
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later. */
++ mptcp_reset_synack_timer(meta_sk, HZ/20);
++ goto out;
++ }
++
++ /* May happen if the queue got destructed in mptcp_close */
++ if (!lopt)
++ goto out;
++
++ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
++ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
++
++ if (lopt->qlen)
++ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
++
++out:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++}
++
++static const struct tcp_sock_ops mptcp_meta_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = mptcp_send_fin,
++ .write_xmit = mptcp_write_xmit,
++ .send_active_reset = mptcp_send_active_reset,
++ .write_wakeup = mptcp_write_wakeup,
++ .prune_ofo_queue = mptcp_prune_ofo_queue,
++ .retransmit_timer = mptcp_retransmit_timer,
++ .time_wait = mptcp_time_wait,
++ .cleanup_rbuf = mptcp_cleanup_rbuf,
++};
++
++static const struct tcp_sock_ops mptcp_sub_specific = {
++ .__select_window = __mptcp_select_window,
++ .select_window = mptcp_select_window,
++ .select_initial_window = mptcp_select_initial_window,
++ .init_buffer_space = mptcp_init_buffer_space,
++ .set_rto = mptcp_tcp_set_rto,
++ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
++ .init_congestion_control = mptcp_init_congestion_control,
++ .send_fin = tcp_send_fin,
++ .write_xmit = tcp_write_xmit,
++ .send_active_reset = tcp_send_active_reset,
++ .write_wakeup = tcp_write_wakeup,
++ .prune_ofo_queue = tcp_prune_ofo_queue,
++ .retransmit_timer = tcp_retransmit_timer,
++ .time_wait = tcp_time_wait,
++ .cleanup_rbuf = tcp_cleanup_rbuf,
++};
++
++static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct mptcp_cb *mpcb;
++ struct sock *master_sk;
++ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
++ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
++ u64 idsn;
++
++ dst_release(meta_sk->sk_rx_dst);
++ meta_sk->sk_rx_dst = NULL;
++ /* This flag is set to tell sock_lock_init to
++ * reclassify the lock-class of the master socket.
++ */
++ meta_tp->is_master_sk = 1;
++ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
++ meta_tp->is_master_sk = 0;
++ if (!master_sk)
++ return -ENOBUFS;
++
++ master_tp = tcp_sk(master_sk);
++ master_icsk = inet_csk(master_sk);
++
++ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
++ if (!mpcb) {
++ /* sk_free (and __sk_free) requires wmem_alloc to be 1.
++ * All the rest is set to 0 thanks to __GFP_ZERO above.
++ */
++ atomic_set(&master_sk->sk_wmem_alloc, 1);
++ sk_free(master_sk);
++ return -ENOBUFS;
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->ipv6_mc_list = NULL;
++ newnp->ipv6_ac_list = NULL;
++ newnp->ipv6_fl_list = NULL;
++ newnp->opt = NULL;
++ newnp->pktoptions = NULL;
++ (void)xchg(&newnp->rxpmtu, NULL);
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
++
++ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
++
++ newnp = inet6_sk(master_sk);
++ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
++
++ newnp->hop_limit = -1;
++ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
++ newnp->mc_loop = 1;
++ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
++ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
++ }
++#endif
++
++ meta_tp->mptcp = NULL;
++
++ /* Store the keys and generate the peer's token */
++ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
++ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
++
++ /* Generate Initial data-sequence-numbers */
++ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->snd_high_order[0] = idsn >> 32;
++ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
++
++ meta_tp->write_seq = (u32)idsn;
++ meta_tp->snd_sml = meta_tp->write_seq;
++ meta_tp->snd_una = meta_tp->write_seq;
++ meta_tp->snd_nxt = meta_tp->write_seq;
++ meta_tp->pushed_seq = meta_tp->write_seq;
++ meta_tp->snd_up = meta_tp->write_seq;
++
++ mpcb->mptcp_rem_key = remote_key;
++ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
++ idsn = ntohll(idsn) + 1;
++ mpcb->rcv_high_order[0] = idsn >> 32;
++ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
++ meta_tp->copied_seq = (u32) idsn;
++ meta_tp->rcv_nxt = (u32) idsn;
++ meta_tp->rcv_wup = (u32) idsn;
++
++ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
++ meta_tp->snd_wnd = window;
++ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
++
++ meta_tp->packets_out = 0;
++ meta_icsk->icsk_probes_out = 0;
++
++ /* Set mptcp-pointers */
++ master_tp->mpcb = mpcb;
++ master_tp->meta_sk = meta_sk;
++ meta_tp->mpcb = mpcb;
++ meta_tp->meta_sk = meta_sk;
++ mpcb->meta_sk = meta_sk;
++ mpcb->master_sk = master_sk;
++
++ meta_tp->was_meta_sk = 0;
++
++ /* Initialize the queues */
++ skb_queue_head_init(&mpcb->reinject_queue);
++ skb_queue_head_init(&master_tp->out_of_order_queue);
++ tcp_prequeue_init(master_tp);
++ INIT_LIST_HEAD(&master_tp->tsq_node);
++
++ master_tp->tsq_flags = 0;
++
++ mutex_init(&mpcb->mpcb_mutex);
++
++ /* Init the accept_queue structure. We support a queue of 32 pending
++ * connections; it does not need to be huge, since we only store
++ * pending subflow creations here.
++ */
++ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
++ inet_put_port(master_sk);
++ kmem_cache_free(mptcp_cb_cache, mpcb);
++ sk_free(master_sk);
++ return -ENOMEM;
++ }
++
++ /* Redefine function-pointers as the meta-sk is now fully ready */
++ static_key_slow_inc(&mptcp_static_key);
++ meta_tp->mpc = 1;
++ meta_tp->ops = &mptcp_meta_specific;
++
++ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
++ meta_sk->sk_destruct = mptcp_sock_destruct;
++
++ /* Meta-level retransmit timer */
++ meta_icsk->icsk_rto *= 2; /* Double the initial RTO */
++
++ tcp_init_xmit_timers(master_sk);
++ /* Has been set for sending out the SYN */
++ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
++
++ if (!meta_tp->inside_tk_table) {
++ /* Adding the meta_tp in the token hashtable - coming from server-side */
++ rcu_read_lock();
++ spin_lock(&mptcp_tk_hashlock);
++
++ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
++
++ spin_unlock(&mptcp_tk_hashlock);
++ rcu_read_unlock();
++ }
++ master_tp->inside_tk_table = 0;
++
++ /* Init time-wait stuff */
++ INIT_LIST_HEAD(&mpcb->tw_list);
++ spin_lock_init(&mpcb->tw_lock);
++
++ INIT_HLIST_HEAD(&mpcb->callback_list);
++
++ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
++
++ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
++ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
++ mpcb->orig_window_clamp = meta_tp->window_clamp;
++
++ /* The meta is directly linked - set refcnt to 1 */
++ atomic_set(&mpcb->mpcb_refcnt, 1);
++
++ mptcp_init_path_manager(mpcb);
++ mptcp_init_scheduler(mpcb);
++
++ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
++ (unsigned long)meta_sk);
++
++ mptcp_debug("%s: created mpcb with token %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ return 0;
++}
++
++void mptcp_fallback_meta_sk(struct sock *meta_sk)
++{
++ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
++ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
++}
++
++int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
++ gfp_t flags)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
++ if (!tp->mptcp)
++ return -ENOMEM;
++
++ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
++ /* No more space for more subflows? */
++ if (!tp->mptcp->path_index) {
++ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
++ return -EPERM;
++ }
++
++ INIT_HLIST_NODE(&tp->mptcp->cb_list);
++
++ tp->mptcp->tp = tp;
++ tp->mpcb = mpcb;
++ tp->meta_sk = meta_sk;
++
++ static_key_slow_inc(&mptcp_static_key);
++ tp->mpc = 1;
++ tp->ops = &mptcp_sub_specific;
++
++ tp->mptcp->loc_id = loc_id;
++ tp->mptcp->rem_id = rem_id;
++ if (mpcb->sched_ops->init)
++ mpcb->sched_ops->init(sk);
++
++ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
++ * included in mptcp_del_sock(), because the mpcb must remain alive
++ * until the last subsocket is completely destroyed.
++ */
++ sock_hold(meta_sk);
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tp->mptcp->next = mpcb->connection_list;
++ mpcb->connection_list = tp;
++ tp->mptcp->attached = 1;
++
++ mpcb->cnt_subflows++;
++ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
++ &meta_sk->sk_rmem_alloc);
++
++ mptcp_sub_inherit_sockopts(meta_sk, sk);
++ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
++
++ /* As we successfully allocated the mptcp_tcp_sock, we have to
++ * change the function-pointers here (for sk_destruct to work correctly)
++ */
++ sk->sk_error_report = mptcp_sock_def_error_report;
++ sk->sk_data_ready = mptcp_data_ready;
++ sk->sk_write_space = mptcp_write_space;
++ sk->sk_state_change = mptcp_set_state;
++ sk->sk_destruct = mptcp_sock_destruct;
++
++ if (sk->sk_family == AF_INET)
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index,
++ &((struct inet_sock *)tp)->inet_saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &((struct inet_sock *)tp)->inet_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
++ __func__ , mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
++ ntohs(((struct inet_sock *)tp)->inet_sport),
++ &sk->sk_v6_daddr,
++ ntohs(((struct inet_sock *)tp)->inet_dport),
++ mpcb->cnt_subflows);
++#endif
++
++ return 0;
++}
++
++void mptcp_del_sock(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
++ struct mptcp_cb *mpcb;
++
++ if (!tp->mptcp || !tp->mptcp->attached)
++ return;
++
++ mpcb = tp->mpcb;
++ tp_prev = mpcb->connection_list;
++
++ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
++ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
++ sk->sk_state, is_meta_sk(sk));
++
++ if (tp_prev == tp) {
++ mpcb->connection_list = tp->mptcp->next;
++ } else {
++ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
++ if (tp_prev->mptcp->next == tp) {
++ tp_prev->mptcp->next = tp->mptcp->next;
++ break;
++ }
++ }
++ }
++ mpcb->cnt_subflows--;
++ if (tp->mptcp->establish_increased)
++ mpcb->cnt_established--;
++
++ tp->mptcp->next = NULL;
++ tp->mptcp->attached = 0;
++ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
++
++ if (!skb_queue_empty(&sk->sk_write_queue))
++ mptcp_reinject_data(sk, 0);
++
++ if (is_master_tp(tp))
++ mpcb->master_sk = NULL;
++ else if (tp->mptcp->pre_established)
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++
++ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
++}
++
++/* Updates the metasocket ULID/port data, based on the given sock.
++ * The argument sock must be the sock accessible to the application.
++ * In this function, we update the meta socket info, based on the changes
++ * in the application socket (bind, address allocation, ...)
++ */
++void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
++{
++ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
++ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
++
++ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
++}
++
++/* Clean up the receive buffer for full frames taken by the user,
++ * then send an ACK if necessary. COPIED is the number of bytes
++ * tcp_recvmsg has given to the user so far, it speeds up the
++ * calculation of whether or not we must ACK for the sake of
++ * a window update.
++ */
++void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk;
++ __u32 rcv_window_now = 0;
++
++ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
++ rcv_window_now = tcp_receive_window(meta_tp);
++
++ if (2 * rcv_window_now > meta_tp->window_clamp)
++ rcv_window_now = 0;
++ }
++
++ mptcp_for_each_sk(meta_tp->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (!mptcp_sk_can_send_ack(sk))
++ continue;
++
++ if (!inet_csk_ack_scheduled(sk))
++ goto second_part;
++ /* Delayed ACKs frequently hit locked sockets during bulk
++ * receive.
++ */
++ if (icsk->icsk_ack.blocked ||
++ /* Once-per-two-segments ACK was not sent by tcp_input.c */
++ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
++ /* If this read emptied read buffer, we send ACK, if
++ * connection is not bidirectional, user drained
++ * receive buffer and there was a small segment
++ * in queue.
++ */
++ (copied > 0 &&
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
++ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
++ !icsk->icsk_ack.pingpong)) &&
++ !atomic_read(&meta_sk->sk_rmem_alloc))) {
++ tcp_send_ack(sk);
++ continue;
++ }
++
++second_part:
++ /* This here is the second part of tcp_cleanup_rbuf */
++ if (rcv_window_now) {
++ __u32 new_window = tp->ops->__select_window(sk);
++
++ /* Send ACK now, if this read freed lots of space
++ * in our buffer. Certainly, new_window is new window.
++ * We can advertise it now, if it is not less than
++ * current one.
++ * "Lots" means "at least twice" here.
++ */
++ if (new_window && new_window >= 2 * rcv_window_now)
++ tcp_send_ack(sk);
++ }
++ }
++}
++
++static int mptcp_sub_send_fin(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *skb = tcp_write_queue_tail(sk);
++ int mss_now;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = tcp_current_mss(sk);
++
++ if (tcp_send_head(sk) != NULL) {
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ tp->write_seq++;
++ } else {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (!skb)
++ return 1;
++
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
++ tcp_init_nondata_skb(skb, tp->write_seq,
++ TCPHDR_ACK | TCPHDR_FIN);
++ tcp_queue_skb(sk, skb);
++ }
++ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
++
++ return 0;
++}
++
++void mptcp_sub_close_wq(struct work_struct *work)
++{
++ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
++ struct sock *sk = (struct sock *)tp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ mutex_lock(&tp->mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ if (sock_flag(sk, SOCK_DEAD))
++ goto exit;
++
++ /* We come from tcp_disconnect. We are sure that meta_sk is set */
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ goto exit;
++ }
++
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&tp->mpcb->mpcb_mutex);
++ sock_put(sk);
++}
++
++void mptcp_sub_close(struct sock *sk, unsigned long delay)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
++
++ /* We are already closing - e.g., call from sock_def_error_report upon
++ * tcp_disconnect in tcp_close.
++ */
++ if (tp->closing)
++ return;
++
++ /* Work already scheduled? */
++ if (work_pending(&work->work)) {
++ /* Work present - who will be first? */
++ if (jiffies + delay > work->timer.expires)
++ return;
++
++ /* Try canceling - if it fails, work will be executed soon */
++ if (!cancel_delayed_work(work))
++ return;
++ sock_put(sk);
++ }
++
++ if (!delay) {
++ unsigned char old_state = sk->sk_state;
++
++ /* If we are in user-context we can directly do the closing
++ * procedure. No need to schedule a work-queue.
++ */
++ if (!in_softirq()) {
++ if (sock_flag(sk, SOCK_DEAD))
++ return;
++
++ if (!mptcp(tp)) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ return;
++ }
++
++ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
++ sk->sk_state == TCP_CLOSE) {
++ tp->closing = 1;
++ sock_rps_reset_flow(sk);
++ tcp_close(sk, 0);
++ } else if (tcp_close_state(sk)) {
++ sk->sk_shutdown |= SEND_SHUTDOWN;
++ tcp_send_fin(sk);
++ }
++
++ return;
++ }
++
++ /* We send the FIN directly, because it may take a long time
++ * until the work-queue gets scheduled...
++ *
++ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
++ * the old state so that tcp_close will finally send the fin
++ * in user-context.
++ */
++ if (!sk->sk_err && old_state != TCP_CLOSE &&
++ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
++ if (old_state == TCP_ESTABLISHED)
++ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
++ sk->sk_state = old_state;
++ }
++ }
++
++ sock_hold(sk);
++ queue_delayed_work(mptcp_wq, work, delay);
++}
++
++void mptcp_sub_force_close(struct sock *sk)
++{
++ /* The below tcp_done may have freed the socket, if it is already dead.
++ * Thus, we are not allowed to access it afterwards. That's why
++ * we have to store the dead-state in this local variable.
++ */
++ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
++
++ tcp_sk(sk)->mp_killed = 1;
++
++ if (sk->sk_state != TCP_CLOSE)
++ tcp_done(sk);
++
++ if (!sock_is_dead)
++ mptcp_sub_close(sk, 0);
++}
++EXPORT_SYMBOL(mptcp_sub_force_close);
++
++/* Update the mpcb send window, based on the contributions
++ * of each subflow
++ */
++void mptcp_update_sndbuf(const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk, *sk;
++ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ new_sndbuf += sk->sk_sndbuf;
++
++ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
++ new_sndbuf = sysctl_tcp_wmem[2];
++ break;
++ }
++ }
++ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
++
++ /* The subflow's call to sk_write_space in tcp_new_space ends up in
++ * mptcp_write_space.
++ * It has nothing to do with waking up the application.
++ * So, we do it here.
++ */
++ if (old_sndbuf != meta_sk->sk_sndbuf)
++ meta_sk->sk_write_space(meta_sk);
++}
++
++void mptcp_close(struct sock *meta_sk, long timeout)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *sk_it, *tmpsk;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ int data_was_unread = 0;
++ int state;
++
++ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
++ __func__, mpcb->mptcp_loc_token);
++
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock(meta_sk);
++
++ if (meta_tp->inside_tk_table) {
++ /* Detach the mpcb from the token hashtable */
++ mptcp_hash_remove_bh(meta_tp);
++ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
++ }
++
++ meta_sk->sk_shutdown = SHUTDOWN_MASK;
++ /* We need to flush the recv. buffs. We do this only on the
++ * descriptor close, not protocol-sourced closes, because the
++ * reader process may not have drained the data yet!
++ */
++ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
++ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
++ tcp_hdr(skb)->fin;
++ data_was_unread += len;
++ __kfree_skb(skb);
++ }
++
++ sk_mem_reclaim(meta_sk);
++
++ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
++ if (meta_sk->sk_state == TCP_CLOSE) {
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++ mptcp_sub_close(sk_it, 0);
++ }
++ goto adjudge_to_death;
++ }
++
++ if (data_was_unread) {
++ /* Unread data was tossed, zap the connection. */
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
++ meta_sk->sk_allocation);
++ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
++ /* Check zero linger _after_ checking for unread data. */
++ meta_sk->sk_prot->disconnect(meta_sk, 0);
++ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ } else if (tcp_close_state(meta_sk)) {
++ mptcp_send_fin(meta_sk);
++ } else if (meta_tp->snd_una == meta_tp->write_seq) {
++ /* The DATA_FIN has been sent and acknowledged
++ * (e.g., by sk_shutdown). Close all the other subflows
++ */
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ unsigned long delay = 0;
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++
++ sk_stream_wait_close(meta_sk, timeout);
++
++adjudge_to_death:
++ state = meta_sk->sk_state;
++ sock_hold(meta_sk);
++ sock_orphan(meta_sk);
++
++ /* socket will be freed after mptcp_close - we have to prevent
++ * access from the subflows.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ /* Similar to sock_orphan, but we don't set it DEAD, because
++ * the callbacks are still set and must be called.
++ */
++ write_lock_bh(&sk_it->sk_callback_lock);
++ sk_set_socket(sk_it, NULL);
++ sk_it->sk_wq = NULL;
++ write_unlock_bh(&sk_it->sk_callback_lock);
++ }
++
++ /* It is the last release_sock in its life. It will remove backlog. */
++ release_sock(meta_sk);
++
++ /* Now socket is owned by kernel and we acquire BH lock
++ * to finish close. No need to check for user refs.
++ */
++ local_bh_disable();
++ bh_lock_sock(meta_sk);
++ WARN_ON(sock_owned_by_user(meta_sk));
++
++ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
++
++ /* Have we already been destroyed by a softirq or backlog? */
++ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
++ goto out;
++
++ /* This is a (useful) BSD violating of the RFC. There is a
++ * problem with TCP as specified in that the other end could
++ * keep a socket open forever with no application left this end.
++ * We use a 3 minute timeout (about the same as BSD) then kill
++ * our end. If they send after that then tough - BUT: long enough
++ * that we won't make the old 4*rto = almost no time - whoops
++ * reset mistake.
++ *
++ * Nope, it was not mistake. It is really desired behaviour
++ * f.e. on http servers, when such sockets are useless, but
++ * consume significant resources. Let's do it with special
++ * linger2 option. --ANK
++ */
++
++ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
++ if (meta_tp->linger2 < 0) {
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONLINGER);
++ } else {
++ const int tmo = tcp_fin_time(meta_sk);
++
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk,
++ tmo - TCP_TIMEWAIT_LEN);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
++ tmo);
++ goto out;
++ }
++ }
++ }
++ if (meta_sk->sk_state != TCP_CLOSE) {
++ sk_mem_reclaim(meta_sk);
++ if (tcp_too_many_orphans(meta_sk, 0)) {
++ if (net_ratelimit())
++ pr_info("MPTCP: too many orphaned sockets\n");
++ tcp_set_state(meta_sk, TCP_CLOSE);
++ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPABORTONMEMORY);
++ }
++ }
++
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ inet_csk_destroy_sock(meta_sk);
++ /* Otherwise, socket is reprieved until protocol close. */
++
++out:
++ bh_unlock_sock(meta_sk);
++ local_bh_enable();
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk); /* Taken by sock_hold */
++}
++
++void mptcp_disconnect(struct sock *sk)
++{
++ struct sock *subsk, *tmpsk;
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ mptcp_delete_synack_timer(sk);
++
++ __skb_queue_purge(&tp->mpcb->reinject_queue);
++
++ if (tp->inside_tk_table) {
++ mptcp_hash_remove_bh(tp);
++ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
++ }
++
++ local_bh_disable();
++ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
++ /* The socket will get removed from the subsocket-list
++ * and made non-mptcp by setting mpc to 0.
++ *
++ * This is necessary, because tcp_disconnect assumes
++ * that the connection is completely dead afterwards.
++ * Thus we need to do a mptcp_del_sock. Due to this call
++ * we have to make it non-mptcp.
++ *
++ * We have to lock the socket, because we set mpc to 0.
++ * An incoming packet would take the subsocket's lock
++ * and go on into the receive-path.
++ * This would be a race.
++ */
++
++ bh_lock_sock(subsk);
++ mptcp_del_sock(subsk);
++ tcp_sk(subsk)->mpc = 0;
++ tcp_sk(subsk)->ops = &tcp_specific;
++ mptcp_sub_force_close(subsk);
++ bh_unlock_sock(subsk);
++ }
++ local_bh_enable();
++
++ tp->was_meta_sk = 1;
++ tp->mpc = 0;
++ tp->ops = &tcp_specific;
++}
++
++
++/* Returns 1 if we should enable MPTCP for that socket. */
++int mptcp_doit(struct sock *sk)
++{
++ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
++ return 0;
++
++ /* Socket may already be established (e.g., called from tcp_recvmsg) */
++ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
++ return 1;
++
++ /* Don't do mptcp over loopback */
++ if (sk->sk_family == AF_INET &&
++ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
++ return 0;
++#if IS_ENABLED(CONFIG_IPV6)
++ if (sk->sk_family == AF_INET6 &&
++ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
++ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
++ return 0;
++#endif
++ if (mptcp_v6_is_v4_mapped(sk) &&
++ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
++ return 0;
++
++#ifdef CONFIG_TCP_MD5SIG
++ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
++ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
++ return 0;
++#endif
++
++ return 1;
++}
++
++int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
++{
++ struct tcp_sock *master_tp;
++ struct sock *master_sk;
++
++ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
++ goto err_alloc_mpcb;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++ master_tp = tcp_sk(master_sk);
++
++ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
++ goto err_add_sock;
++
++ if (__inet_inherit_port(meta_sk, master_sk) < 0)
++ goto err_add_sock;
++
++ meta_sk->sk_prot->unhash(meta_sk);
++
++ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
++ __inet_hash_nolisten(master_sk, NULL);
++#if IS_ENABLED(CONFIG_IPV6)
++ else
++ __inet6_hash(master_sk, NULL);
++#endif
++
++ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
++
++ return 0;
++
++err_add_sock:
++ mptcp_fallback_meta_sk(meta_sk);
++
++ inet_csk_prepare_forced_close(master_sk);
++ tcp_done(master_sk);
++ inet_csk_prepare_forced_close(meta_sk);
++ tcp_done(meta_sk);
++
++err_alloc_mpcb:
++ return -ENOBUFS;
++}
++
++static int __mptcp_check_req_master(struct sock *child,
++ struct request_sock *req)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct sock *meta_sk = child;
++ struct mptcp_cb *mpcb;
++ struct mptcp_request_sock *mtreq;
++
++ /* Never contained an MP_CAPABLE */
++ if (!inet_rsk(req)->mptcp_rqsk)
++ return 1;
++
++ if (!inet_rsk(req)->saw_mpc) {
++ /* Fallback to regular TCP, because we saw one SYN without
++ * MP_CAPABLE. In tcp_check_req we continue the regular path.
++ * But, the socket has been added to the reqsk_tk_htb, so we
++ * must still remove it.
++ */
++ mptcp_reqsk_remove_tk(req);
++ return 1;
++ }
++
++ /* Just set these values to pass them to mptcp_alloc_mpcb */
++ mtreq = mptcp_rsk(req);
++ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
++ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
++
++ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
++ child_tp->snd_wnd))
++ return -ENOBUFS;
++
++ child = tcp_sk(child)->mpcb->master_sk;
++ child_tp = tcp_sk(child);
++ mpcb = child_tp->mpcb;
++
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++
++ mpcb->dss_csum = mtreq->dss_csum;
++ mpcb->server_side = 1;
++
++ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
++ mptcp_update_metasocket(child, meta_sk);
++
++ /* Needs to be done here additionally, because when accepting a
++ * new connection we pass by __reqsk_free and not reqsk_free.
++ */
++ mptcp_reqsk_remove_tk(req);
++
++ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
++ sock_put(meta_sk);
++
++ return 0;
++}
++
++int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
++{
++ struct sock *meta_sk = child, *master_sk;
++ struct sk_buff *skb;
++ u32 new_mapping;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
++
++ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
++ * pre-MPTCP data in the receive queue.
++ */
++ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
++ tcp_rsk(req)->rcv_isn - 1;
++
++ /* Map subflow sequence number to data sequence numbers. We need to map
++ * these data to [IDSN - len - 1, IDSN[.
++ */
++ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
++
++ /* There should be only one skb: the SYN + data. */
++ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* With fastopen we change the semantics of the relative subflow
++ * sequence numbers to deal with middleboxes that could add/remove
++ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
++ * instead of the regular TCP ISN.
++ */
++ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
++
++ /* We need to update copied_seq of the master_sk to account for the
++ * already moved data to the meta receive queue.
++ */
++ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
++
++ /* Handled by the master_sk */
++ tcp_sk(meta_sk)->fastopen_rsk = NULL;
++
++ return 0;
++}
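The copied_seq rewind and new_mapping arithmetic above can be restated as a small userspace sketch (hypothetical helper name; plain `uint32_t` wraparound stands in for TCP sequence arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the fastopen remapping in mptcp_check_req_fastopen() above:
 * the SYN+data skb carries subflow sequence numbers relative to rcv_isn,
 * while the meta-socket expects data sequence numbers ending right at
 * copied_seq. All arithmetic wraps mod 2^32, like TCP sequence numbers.
 */
static uint32_t remap_seq(uint32_t skb_seq, uint32_t copied_seq,
                          uint32_t rcv_isn)
{
	uint32_t new_mapping = copied_seq - rcv_isn - 1;

	return skb_seq + new_mapping;
}
```

The first data byte after the SYN (subflow sequence rcv_isn + 1) thus maps exactly onto copied_seq, so the pre-MPTCP data lines up with [IDSN - len - 1, IDSN[.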
++
++int mptcp_check_req_master(struct sock *sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev)
++{
++ struct sock *meta_sk = child;
++ int ret;
++
++ ret = __mptcp_check_req_master(child, req);
++ if (ret)
++ return ret;
++
++ inet_csk_reqsk_queue_unlink(sk, req, prev);
++ inet_csk_reqsk_queue_removed(sk, req);
++ inet_csk_reqsk_queue_add(sk, req, meta_sk);
++
++ return 0;
++}
++
++struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
++ struct request_sock *req,
++ struct request_sock **prev,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *child_tp = tcp_sk(child);
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ u8 hash_mac_check[20];
++
++ child_tp->inside_tk_table = 0;
++
++ if (!mopt->join_ack)
++ goto teardown;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mtreq->mptcp_rem_nonce,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++
++ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
++ goto teardown;
++
++ /* Point it to the same struct socket and wq as the meta_sk */
++ sk_set_socket(child, meta_sk->sk_socket);
++ child->sk_wq = meta_sk->sk_wq;
++
++ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
++ /* Has been inherited, but now child_tp->mptcp is NULL */
++ child_tp->mpc = 0;
++ child_tp->ops = &tcp_specific;
++
++ /* TODO when we support acking the third ack for new subflows,
++ * we should silently discard this third ack, by returning NULL.
++ *
++ * Maybe, at the retransmission we will have enough memory to
++ * fully add the socket to the meta-sk.
++ */
++ goto teardown;
++ }
++
++ /* The child is a clone of the meta socket, we must now reset
++ * some of the fields
++ */
++ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
++
++ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
++ * use the original values instead of the bloated up ones from the
++ * clone.
++ */
++ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
++ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
++
++ child_tp->mptcp->slave_sk = 1;
++ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
++ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
++ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
++
++ child_tp->tsq_flags = 0;
++
++ /* Subflows do not use the accept queue, as they
++ * are attached immediately to the mpcb.
++ */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ return child;
++
++teardown:
++ /* Drop this request - sock creation failed. */
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
++ reqsk_free(req);
++ inet_csk_prepare_forced_close(child);
++ tcp_done(child);
++ return meta_sk;
++}
++
++int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
++{
++ struct mptcp_tw *mptw;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ /* A subsocket in tw can only receive data. So, if we are in
++ * infinite-receive, then we should not reply with a data-ack or act
++ * upon general MPTCP-signaling. We prevent this by simply not creating
++ * the mptcp_tw_sock.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tw->mptcp_tw = NULL;
++ return 0;
++ }
++
++ /* Alloc MPTCP-tw-sock */
++ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
++ if (!mptw)
++ return -ENOBUFS;
++
++ atomic_inc(&mpcb->mpcb_refcnt);
++
++ tw->mptcp_tw = mptw;
++ mptw->loc_key = mpcb->mptcp_loc_key;
++ mptw->meta_tw = mpcb->in_time_wait;
++ if (mptw->meta_tw) {
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
++ if (mpcb->mptw_state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_assign_pointer(mptw->mpcb, mpcb);
++
++ spin_lock(&mpcb->tw_lock);
++ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
++ mptw->in_list = 1;
++ spin_unlock(&mpcb->tw_lock);
++
++ return 0;
++}
++
++void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
++{
++ struct mptcp_cb *mpcb;
++
++ rcu_read_lock();
++ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
++
++ /* If we are still holding a ref to the mpcb, we have to remove ourselves
++ * from the list and drop the ref properly.
++ */
++ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
++ spin_lock(&mpcb->tw_lock);
++ if (tw->mptcp_tw->in_list) {
++ list_del_rcu(&tw->mptcp_tw->list);
++ tw->mptcp_tw->in_list = 0;
++ }
++ spin_unlock(&mpcb->tw_lock);
++
++ /* Twice, because we increased it above */
++ mptcp_mpcb_put(mpcb);
++ mptcp_mpcb_put(mpcb);
++ }
++
++ rcu_read_unlock();
++
++ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
++}
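The `atomic_inc_not_zero()` idiom used by mptcp_twsk_destructor() above can be sketched in userspace with C11 atomics (the kernel primitive works the same way conceptually; the demo helpers are illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a new reference only while the refcount is still non-zero, so an
 * object whose final put already ran can never be resurrected.
 */
static bool get_not_zero(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;	/* reference taken */
		/* the failed CAS reloaded old; loop and retry */
	}
	return false;			/* object is already dying */
}

/* Tiny demos: a live object yields a new ref, a dead one is refused. */
static int demo_live(void)
{
	atomic_int r = 2;

	return get_not_zero(&r) ? atomic_load(&r) : -1;
}

static int demo_dead(void)
{
	atomic_int r = 0;

	return get_not_zero(&r) ? atomic_load(&r) : -1;
}
```

This is why the destructor then does two `mptcp_mpcb_put()` calls: one drops the reference it just took, the other drops the time-wait sock's own reference.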
++
++/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
++ * data-fin.
++ */
++void mptcp_time_wait(struct sock *sk, int state, int timeo)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_tw *mptw;
++
++ /* Used for sockets that go into tw after the meta
++ * (see mptcp_init_tw_sock())
++ */
++ tp->mpcb->in_time_wait = 1;
++ tp->mpcb->mptw_state = state;
++
++ /* Update the time-wait-sock's information */
++ rcu_read_lock_bh();
++ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
++ mptw->meta_tw = 1;
++ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
++
++ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
++ * pretend as if the DATA_FIN has already reached us, so that
++ * the checks in tcp_timewait_state_process will pass when the
++ * DATA_FIN comes in.
++ */
++ if (state != TCP_TIME_WAIT)
++ mptw->rcv_nxt++;
++ }
++ rcu_read_unlock_bh();
++
++ tcp_done(sk);
++}
++
++void mptcp_tsq_flags(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ /* It will be handled as a regular deferred-call */
++ if (is_meta_sk(sk))
++ return;
++
++ if (hlist_unhashed(&tp->mptcp->cb_list)) {
++ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
++ /* We need to hold it here, as the sock_hold is not assured
++ * by the release_sock as it is done in regular TCP.
++ *
++ * The subsocket may get inet_csk_destroy'd while it is inside
++ * the callback_list.
++ */
++ sock_hold(sk);
++ }
++
++ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
++ sock_hold(meta_sk);
++}
++
++void mptcp_tsq_sub_deferred(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_tcp_sock *mptcp;
++ struct hlist_node *tmp;
++
++ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
++
++ __sock_put(meta_sk);
++ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
++ struct tcp_sock *tp = mptcp->tp;
++ struct sock *sk = (struct sock *)tp;
++
++ hlist_del_init(&mptcp->cb_list);
++ sk->sk_prot->release_cb(sk);
++ /* Final sock_put (cfr. mptcp_tsq_flags()) */
++ sock_put(sk);
++ }
++}
++
++void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_options_received mopt;
++ u8 mptcp_hash_mac[20];
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mtreq = mptcp_rsk(req);
++ mtreq->mptcp_mpcb = mpcb;
++ mtreq->is_sub = 1;
++ inet_rsk(req)->mptcp_rqsk = 1;
++
++ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mtreq->mptcp_loc_nonce,
++ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
++ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
++
++ mtreq->rem_id = mopt.rem_id;
++ mtreq->rcv_low_prio = mopt.low_prio;
++ inet_rsk(req)->saw_mpc = 1;
++}
++
++void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ struct mptcp_request_sock *mreq = mptcp_rsk(req);
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ mreq->is_sub = 0;
++ inet_rsk(req)->mptcp_rqsk = 1;
++ mreq->dss_csum = mopt.dss_csum;
++ mreq->hash_entry.pprev = NULL;
++
++ mptcp_reqsk_new_mptcp(req, &mopt, skb);
++}
++
++int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
++{
++ struct mptcp_options_received mopt;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ __u32 isn = TCP_SKB_CB(skb)->when;
++ bool want_cookie = false;
++
++ if ((sysctl_tcp_syncookies == 2 ||
++ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
++ want_cookie = tcp_syn_flood_action(sk, skb,
++ mptcp_request_sock_ops.slab_name);
++ if (!want_cookie)
++ goto drop;
++ }
++
++ mptcp_init_mp_opt(&mopt);
++ tcp_parse_mptcp_options(skb, &mopt);
++
++ if (mopt.is_mp_join)
++ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
++ if (mopt.drop_me)
++ goto drop;
++
++ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
++ mopt.saw_mpc = 0;
++
++ if (skb->protocol == htons(ETH_P_IP)) {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (skb_rtable(skb)->rt_flags &
++ (RTCF_BROADCAST | RTCF_MULTICAST))
++ goto drop;
++
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_request_sock_ipv4_ops,
++ sk, skb);
++ }
++
++ return tcp_v4_conn_request(sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ if (mopt.saw_mpc && !want_cookie) {
++ if (!ipv6_unicast_destination(skb))
++ goto drop;
++
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_request_sock_ipv6_ops,
++ sk, skb);
++ }
++
++ return tcp_v6_conn_request(sk, skb);
++#endif
++ }
++drop:
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
++ return 0;
++}
++
++struct workqueue_struct *mptcp_wq;
++EXPORT_SYMBOL(mptcp_wq);
++
++/* Output /proc/net/mptcp */
++static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
++{
++ struct tcp_sock *meta_tp;
++ const struct net *net = seq->private;
++ int i, n = 0;
++
++ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
++ seq_putc(seq, '\n');
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ struct hlist_nulls_node *node;
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node,
++ &tk_hashtable[i], tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp;
++ struct inet_sock *isk = inet_sk(meta_sk);
++
++ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
++ continue;
++
++ if (capable(CAP_NET_ADMIN)) {
++ seq_printf(seq, "%4d: %04X %04X ", n++,
++ mpcb->mptcp_loc_token,
++ mpcb->mptcp_rem_token);
++ } else {
++ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
++ }
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
++ isk->inet_rcv_saddr,
++ ntohs(isk->inet_sport),
++ isk->inet_daddr,
++ ntohs(isk->inet_dport));
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (meta_sk->sk_family == AF_INET6) {
++ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
++ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
++ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
++ src->s6_addr32[0], src->s6_addr32[1],
++ src->s6_addr32[2], src->s6_addr32[3],
++ ntohs(isk->inet_sport),
++ dst->s6_addr32[0], dst->s6_addr32[1],
++ dst->s6_addr32[2], dst->s6_addr32[3],
++ ntohs(isk->inet_dport));
++#endif
++ }
++ seq_printf(seq, " %02X %02X %08X:%08X %lu",
++ meta_sk->sk_state, mpcb->cnt_subflows,
++ meta_tp->write_seq - meta_tp->snd_una,
++ max_t(int, meta_tp->rcv_nxt -
++ meta_tp->copied_seq, 0),
++ sock_i_ino(meta_sk));
++ seq_putc(seq, '\n');
++ }
++
++ rcu_read_unlock_bh();
++ }
++
++ return 0;
++}
++
++static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_pm_seq_show);
++}
++
++static const struct file_operations mptcp_pm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_pm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_pm_init_net(struct net *net)
++{
++ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
++ return -ENOMEM;
++
++ return 0;
++}
++
++static void mptcp_pm_exit_net(struct net *net)
++{
++ remove_proc_entry("mptcp", net->proc_net);
++}
++
++static struct pernet_operations mptcp_pm_proc_ops = {
++ .init = mptcp_pm_init_net,
++ .exit = mptcp_pm_exit_net,
++};
++
++/* General initialization of mptcp */
++void __init mptcp_init(void)
++{
++ int i;
++ struct ctl_table_header *mptcp_sysctl;
++
++ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
++ sizeof(struct mptcp_tcp_sock),
++ 0, SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_sock_cache)
++ goto mptcp_sock_cache_failed;
++
++ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_cb_cache)
++ goto mptcp_cb_cache_failed;
++
++ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
++ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++ if (!mptcp_tw_cache)
++ goto mptcp_tw_cache_failed;
++
++ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
++
++ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
++ if (!mptcp_wq)
++ goto alloc_workqueue_failed;
++
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
++ i + MPTCP_REQSK_NULLS_BASE);
++ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
++ }
++
++ spin_lock_init(&mptcp_reqsk_hlock);
++ spin_lock_init(&mptcp_tk_hashlock);
++
++ if (register_pernet_subsys(&mptcp_pm_proc_ops))
++ goto pernet_failed;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ if (mptcp_pm_v6_init())
++ goto mptcp_pm_v6_failed;
++#endif
++ if (mptcp_pm_v4_init())
++ goto mptcp_pm_v4_failed;
++
++ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
++ if (!mptcp_sysctl)
++ goto register_sysctl_failed;
++
++ if (mptcp_register_path_manager(&mptcp_pm_default))
++ goto register_pm_failed;
++
++ if (mptcp_register_scheduler(&mptcp_sched_default))
++ goto register_sched_failed;
++
++ pr_info("MPTCP: Stable release v0.89.0-rc");
++
++ mptcp_init_failed = false;
++
++ return;
++
++register_sched_failed:
++ mptcp_unregister_path_manager(&mptcp_pm_default);
++register_pm_failed:
++ unregister_net_sysctl_table(mptcp_sysctl);
++register_sysctl_failed:
++ mptcp_pm_v4_undo();
++mptcp_pm_v4_failed:
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_pm_v6_undo();
++mptcp_pm_v6_failed:
++#endif
++ unregister_pernet_subsys(&mptcp_pm_proc_ops);
++pernet_failed:
++ destroy_workqueue(mptcp_wq);
++alloc_workqueue_failed:
++ kmem_cache_destroy(mptcp_tw_cache);
++mptcp_tw_cache_failed:
++ kmem_cache_destroy(mptcp_cb_cache);
++mptcp_cb_cache_failed:
++ kmem_cache_destroy(mptcp_sock_cache);
++mptcp_sock_cache_failed:
++ mptcp_init_failed = true;
++}
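The error paths of mptcp_init() above follow the standard kernel goto-unwind pattern: each successful acquisition gets a label, and a later failure jumps to the label that releases everything acquired so far, in reverse order. A minimal userspace sketch (resource indices and helper names are illustrative):

```c
#include <assert.h>

/* Track three fake resources; acquire(which, fail_at) simulates an
 * allocation that fails when which == fail_at.
 */
static int acquired[3];

static int acquire(int which, int fail_at)
{
	if (which == fail_at)
		return -1;	/* simulated allocation failure */
	acquired[which] = 1;
	return 0;
}

static void release(int which)
{
	acquired[which] = 0;
}

static int init_all(int fail_at)
{
	if (acquire(0, fail_at))
		goto fail_a;
	if (acquire(1, fail_at))
		goto fail_b;
	if (acquire(2, fail_at))
		goto fail_c;
	return 0;

fail_c:			/* unwind in reverse acquisition order */
	release(1);
fail_b:
	release(0);
fail_a:
	return -1;
}
```

A failure at any step leaves no resource held, which is exactly what the cascade of labels from `register_sched_failed` down to `mptcp_sock_cache_failed` guarantees above.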
+diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
+new file mode 100644
+index 000000000000..3a54413ce25b
+--- /dev/null
++++ b/net/mptcp/mptcp_fullmesh.c
+@@ -0,0 +1,1722 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#include <net/addrconf.h>
++#endif
++
++enum {
++ MPTCP_EVENT_ADD = 1,
++ MPTCP_EVENT_DEL,
++ MPTCP_EVENT_MOD,
++};
++
++#define MPTCP_SUBFLOW_RETRY_DELAY 1000
++
++/* Max number of local or remote addresses we can store.
++ * When changing, see the bitfield below in fullmesh_rem4/6.
++ */
++#define MPTCP_MAX_ADDR 8
++
++struct fullmesh_rem4 {
++ u8 rem4_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in_addr addr;
++};
++
++struct fullmesh_rem6 {
++ u8 rem6_id;
++ u8 bitfield;
++ u8 retry_bitfield;
++ __be16 port;
++ struct in6_addr addr;
++};
++
++struct mptcp_loc_addr {
++ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
++ u8 loc4_bits;
++ u8 next_v4_index;
++
++ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
++ u8 loc6_bits;
++ u8 next_v6_index;
++};
++
++struct mptcp_addr_event {
++ struct list_head list;
++ unsigned short family;
++ u8 code:7,
++ low_prio:1;
++ union inet_addr addr;
++};
++
++struct fullmesh_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++ /* Delayed worker, when the routing-tables are not yet ready. */
++ struct delayed_work subflow_retry_work;
++
++ /* Remote addresses */
++ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
++ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
++
++ struct mptcp_cb *mpcb;
++
++ u16 remove_addrs; /* Addresses to remove */
++ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
++ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
++
++ u8 add_addr; /* Are we sending an add_addr? */
++
++ u8 rem4_bits;
++ u8 rem6_bits;
++};
++
++struct mptcp_fm_ns {
++ struct mptcp_loc_addr __rcu *local;
++ spinlock_t local_lock; /* Protecting the above pointer */
++ struct list_head events;
++ struct delayed_work address_worker;
++
++ struct net *net;
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly;
++
++static void full_mesh_create_subflows(struct sock *meta_sk);
++
++static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
++{
++ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
++}
++
++static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
++{
++ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
++}
++
++/* Find the first free index in the bitfield */
++static int __mptcp_find_free_index(u8 bitfield, u8 base)
++{
++ int i;
++
++ /* There are no free bits anyway... */
++ if (bitfield == 0xff)
++ goto exit;
++
++ i = ffs(~(bitfield >> base)) - 1;
++ if (i < 0)
++ goto exit;
++
++ /* No free bits when starting at base, try from 0 on */
++ if (i + base >= sizeof(bitfield) * 8)
++ return __mptcp_find_free_index(bitfield, 0);
++
++ return i + base;
++exit:
++ return -1;
++}
++
++static int mptcp_find_free_index(u8 bitfield)
++{
++ return __mptcp_find_free_index(bitfield, 0);
++}
++
++static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
++ const struct in_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem4 *rem4;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is already in the list --- continue */
++ if (rem4->rem4_id == id &&
++ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
++ return;
++
++ /* This may be the case when the peer is behind a NAT. It is
++ * trying to JOIN, thus sending the JOIN with a certain ID.
++ * However, the src_addr of the IP-packet has been changed. We
++ * update the addr in the list, because this is the address as
++ * OUR BOX sees it.
++ */
++ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
++ __func__, &rem4->addr.s_addr,
++ &addr->s_addr, id);
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem4_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
++ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
++ return;
++ }
++
++ rem4 = &fmp->remaddr4[i];
++
++ /* Address is not known yet, store it */
++ rem4->addr.s_addr = addr->s_addr;
++ rem4->port = port;
++ rem4->bitfield = 0;
++ rem4->retry_bitfield = 0;
++ rem4->rem4_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem4_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr,
++ __be16 port, u8 id)
++{
++ int i;
++ struct fullmesh_rem6 *rem6;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is already in the list --- continue */
++ if (rem6->rem6_id == id &&
++ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
++ return;
++
++ /* This may be the case when the peer is behind a NAT. It is
++ * trying to JOIN, thus sending the JOIN with a certain ID.
++ * However, the src_addr of the IP-packet has been changed. We
++ * update the addr in the list, because this is the address as
++ * OUR BOX sees it.
++ */
++ if (rem6->rem6_id == id) {
++ /* update the address */
++ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
++ __func__, &rem6->addr, addr, id);
++ rem6->addr = *addr;
++ rem6->port = port;
++ mpcb->list_rcvd = 1;
++ return;
++ }
++ }
++
++ i = mptcp_find_free_index(fmp->rem6_bits);
++ /* Do we already have the maximum number of local/remote addresses? */
++ if (i < 0) {
++ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
++ __func__, MPTCP_MAX_ADDR, addr);
++ return;
++ }
++
++ rem6 = &fmp->remaddr6[i];
++
++ /* Address is not known yet, store it */
++ rem6->addr = *addr;
++ rem6->port = port;
++ rem6->bitfield = 0;
++ rem6->retry_bitfield = 0;
++ rem6->rem6_id = id;
++ mpcb->list_rcvd = 1;
++ fmp->rem6_bits |= (1 << i);
++
++ return;
++}
++
++static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].rem4_id == id) {
++ /* remove address from bitfield */
++ fmp->rem4_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (fmp->remaddr6[i].rem6_id == id) {
++ /* remove address from bitfield */
++ fmp->rem6_bits &= ~(1 << i);
++
++ break;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
++ const struct in_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
++ fmp->remaddr4[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++/* Sets the bitfield of the remote-address field */
++static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const struct in6_addr *addr, u8 index)
++{
++ int i;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
++ fmp->remaddr6[i].bitfield |= (1 << index);
++ return;
++ }
++ }
++}
++
++static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
++ else
++ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
++}
++
++static void retry_subflow_worker(struct work_struct *work)
++{
++ struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct fullmesh_priv *fmp = container_of(delayed_work,
++ struct fullmesh_priv,
++ subflow_retry_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, i;
++
++ /* We need a local (stable) copy of the address-list. Really, it is not
++ * such a big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
++ /* Do we need to retry establishing a subflow ? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
++
++ /* Do we need to retry establishing a subflow ? */
++ if (rem->retry_bitfield) {
++ int i = mptcp_find_free_index(~rem->retry_bitfield);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++ rem->retry_bitfield &= ~(1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
++ goto next_subflow;
++ }
++ }
++#endif
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++/**
++ * Create all new subflows, by calling mptcp_initX_subsockets().
++ *
++ * This function uses a goto next_subflow to allow releasing the lock between
++ * new subflows and giving other processes a chance to do some work on the
++ * socket and potentially finishing the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = fmp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int iter = 0, retry = 0;
++ int i;
++
++ /* We need a local (stable) copy of the address-list. Really, it is not
++ * such a big deal if the address-list is not 100% up-to-date.
++ */
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
++ rcu_read_unlock_bh();
++
++ if (!mptcp_local)
++ return;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ struct fullmesh_rem4 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr4[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem4 rem4;
++
++ rem->bitfield |= (1 << i);
++
++ rem4.addr = rem->addr;
++ rem4.port = rem->port;
++ rem4.rem4_id = rem->rem4_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
++ &rem4) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ struct fullmesh_rem6 *rem;
++ u8 remaining_bits;
++
++ rem = &fmp->remaddr6[i];
++ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
++
++ /* Are there still combinations to handle? */
++ if (remaining_bits) {
++ int i = mptcp_find_free_index(~remaining_bits);
++ struct mptcp_rem6 rem6;
++
++ rem->bitfield |= (1 << i);
++
++ rem6.addr = rem->addr;
++ rem6.port = rem->port;
++ rem6.rem6_id = rem->rem6_id;
++
++ /* If a route is not yet available then retry once */
++ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
++ &rem6) == -ENETUNREACH)
++ retry = rem->retry_bitfield |= (1 << i);
++ goto next_subflow;
++ }
++ }
++#endif
++
++ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
++ sock_hold(meta_sk);
++ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
++ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
++ }
++
++exit:
++ kfree(mptcp_local);
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct sock *sk = mptcp_select_ack_sock(meta_sk);
++
++ fmp->remove_addrs |= (1 << addr_id);
++ mpcb->addr_signal = 1;
++
++ if (sk)
++ tcp_send_ack(sk);
++}
++
++static void update_addr_bitfields(struct sock *meta_sk,
++ const struct mptcp_loc_addr *mptcp_local)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ int i;
++
++ /* The bits in announced_addrs_* always match with loc*_bits. So, a
++ * simple & operation unsets the correct bits, because these go from
++ * announced to non-announced.
++ */
++ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
++
++ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
++ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
++ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
++ }
++
++ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
++ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
++ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
++ }
++}
++
++static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
++ sa_family_t family, const union inet_addr *addr)
++{
++ int i;
++ u8 loc_bits;
++ bool found = false;
++
++ if (family == AF_INET)
++ loc_bits = mptcp_local->loc4_bits;
++ else
++ loc_bits = mptcp_local->loc6_bits;
++
++ mptcp_for_each_bit_set(loc_bits, i) {
++ if (family == AF_INET &&
++ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
++ found = true;
++ break;
++ }
++ if (family == AF_INET6 &&
++ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
++ &addr->in6)) {
++ found = true;
++ break;
++ }
++ }
++
++ if (!found)
++ return -1;
++
++ return i;
++}
++
++static void mptcp_address_worker(struct work_struct *work)
++{
++ const struct delayed_work *delayed_work = container_of(work,
++ struct delayed_work,
++ work);
++ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
++ struct mptcp_fm_ns,
++ address_worker);
++ struct net *net = fm_ns->net;
++ struct mptcp_addr_event *event = NULL;
++ struct mptcp_loc_addr *mptcp_local, *old;
++ int i, id = -1; /* id is used in the socket-code on a delete-event */
++ bool success; /* Used to indicate if we succeeded handling the event */
++
++next_event:
++ success = false;
++ kfree(event);
++
++ /* First, let's dequeue an event from our event-list */
++ rcu_read_lock_bh();
++ spin_lock(&fm_ns->local_lock);
++
++ event = list_first_entry_or_null(&fm_ns->events,
++ struct mptcp_addr_event, list);
++ if (!event) {
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++ return;
++ }
++
++ list_del(&event->list);
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
++
++ /* Not in the list - so we don't care */
++ if (id < 0) {
++ mptcp_debug("%s could not find id\n", __func__);
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET)
++ mptcp_local->loc4_bits &= ~(1 << id);
++ else
++ mptcp_local->loc6_bits &= ~(1 << id);
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ } else {
++ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
++ int j = i;
++
++ if (j < 0) {
++ /* Not in the list, so we have to find an empty slot */
++ if (event->family == AF_INET)
++ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
++ mptcp_local->next_v4_index);
++ if (event->family == AF_INET6)
++ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
++ mptcp_local->next_v6_index);
++
++ if (i < 0) {
++ mptcp_debug("%s no more space\n", __func__);
++ goto duno;
++ }
++
++ /* It might have been a MOD-event. */
++ event->code = MPTCP_EVENT_ADD;
++ } else {
++ /* Let's check if anything changes */
++ if (event->family == AF_INET &&
++ event->low_prio == mptcp_local->locaddr4[i].low_prio)
++ goto duno;
++
++ if (event->family == AF_INET6 &&
++ event->low_prio == mptcp_local->locaddr6[i].low_prio)
++ goto duno;
++ }
++
++ old = mptcp_local;
++ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
++ GFP_ATOMIC);
++ if (!mptcp_local)
++ goto duno;
++
++ if (event->family == AF_INET) {
++ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
++ mptcp_local->locaddr4[i].loc4_id = i + 1;
++ mptcp_local->locaddr4[i].low_prio = event->low_prio;
++ } else {
++ mptcp_local->locaddr6[i].addr = event->addr.in6;
++ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
++ mptcp_local->locaddr6[i].low_prio = event->low_prio;
++ }
++
++ if (j < 0) {
++ if (event->family == AF_INET) {
++ mptcp_local->loc4_bits |= (1 << i);
++ mptcp_local->next_v4_index = i + 1;
++ } else {
++ mptcp_local->loc6_bits |= (1 << i);
++ mptcp_local->next_v6_index = i + 1;
++ }
++ }
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ kfree(old);
++ }
++ success = true;
++
++duno:
++ spin_unlock(&fm_ns->local_lock);
++ rcu_read_unlock_bh();
++
++ if (!success)
++ goto next_event;
++
++ /* Now we iterate over the MPTCP-sockets and apply the event. */
++ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
++ const struct hlist_nulls_node *node;
++ struct tcp_sock *meta_tp;
++
++ rcu_read_lock_bh();
++ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
++ tk_table) {
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ if (sock_net(meta_sk) != net)
++ continue;
++
++ if (meta_v4) {
++ /* skip IPv6 events if meta is IPv4 */
++ if (event->family == AF_INET6)
++ continue;
++ }
++ /* skip IPv4 events if IPV6_V6ONLY is set */
++ else if (event->family == AF_INET &&
++ inet6_sk(meta_sk)->ipv6only)
++ continue;
++
++ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ continue;
++
++ bh_lock_sock(meta_sk);
++
++ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
++ mpcb->infinite_mapping_snd ||
++ mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping)
++ goto next;
++
++ /* May be that the pm has changed in-between */
++ if (mpcb->pm_ops != &full_mesh)
++ goto next;
++
++ if (sock_owned_by_user(meta_sk)) {
++ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
++ &meta_tp->tsq_flags))
++ sock_hold(meta_sk);
++
++ goto next;
++ }
++
++ if (event->code == MPTCP_EVENT_ADD) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++
++ full_mesh_create_subflows(meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_DEL) {
++ struct sock *sk, *tmpsk;
++ struct mptcp_loc_addr *mptcp_local;
++ bool found = false;
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++
++ /* In any case, we need to update our bitfields */
++ if (id >= 0)
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ /* Look for the socket and remove it */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ if ((event->family == AF_INET6 &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))) ||
++ (event->family == AF_INET &&
++ (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))))
++ continue;
++
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
++ continue;
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
++ continue;
++
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ /* We announce the removal of this id */
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
++
++ mptcp_sub_force_close(sk);
++ found = true;
++ }
++
++ if (found)
++ goto next;
++
++ /* The id may have been given by the event,
++ * matching on a local address. And it may not
++ * have matched on one of the above sockets,
++ * because the client never created a subflow.
++ * So, we have to finally remove it here.
++ */
++ if (id > 0)
++ announce_remove_addr(id, meta_sk);
++ }
++
++ if (event->code == MPTCP_EVENT_MOD) {
++ struct sock *sk;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ if (event->family == AF_INET &&
++ (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk)) &&
++ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (event->family == AF_INET6 &&
++ sk->sk_family == AF_INET6 &&
++ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
++ if (event->low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = event->low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++ }
++ }
++next:
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk);
++ }
++ rcu_read_unlock_bh();
++ }
++ goto next_event;
++}
++
++static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
++ const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ list_for_each_entry(eventq, &fm_ns->events, list) {
++ if (eventq->family != event->family)
++ continue;
++ if (event->family == AF_INET) {
++ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
++ return eventq;
++ } else {
++ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
++ return eventq;
++ }
++ }
++ return NULL;
++}
++
++/* We already hold the net-namespace MPTCP-lock */
++static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
++{
++ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++
++ if (eventq) {
++ switch (event->code) {
++ case MPTCP_EVENT_DEL:
++ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
++ list_del(&eventq->list);
++ kfree(eventq);
++ break;
++ case MPTCP_EVENT_ADD:
++ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_ADD;
++ return;
++ case MPTCP_EVENT_MOD:
++ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
++ eventq->low_prio = event->low_prio;
++ eventq->code = MPTCP_EVENT_MOD;
++ return;
++ }
++ }
++
++ /* OK, we have to add the new address to the wait queue */
++ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
++ if (!eventq)
++ return;
++
++ list_add_tail(&eventq->list, &fm_ns->events);
++
++ /* Create work-queue */
++ if (!delayed_work_pending(&fm_ns->address_worker))
++ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
++ msecs_to_jiffies(500));
++}
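`add_pm_event()` keeps at most one pending entry per address: an incoming ADD or MOD is folded into the queued entry in place, while a DEL frees the queued entry and is enqueued afresh. Either way, the code that survives is always the newest one. A trivial model of that invariant, with hypothetical names:

```c
#include <assert.h>

/* Illustrative model of add_pm_event() coalescing: whatever was
 * queued for an address, the latest reported event code wins. */
enum mp_ev { MP_EV_ADD, MP_EV_MOD, MP_EV_DEL };

static enum mp_ev coalesce(enum mp_ev queued, enum mp_ev incoming)
{
	(void)queued;	/* the older code never survives coalescing */
	return incoming;
}
```

The 500 ms `queue_delayed_work()` call then lets the worker process a whole batch of such coalesced events in one pass.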
++
++static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->ifa_dev->dev;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->ifa_scope > RT_SCOPE_LINK ||
++ ipv4_is_loopback(ifa->ifa_local))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET;
++ mpevent.addr.in.s_addr = ifa->ifa_local;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
++ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv4-addr add/rem-events */
++static int mptcp_pm_inetaddr_event(struct notifier_block *this,
++ unsigned long event, void *ptr)
++{
++ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
++ struct net *net = dev_net(ifa->ifa_dev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ addr4_event_handler(ifa, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_inetaddr_notifier = {
++ .notifier_call = mptcp_pm_inetaddr_event,
++};
++
++#if IS_ENABLED(CONFIG_IPV6)
++
++/* IPV6-related address/interface watchers */
++struct mptcp_dad_data {
++ struct timer_list timer;
++ struct inet6_ifaddr *ifa;
++};
++
++static void dad_callback(unsigned long arg);
++static int inet6_addr_event(struct notifier_block *this,
++ unsigned long event, void *ptr);
++
++static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++{
++ return (ifa->flags & IFA_F_TENTATIVE) &&
++ ifa->state == INET6_IFADDR_STATE_DAD;
++}
++
++static void dad_init_timer(struct mptcp_dad_data *data,
++ struct inet6_ifaddr *ifa)
++{
++ data->ifa = ifa;
++ data->timer.data = (unsigned long)data;
++ data->timer.function = dad_callback;
++ if (ifa->idev->cnf.rtr_solicit_delay)
++ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
++ else
++ data->timer.expires = jiffies + (HZ/10);
++}
++
++static void dad_callback(unsigned long arg)
++{
++ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
++
++ if (ipv6_is_in_dad_state(data->ifa)) {
++ dad_init_timer(data, data->ifa);
++ add_timer(&data->timer);
++ } else {
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++ in6_ifa_put(data->ifa);
++ kfree(data);
++ }
++}
++
++static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
++{
++ struct mptcp_dad_data *data;
++
++ data = kmalloc(sizeof(*data), GFP_ATOMIC);
++
++ if (!data)
++ return;
++
++ init_timer(&data->timer);
++ dad_init_timer(data, ifa);
++ add_timer(&data->timer);
++ in6_ifa_hold(ifa);
++}
++
++static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
++ struct net *net)
++{
++ const struct net_device *netdev = ifa->idev->dev;
++ int addr_type = ipv6_addr_type(&ifa->addr);
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ struct mptcp_addr_event mpevent;
++
++ if (ifa->scope > RT_SCOPE_LINK ||
++ addr_type == IPV6_ADDR_ANY ||
++ (addr_type & IPV6_ADDR_LOOPBACK) ||
++ (addr_type & IPV6_ADDR_LINKLOCAL))
++ return;
++
++ spin_lock_bh(&fm_ns->local_lock);
++
++ mpevent.family = AF_INET6;
++ mpevent.addr.in6 = ifa->addr;
++ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
++
++ if (event == NETDEV_DOWN || !netif_running(netdev) ||
++ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
++ mpevent.code = MPTCP_EVENT_DEL;
++ else if (event == NETDEV_UP)
++ mpevent.code = MPTCP_EVENT_ADD;
++ else if (event == NETDEV_CHANGE)
++ mpevent.code = MPTCP_EVENT_MOD;
++
++ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
++ &ifa->addr, mpevent.code, mpevent.low_prio);
++ add_pm_event(net, &mpevent);
++
++ spin_unlock_bh(&fm_ns->local_lock);
++ return;
++}
++
++/* React on IPv6-addr add/rem-events */
++static int inet6_addr_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
++ struct net *net = dev_net(ifa6->idev->dev);
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ if (ipv6_is_in_dad_state(ifa6))
++ dad_setup_timer(ifa6);
++ else
++ addr6_event_handler(ifa6, event, net);
++
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block inet6_addr_notifier = {
++ .notifier_call = inet6_addr_event,
++};
++
++#endif
++
++/* React on ifup/down-events */
++static int netdev_event(struct notifier_block *this, unsigned long event,
++ void *ptr)
++{
++ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
++ struct in_device *in_dev;
++#if IS_ENABLED(CONFIG_IPV6)
++ struct inet6_dev *in6_dev;
++#endif
++
++ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
++ event == NETDEV_CHANGE))
++ return NOTIFY_DONE;
++
++ rcu_read_lock();
++ in_dev = __in_dev_get_rtnl(dev);
++
++ if (in_dev) {
++ for_ifa(in_dev) {
++ mptcp_pm_inetaddr_event(NULL, event, ifa);
++ } endfor_ifa(in_dev);
++ }
++
++#if IS_ENABLED(CONFIG_IPV6)
++ in6_dev = __in6_dev_get(dev);
++
++ if (in6_dev) {
++ struct inet6_ifaddr *ifa6;
++ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
++ inet6_addr_event(NULL, event, ifa6);
++ }
++#endif
++
++ rcu_read_unlock();
++ return NOTIFY_DONE;
++}
++
++static struct notifier_block mptcp_pm_netdev_notifier = {
++ .notifier_call = netdev_event,
++};
++
++static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
++ const union inet_addr *addr,
++ sa_family_t family, __be16 port, u8 id)
++{
++ if (family == AF_INET)
++ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
++ else
++ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
++}
++
++static void full_mesh_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ int i, index;
++ union inet_addr saddr, daddr;
++ sa_family_t family;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ /* Init local variables necessary for the rest */
++ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
++ saddr.ip = inet_sk(meta_sk)->inet_saddr;
++ daddr.ip = inet_sk(meta_sk)->inet_daddr;
++ family = AF_INET;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ saddr.in6 = inet6_sk(meta_sk)->saddr;
++ daddr.in6 = meta_sk->sk_v6_daddr;
++ family = AF_INET6;
++#endif
++ }
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, &saddr);
++ if (index < 0)
++ goto fallback;
++
++ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
++ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
++ fmp->mpcb = mpcb;
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* Look for the address among the local addresses */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET && saddr.ip == ifa_address)
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto skip_ipv6;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
++
++ /* We do not need to announce the initial subflow's address again */
++ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
++ continue;
++
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++ }
++
++skip_ipv6:
++#endif
++
++ rcu_read_unlock();
++
++ if (family == AF_INET)
++ fmp->announced_addrs_v4 |= (1 << index);
++ else
++ fmp->announced_addrs_v6 |= (1 << index);
++
++ for (i = fmp->add_addr; i && fmp->add_addr; i--)
++ tcp_send_ack(mpcb->master_sk);
++
++ return;
++
++fallback:
++ rcu_read_unlock();
++ mptcp_fallback_default(mpcb);
++ return;
++}
++
++static void full_mesh_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ return;
++
++ if (!work_pending(&fmp->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &fmp->subflow_work);
++ }
++}
++
++/* Called upon release_sock, if the socket was owned by the user during
++ * a path-management event.
++ */
++static void full_mesh_release_sock(struct sock *meta_sk)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
++ struct sock *sk, *tmpsk;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++ int i;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* First, detect modifications or additions */
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET6 &&
++ !mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++skip_ipv4:
++#if IS_ENABLED(CONFIG_IPV6)
++ /* skip IPv6 addresses if meta-socket is IPv4 */
++ if (meta_v4)
++ goto removal;
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
++ bool found = false;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(sk))
++ continue;
++
++ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
++ continue;
++
++ found = true;
++
++ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
++ tp->mptcp->send_mp_prio = 1;
++ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
++
++ tcp_send_ack(sk);
++ }
++ }
++
++ if (!found) {
++ fmp->add_addr++;
++ mpcb->addr_signal = 1;
++
++ sk = mptcp_select_ack_sock(meta_sk);
++ if (sk)
++ tcp_send_ack(sk);
++ full_mesh_create_subflows(meta_sk);
++ }
++ }
++
++removal:
++#endif
++
++ /* Now, detect address-removals */
++ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
++ bool shall_remove = true;
++
++ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
++ shall_remove = false;
++ break;
++ }
++ }
++ } else {
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
++ shall_remove = false;
++ break;
++ }
++ }
++ }
++
++ if (shall_remove) {
++ /* Reinject, so that pf = 1 and so we
++ * won't select this one as the
++ * ack-sock.
++ */
++ mptcp_reinject_data(sk, 0);
++
++ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
++ meta_sk);
++
++ mptcp_sub_force_close(sk);
++ }
++ }
++
++ /* Just call it optimistically. It actually cannot do any harm */
++ update_addr_bitfields(meta_sk, mptcp_local);
++
++ rcu_read_unlock();
++}
++
++static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int index, id = -1;
++
++ /* Handle the backup-flows */
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ index = mptcp_find_address(mptcp_local, family, addr);
++
++ if (index != -1) {
++ if (family == AF_INET) {
++ id = mptcp_local->locaddr4[index].loc4_id;
++ *low_prio = mptcp_local->locaddr4[index].low_prio;
++ } else {
++ id = mptcp_local->locaddr6[index].loc6_id;
++ *low_prio = mptcp_local->locaddr6[index].low_prio;
++ }
++ }
++
++
++ rcu_read_unlock();
++
++ return id;
++}
++
++static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
++ struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
++ int remove_addr_len;
++ u8 unannouncedv4 = 0, unannouncedv6 = 0;
++ bool meta_v4 = meta_sk->sk_family == AF_INET;
++
++ mpcb->addr_signal = 0;
++
++ if (likely(!fmp->add_addr))
++ goto remove_addr;
++
++ rcu_read_lock();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
++ goto skip_ipv4;
++
++ /* IPv4 */
++ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
++ if (unannouncedv4 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv4);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
++ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
++ opts->add_addr_v4 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v4 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
++ }
++
++ if (meta_v4)
++ goto skip_ipv6;
++
++skip_ipv4:
++ /* IPv6 */
++ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
++ if (unannouncedv6 &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
++ int ind = mptcp_find_free_index(~unannouncedv6);
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_ADD_ADDR;
++ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
++ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
++ opts->add_addr_v6 = 1;
++
++ if (skb) {
++ fmp->announced_addrs_v6 |= (1 << ind);
++ fmp->add_addr--;
++ }
++ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
++ }
++
++skip_ipv6:
++ rcu_read_unlock();
++
++ if (!unannouncedv4 && !unannouncedv6 && skb)
++ fmp->add_addr--;
++
++remove_addr:
++ if (likely(!fmp->remove_addrs))
++ goto exit;
++
++ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
++ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
++ goto exit;
++
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_REMOVE_ADDR;
++ opts->remove_addrs = fmp->remove_addrs;
++ *size += remove_addr_len;
++ if (skb)
++ fmp->remove_addrs = 0;
++
++exit:
++ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
++}
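The ADD_ADDR selection in `full_mesh_addr_signal()` reduces to bitmask arithmetic: an address is still to be advertised if it is configured locally (`loc4_bits`/`loc6_bits`) but its bit is not yet set in the per-connection announced mask. A standalone sketch with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the computation
 *   unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
 * returns the mask of local addresses not yet announced. */
static uint8_t unannounced(uint8_t announced, uint8_t loc_bits)
{
	return (uint8_t)(~announced & loc_bits);
}
```

`mptcp_find_free_index(~unannouncedv4)` then picks one set bit out of that mask to build the option for this segment.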
++
++static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
++{
++ mptcp_v4_rem_raddress(mpcb, rem_id);
++ mptcp_v6_rem_raddress(mpcb, rem_id);
++}
++
++/* Output /proc/net/mptcp_fullmesh */
++static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
++{
++ const struct net *net = seq->private;
++ struct mptcp_loc_addr *mptcp_local;
++ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
++ int i;
++
++ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
++
++ rcu_read_lock_bh();
++ mptcp_local = rcu_dereference(fm_ns->local);
++
++ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
++ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
++ loc4->low_prio, &loc4->addr);
++ }
++
++ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
++
++ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
++ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
++
++ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
++ loc6->low_prio, &loc6->addr);
++ }
++ rcu_read_unlock_bh();
++
++ return 0;
++}
++
++static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
++{
++ return single_open_net(inode, file, mptcp_fm_seq_show);
++}
++
++static const struct file_operations mptcp_fm_seq_fops = {
++ .owner = THIS_MODULE,
++ .open = mptcp_fm_seq_open,
++ .read = seq_read,
++ .llseek = seq_lseek,
++ .release = single_release_net,
++};
++
++static int mptcp_fm_init_net(struct net *net)
++{
++ struct mptcp_loc_addr *mptcp_local;
++ struct mptcp_fm_ns *fm_ns;
++ int err = 0;
++
++ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
++ if (!fm_ns)
++ return -ENOBUFS;
++
++ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
++ if (!mptcp_local) {
++ err = -ENOBUFS;
++ goto err_mptcp_local;
++ }
++
++ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
++ &mptcp_fm_seq_fops)) {
++ err = -ENOMEM;
++ goto err_seq_fops;
++ }
++
++ mptcp_local->next_v4_index = 1;
++
++ rcu_assign_pointer(fm_ns->local, mptcp_local);
++ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
++ INIT_LIST_HEAD(&fm_ns->events);
++ spin_lock_init(&fm_ns->local_lock);
++ fm_ns->net = net;
++ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
++
++ return 0;
++err_seq_fops:
++ kfree(mptcp_local);
++err_mptcp_local:
++ kfree(fm_ns);
++ return err;
++}
++
++static void mptcp_fm_exit_net(struct net *net)
++{
++ struct mptcp_addr_event *eventq, *tmp;
++ struct mptcp_fm_ns *fm_ns;
++ struct mptcp_loc_addr *mptcp_local;
++
++ fm_ns = fm_get_ns(net);
++ cancel_delayed_work_sync(&fm_ns->address_worker);
++
++ rcu_read_lock_bh();
++
++ mptcp_local = rcu_dereference_bh(fm_ns->local);
++ kfree(mptcp_local);
++
++ spin_lock(&fm_ns->local_lock);
++ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
++ list_del(&eventq->list);
++ kfree(eventq);
++ }
++ spin_unlock(&fm_ns->local_lock);
++
++ rcu_read_unlock_bh();
++
++ remove_proc_entry("mptcp_fullmesh", net->proc_net);
++
++ kfree(fm_ns);
++}
++
++static struct pernet_operations full_mesh_net_ops = {
++ .init = mptcp_fm_init_net,
++ .exit = mptcp_fm_exit_net,
++};
++
++static struct mptcp_pm_ops full_mesh __read_mostly = {
++ .new_session = full_mesh_new_session,
++ .release_sock = full_mesh_release_sock,
++ .fully_established = full_mesh_create_subflows,
++ .new_remote_address = full_mesh_create_subflows,
++ .get_local_id = full_mesh_get_local_id,
++ .addr_signal = full_mesh_addr_signal,
++ .add_raddr = full_mesh_add_raddr,
++ .rem_raddr = full_mesh_rem_raddr,
++ .name = "fullmesh",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init full_mesh_register(void)
++{
++ int ret;
++
++ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
++
++ ret = register_pernet_subsys(&full_mesh_net_ops);
++ if (ret)
++ goto out;
++
++ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ if (ret)
++ goto err_reg_inetaddr;
++ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ if (ret)
++ goto err_reg_netdev;
++
++#if IS_ENABLED(CONFIG_IPV6)
++ ret = register_inet6addr_notifier(&inet6_addr_notifier);
++ if (ret)
++ goto err_reg_inet6addr;
++#endif
++
++ ret = mptcp_register_path_manager(&full_mesh);
++ if (ret)
++ goto err_reg_pm;
++
++out:
++ return ret;
++
++
++err_reg_pm:
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++err_reg_inet6addr:
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++err_reg_netdev:
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++err_reg_inetaddr:
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ goto out;
++}
++
++static void full_mesh_unregister(void)
++{
++#if IS_ENABLED(CONFIG_IPV6)
++ unregister_inet6addr_notifier(&inet6_addr_notifier);
++#endif
++ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
++ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
++ unregister_pernet_subsys(&full_mesh_net_ops);
++ mptcp_unregister_path_manager(&full_mesh);
++}
++
++module_init(full_mesh_register);
++module_exit(full_mesh_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("Full-Mesh MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
+new file mode 100644
+index 000000000000..43704ccb639e
+--- /dev/null
++++ b/net/mptcp/mptcp_input.c
+@@ -0,0 +1,2405 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <asm/unaligned.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++
++#include <linux/kconfig.h>
++
++/* is seq1 < seq2 ? */
++static inline bool before64(const u64 seq1, const u64 seq2)
++{
++ return (s64)(seq1 - seq2) < 0;
++}
++
++/* is seq1 > seq2 ? */
++#define after64(seq1, seq2) before64(seq2, seq1)
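`before64()` above is the standard wrap-safe sequence comparison: the unsigned subtraction wraps modulo 2^64, so casting the difference to a signed type orders two sequence numbers correctly even across a wrap, as long as they lie within 2^63 of each other. The same trick in a standalone form:

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "is seq1 < seq2?" on 64-bit sequence space, as in the
 * patch's before64()/after64(). */
static int before64(uint64_t seq1, uint64_t seq2)
{
	return (int64_t)(seq1 - seq2) < 0;
}

#define after64(seq1, seq2) before64(seq2, seq1)
```

Note that `UINT64_MAX` compares as "before" `0` here, which is exactly the behavior wanted for a sequence space that wraps.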
++
++static inline void mptcp_become_fully_estab(struct sock *sk)
++{
++ tcp_sk(sk)->mptcp->fully_established = 1;
++
++ if (is_master_tp(tcp_sk(sk)) &&
++ tcp_sk(sk)->mpcb->pm_ops->fully_established)
++ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
++}
++
++/* Similar to tcp_tso_acked without any memory accounting */
++static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
++ struct sk_buff *skb)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 packets_acked, len;
++
++ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
++
++ packets_acked = tcp_skb_pcount(skb);
++
++ if (skb_unclone(skb, GFP_ATOMIC))
++ return 0;
++
++ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++ skb->truesize -= len;
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
++ packets_acked -= tcp_skb_pcount(skb);
++
++ if (packets_acked) {
++ BUG_ON(tcp_skb_pcount(skb) == 0);
++ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
++ }
++
++ return packets_acked;
++}
++
++/**
++ * Cleans the meta-socket retransmission queue and the reinject-queue.
++ * @meta_sk must be the meta-socket.
++ */
++static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
++{
++ struct sk_buff *skb, *tmp;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ bool acked = false;
++ u32 acked_pcount;
++
++ while ((skb = tcp_write_queue_head(meta_sk)) &&
++ skb != tcp_send_head(meta_sk)) {
++ bool fully_acked = true;
++
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ acked_pcount = tcp_tso_acked(meta_sk, skb);
++ if (!acked_pcount)
++ break;
++
++ fully_acked = false;
++ } else {
++ acked_pcount = tcp_skb_pcount(skb);
++ }
++
++ acked = true;
++ meta_tp->packets_out -= acked_pcount;
++ meta_tp->retrans_stamp = 0;
++
++ if (!fully_acked)
++ break;
++
++ tcp_unlink_write_queue(skb, meta_sk);
++
++ if (mptcp_is_data_fin(skb)) {
++ struct sock *sk_it;
++
++ /* DATA_FIN has been acknowledged - now we can close
++ * the subflows
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ unsigned long delay = 0;
++
++ /* If we are the passive closer, don't trigger
++ * subflow-fin until the subflow has been finned
++ * by the peer - thus we add a delay.
++ */
++ if (mpcb->passive_close &&
++ sk_it->sk_state == TCP_ESTABLISHED)
++ delay = inet_csk(sk_it)->icsk_rto << 3;
++
++ mptcp_sub_close(sk_it, delay);
++ }
++ }
++ sk_wmem_free_skb(meta_sk, skb);
++ }
++ /* Remove acknowledged data from the reinject queue */
++ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
++ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
++ if (tcp_skb_pcount(skb) == 1 ||
++ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
++ break;
++
++ mptcp_tso_acked_reinject(meta_sk, skb);
++ break;
++ }
++
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ }
++
++ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
++ meta_tp->snd_up = meta_tp->snd_una;
++
++ if (acked) {
++ tcp_rearm_rto(meta_sk);
++ /* Normally this is done in tcp_try_undo_loss - but MPTCP
++ * does not call this function.
++ */
++ inet_csk(meta_sk)->icsk_retransmits = 0;
++ }
++}
++
++/* Inspired by tcp_rcv_state_process */
++static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
++ const struct sk_buff *skb, u32 data_seq,
++ u16 data_len)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ const struct tcphdr *th = tcp_hdr(skb);
++
++ /* State-machine handling if FIN has been enqueued and it has
++ * been acked (snd_una == write_seq) - it's important that this
++ * here is after sk_wmem_free_skb because otherwise
++ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
++ */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1: {
++ struct dst_entry *dst;
++ int tmo;
++
++ if (meta_tp->snd_una != meta_tp->write_seq)
++ break;
++
++ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
++ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
++
++ dst = __sk_dst_get(sk);
++ if (dst)
++ dst_confirm(dst);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ /* Wake up lingering close() */
++ meta_sk->sk_state_change(meta_sk);
++ break;
++ }
++
++ if (meta_tp->linger2 < 0 ||
++ (data_len &&
++ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
++ meta_tp->rcv_nxt))) {
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_done(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ return 1;
++ }
++
++ tmo = tcp_fin_time(meta_sk);
++ if (tmo > TCP_TIMEWAIT_LEN) {
++ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
++ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
++ /* Bad case. We could lose such FIN otherwise.
++ * It is not a big problem, but it looks confusing
++ * and not so rare event. We still can lose it now,
++ * if it spins in bh_lock_sock(), but it is really
++ * marginal case.
++ */
++ inet_csk_reset_keepalive_timer(meta_sk, tmo);
++ } else {
++ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
++ }
++ break;
++ }
++ case TCP_CLOSING:
++ case TCP_LAST_ACK:
++ if (meta_tp->snd_una == meta_tp->write_seq) {
++ tcp_done(meta_sk);
++ return 1;
++ }
++ break;
++ }
++
++ /* step 7: process the segment text */
++ switch (meta_sk->sk_state) {
++ case TCP_FIN_WAIT1:
++ case TCP_FIN_WAIT2:
++ /* RFC 793 says to queue data in these states,
++ * RFC 1122 says we MUST send a reset.
++ * BSD 4.4 also does reset.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
++ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
++ !mptcp_is_data_fin2(skb, tp)) {
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
++ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
++ tcp_reset(meta_sk);
++ return 1;
++ }
++ }
++ break;
++ }
++
++ return 0;
++}
++
++/**
++ * @return:
++ * i) 1: Everything's fine.
++ * ii) -1: A reset has been sent on the subflow - csum-failure
++ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
++ * Last packet should not be destroyed by the caller because it has
++ * been done here.
++ */
++static int mptcp_verif_dss_csum(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1, *last = NULL;
++ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
++ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
++ int iter = 0;
++
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
++ unsigned int csum_len;
++
++ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
++ /* Mapping ends in the middle of the packet -
++ * csum only these bytes
++ */
++ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
++ else
++ csum_len = tmp->len;
++
++ offset = 0;
++ if (overflowed) {
++ char first_word[4];
++ first_word[0] = 0;
++ first_word[1] = 0;
++ first_word[2] = 0;
++ first_word[3] = *(tmp->data);
++ csum_tcp = csum_partial(first_word, 4, csum_tcp);
++ offset = 1;
++ csum_len--;
++ overflowed = 0;
++ }
++
++ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
++
++ /* Was it an odd length? Then we have to merge the next byte
++ * correctly (see above)
++ */
++ if (csum_len != (csum_len & (~1)))
++ overflowed = 1;
++
++ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
++ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
++
++ /* If a 64-bit dss is present, we increase the offset
++ * by 4 bytes, as the high-order 64-bits will be added
++ * in the final csum_partial-call.
++ */
++ u32 offset = skb_transport_offset(tmp) +
++ TCP_SKB_CB(tmp)->dss_off;
++ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
++ offset += 4;
++
++ csum_tcp = skb_checksum(tmp, offset,
++ MPTCP_SUB_LEN_SEQ_CSUM,
++ csum_tcp);
++
++ csum_tcp = csum_partial(&data_seq,
++ sizeof(data_seq), csum_tcp);
++
++ dss_csum_added = 1; /* Just do it once */
++ }
++ last = tmp;
++ iter++;
++
++ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
++ !before(TCP_SKB_CB(tmp1)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ /* Now, checksum must be 0 */
++ if (unlikely(csum_fold(csum_tcp))) {
++ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
++ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
++ dss_csum_added, overflowed, iter);
++
++ tp->mptcp->send_mp_fail = 1;
++
++ /* map_data_seq is the data-seq number of the
++ * mapping we are currently checking
++ */
++ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
++
++ if (tp->mpcb->cnt_subflows > 1) {
++ mptcp_send_reset(sk);
++ ans = -1;
++ } else {
++ tp->mpcb->send_infinite_mapping = 1;
++
++ /* Need to purge the rcv-queue as it's no more valid */
++ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
++ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
++ kfree_skb(tmp);
++ }
++
++ ans = 0;
++ }
++ }
++
++ return ans;
++}
++
++static inline void mptcp_prepare_skb(struct sk_buff *skb,
++ const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 inc = 0;
++
++ /* If skb is the end of this mapping (end is always at mapping-boundary
++ * thanks to the splitting/trimming), then we need to increase
++ * data-end-seq by 1 if this here is a data-fin.
++ *
++ * We need to do -1 because end_seq includes the subflow-FIN.
++ */
++ if (tp->mptcp->map_data_fin &&
++ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
++ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ inc = 1;
++
++ /* We manually set the fin-flag if it is a data-fin. For easy
++ * processing in tcp_recvmsg.
++ */
++ tcp_hdr(skb)->fin = 1;
++ } else {
++ /* We may have a subflow-fin with data but without data-fin */
++ tcp_hdr(skb)->fin = 0;
++ }
++
++ /* Adapt data-seq's to the packet itself. We effectively transform the
++ * dss-mapping to a per-packet granularity. This is necessary to
++ * correctly handle overlapping mappings coming from different
++ * subflows. Otherwise it would be a complete mess.
++ */
++ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
++ tcb->end_seq = tcb->seq + skb->len + inc;
++}
++
++/**
++ * @return: 1 if the segment has been eaten and can be suppressed,
++ * otherwise 0.
++ */
++static inline int mptcp_direct_copy(const struct sk_buff *skb,
++ struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
++ int eaten = 0;
++
++ __set_current_state(TASK_RUNNING);
++
++ local_bh_enable();
++ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
++ meta_tp->ucopy.len -= chunk;
++ meta_tp->copied_seq += chunk;
++ eaten = (chunk == skb->len);
++ tcp_rcv_space_adjust(meta_sk);
++ }
++ local_bh_disable();
++ return eaten;
++}
++
++static inline void mptcp_reset_mapping(struct tcp_sock *tp)
++{
++ tp->mptcp->map_data_len = 0;
++ tp->mptcp->map_data_seq = 0;
++ tp->mptcp->map_subseq = 0;
++ tp->mptcp->map_data_fin = 0;
++ tp->mptcp->mapping_present = 0;
++}
++
++/* The DSS-mapping received on the sk only covers the second half of the skb
++ * (cut at seq). We trim the head from the skb.
++ * Data will be freed upon kfree().
++ *
++ * Inspired by tcp_trim_head().
++ */
++static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ int len = seq - TCP_SKB_CB(skb)->seq;
++ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
++
++ if (len < skb_headlen(skb))
++ __skb_pull(skb, len);
++ else
++ __pskb_trim_head(skb, len - skb_headlen(skb));
++
++ TCP_SKB_CB(skb)->seq = new_seq;
++
++ skb->truesize -= len;
++ atomic_sub(len, &sk->sk_rmem_alloc);
++ sk_mem_uncharge(sk, len);
++}
++
++/* The DSS-mapping received on the sk only covers the first half of the skb
++ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
++ * as further packets may resolve the mapping of the second half of data.
++ *
++ * Inspired by tcp_fragment().
++ */
++static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
++{
++ struct sk_buff *buff;
++ int nsize;
++ int nlen, len;
++
++ len = seq - TCP_SKB_CB(skb)->seq;
++ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
++ if (nsize < 0)
++ nsize = 0;
++
++ /* Get a new skb... force flag on. */
++ buff = alloc_skb(nsize, GFP_ATOMIC);
++ if (buff == NULL)
++ return -ENOMEM;
++
++ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
++ skb_reset_transport_header(buff);
++
++ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
++ tcp_hdr(skb)->fin = 0;
++
++ /* We absolutely need to call skb_set_owner_r before refreshing the
++ * truesize of buff, otherwise the moved data will be accounted twice.
++ */
++ skb_set_owner_r(buff, sk);
++ nlen = skb->len - len - nsize;
++ buff->truesize += nlen;
++ skb->truesize -= nlen;
++
++ /* Correct the sequence numbers. */
++ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
++ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
++ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
++
++ skb_split(skb, buff, len);
++
++ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
++ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
++ !tp->mpcb->infinite_mapping_rcv) {
++ /* Remove a pure subflow-fin from the queue and increase
++ * copied_seq.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* If we are not yet fully established and do not know the mapping for
++ * this segment, this path has to fallback to infinite or be torn down.
++ */
++ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
++ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
++ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
++ __func__, tp->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, __builtin_return_address(0),
++ TCP_SKB_CB(skb)->seq);
++
++ if (!is_master_tp(tp)) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mpcb->infinite_mapping_rcv = 1;
++ /* We do a seamless fallback and should not send an infinite mapping. */
++ tp->mpcb->send_infinite_mapping = 0;
++ tp->mptcp->fully_established = 1;
++ }
++
++ /* Receiver-side becomes fully established when a whole rcv-window has
++ * been received without the need to fallback due to the previous
++ * condition.
++ */
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->init_rcv_wnd -= skb->len;
++ if (tp->mptcp->init_rcv_wnd < 0)
++ mptcp_become_fully_estab(sk);
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 *ptr;
++ u32 data_seq, sub_seq, data_len, tcp_end_seq;
++
++ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
++ * in-order at the data-level. Thus data-seq-numbers can be inferred
++ * from what is expected at the data-level.
++ */
++ if (mpcb->infinite_mapping_rcv) {
++ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
++ tp->mptcp->map_subseq = tcb->seq;
++ tp->mptcp->map_data_len = skb->len;
++ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
++ tp->mptcp->mapping_present = 1;
++ return 0;
++ }
++
++ /* No mapping here? Exit - it is either already set or still on its way */
++ if (!mptcp_is_data_seq(skb)) {
++ /* Too many packets without a mapping - this subflow is broken */
++ if (!tp->mptcp->mapping_present &&
++ tp->rcv_nxt - tp->copied_seq > 65536) {
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ return 0;
++ }
++
++ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
++ ptr++;
++ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
++ ptr++;
++ data_len = get_unaligned_be16(ptr);
++
++ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
++ * The draft sets it to 0, but we really would like to have the
++ * real value, to allow easy handling afterwards here in this
++ * function.
++ */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ sub_seq = TCP_SKB_CB(skb)->seq;
++
++ /* If there is already a mapping - we check if it maps with the current
++ * one. If not - we reset.
++ */
++ if (tp->mptcp->mapping_present &&
++ (data_seq != (u32)tp->mptcp->map_data_seq ||
++ sub_seq != tp->mptcp->map_subseq ||
++ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
++ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
++ /* Mapping in packet is different from what we want */
++ pr_err("%s Mappings do not match!\n", __func__);
++ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
++ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
++ sub_seq, tp->mptcp->map_subseq, data_len,
++ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
++ tp->mptcp->map_data_fin);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* If the previous check was good, the current mapping is valid and we exit. */
++ if (tp->mptcp->mapping_present)
++ return 0;
++
++ /* Mapping not yet set on this subflow - we set it here! */
++
++ if (!data_len) {
++ mpcb->infinite_mapping_rcv = 1;
++ tp->mptcp->fully_established = 1;
++ /* We need to repeat mp_fail's until the sender fell
++ * back to infinite-mapping - here we stop repeating it.
++ */
++ tp->mptcp->send_mp_fail = 0;
++
++ /* We have to fixup data_len - it must be the same as skb->len */
++ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
++ sub_seq = tcb->seq;
++
++ /* TODO kill all other subflows than this one */
++ /* data_seq and so on are set correctly */
++
++ /* At this point, the meta-ofo-queue has to be emptied,
++ * as the following data is guaranteed to be in-order at
++ * the data and subflow-level
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ }
++
++ /* We are sending mp-fail's and thus are in fallback mode.
++ * Ignore packets which do not announce the fallback and still
++ * want to provide a mapping.
++ */
++ if (tp->mptcp->send_mp_fail) {
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++
++ /* FIN increased the mapping-length by 1 */
++ if (mptcp_is_data_fin(skb))
++ data_len--;
++
++ /* Subflow-sequences of the packet must be
++ * (at least partially) part of the DSS-mapping's
++ * subflow-sequence-space.
++ *
++ * Basically the mapping is not valid, if either of the
++ * following conditions is true:
++ *
++ * 1. It's not a data_fin and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq >= TCP-end_seq
++ *
++ * The previous two can be merged into:
++ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
++ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
++ *
++ * 3. It's a data_fin and skb->len == 0 and
++ * MPTCP-sub_seq > TCP-end_seq
++ *
++ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
++ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
++ *
++ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
++ */
++
++ /* subflow-fin is not part of the mapping - ignore it here! */
++ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
++ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
++ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
++ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
++ before(sub_seq, tp->copied_seq)) {
++ /* Subflow-sequences of packet is different from what is in the
++ * packet's dss-mapping. The peer is misbehaving - reset
++ */
++ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
++ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
++ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
++ skb->len, data_len, tp->copied_seq);
++ mptcp_send_reset(sk);
++ return 1;
++ }
++
++ /* Did the DSS have 64-bit seqnums? */
++ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
++ /* Wrapped around? */
++ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
++ } else {
++ /* Else, access the default high-order bits */
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
++ }
++ } else {
++ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
++
++ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
++ /* We make sure that the data_seq is invalid.
++ * It will be dropped later.
++ */
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ tp->mptcp->map_data_seq += 0xFFFFFFFF;
++ }
++ }
++
++ tp->mptcp->map_data_len = data_len;
++ tp->mptcp->map_subseq = sub_seq;
++ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
++ tp->mptcp->mapping_present = 1;
++
++ return 0;
++}
++
++/* Similar to tcp_sequence(...) */
++static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
++ u64 data_seq, u64 end_data_seq)
++{
++ const struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u64 rcv_wup64;
++
++ /* Wrap-around? */
++ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
++ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
++ meta_tp->rcv_wup;
++ } else {
++ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
++ meta_tp->rcv_wup);
++ }
++
++ return !before64(end_data_seq, rcv_wup64) &&
++ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * -1 this packet was broken - continue with the next one.
++ */
++static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sk_buff *tmp, *tmp1;
++ u32 tcp_end_seq;
++
++ if (!tp->mptcp->mapping_present)
++ return 0;
++
++ /* Either the new skb gave us the mapping and the first segment
++ * in the sub-rcv-queue has to be trimmed ...
++ */
++ tmp = skb_peek(&sk->sk_receive_queue);
++ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
++ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
++ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
++
++ /* ... or the new skb (tail) has to be split at the end. */
++ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
++ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
++ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
++ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
++ /* TODO : maybe handle this here better.
++ * We now just force meta-retransmission.
++ */
++ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
++ __skb_unlink(skb, &sk->sk_receive_queue);
++ __kfree_skb(skb);
++ return -1;
++ }
++ }
++
++ /* Now, remove old sk_buff's from the receive-queue.
++ * This may happen if the mapping has been lost for these segments and
++ * the next mapping has already been received.
++ */
++ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
++ break;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++
++ /* Impossible that we could free skb here, because its
++ * mapping is known to be valid from previous checks
++ */
++ __kfree_skb(tmp1);
++ }
++ }
++
++ return 0;
++}
++
++/* @return: 0 everything is fine. Just continue processing
++ * 1 subflow is broken stop everything
++ * -1 this mapping has been put in the meta-receive-queue
++ * -2 this mapping has been eaten by the application
++ */
++static int mptcp_queue_skb(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sk_buff *tmp, *tmp1;
++ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
++ bool data_queued = false;
++
++ /* Have we not yet received the full mapping? */
++ if (!tp->mptcp->mapping_present ||
++ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ return 0;
++
++ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
++ * OR
++ * This mapping is out of window
++ */
++ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
++ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
++ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++
++ mptcp_reset_mapping(tp);
++
++ return -1;
++ }
++
++ /* Record it, because we want to send our data_fin on the same path */
++ if (tp->mptcp->map_data_fin) {
++ mpcb->dfin_path_index = tp->mptcp->path_index;
++ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
++ }
++
++ /* Verify the checksum */
++ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
++ int ret = mptcp_verif_dss_csum(sk);
++
++ if (ret <= 0) {
++ mptcp_reset_mapping(tp);
++ return 1;
++ }
++ }
++
++ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
++ /* Segments have to go to the meta-ofo-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true later.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
++ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
++ else
++ __kfree_skb(tmp1);
++
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ tcp_enter_quickack_mode(sk);
++ } else {
++ /* Ready for the meta-rcv-queue */
++ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
++ int eaten = 0;
++ bool copied_early = false;
++ bool fragstolen = false;
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++
++ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_prepare_skb(tmp1, sk);
++ __skb_unlink(tmp1, &sk->sk_receive_queue);
++ /* MUST be done here, because fragstolen may be true.
++ * Then, kfree_skb_partial will not account the memory.
++ */
++ skb_orphan(tmp1);
++
++ /* This segment has already been received */
++ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
++ __kfree_skb(tmp1);
++ goto next;
++ }
++
++#ifdef CONFIG_NET_DMA
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ tmp1->len <= meta_tp->ucopy.len &&
++ sock_owned_by_user(meta_sk) &&
++ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
++ copied_early = true;
++ eaten = 1;
++ }
++#endif
++
++ /* Is direct copy possible ? */
++ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.task == current &&
++ meta_tp->copied_seq == meta_tp->rcv_nxt &&
++ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
++ !copied_early)
++ eaten = mptcp_direct_copy(tmp1, meta_sk);
++
++ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
++ eaten = 1;
++
++ if (!eaten)
++ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
++
++ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
++#endif
++
++ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
++ mptcp_fin(meta_sk);
++
++ /* Check if this fills a gap in the ofo queue */
++ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
++ mptcp_ofo_queue(meta_sk);
++
++#ifdef CONFIG_NET_DMA
++ if (copied_early)
++ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
++ tmp1);
++ else
++#endif
++ if (eaten)
++ kfree_skb_partial(tmp1, fragstolen);
++
++ data_queued = true;
++next:
++ if (!skb_queue_empty(&sk->sk_receive_queue) &&
++ !before(TCP_SKB_CB(tmp)->seq,
++ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
++ break;
++ }
++ }
++
++ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
++ mptcp_reset_mapping(tp);
++
++ return data_queued ? -1 : -2;
++}
++
++void mptcp_data_ready(struct sock *sk)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct sk_buff *skb, *tmp;
++ int queued = 0;
++
++ /* restart before the check, because mptcp_fin might have changed the
++ * state.
++ */
++restart:
++ /* If the meta cannot receive data, there is no point in pushing data.
++ * If we are in time-wait, we may still be waiting for the final FIN.
++ * So, we should proceed with the processing.
++ */
++ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
++ skb_queue_purge(&sk->sk_receive_queue);
++ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
++ goto exit;
++ }
++
++ /* Iterate over all segments, detect their mapping (if we don't have
++ * one yet), validate them and push everything one level higher.
++ */
++ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
++ int ret;
++ /* Pre-validation - e.g., early fallback */
++ ret = mptcp_prevalidate_skb(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Set the current mapping */
++ ret = mptcp_detect_mapping(sk, skb);
++ if (ret < 0)
++ goto restart;
++ else if (ret > 0)
++ break;
++
++ /* Validation */
++ if (mptcp_validate_mapping(sk, skb) < 0)
++ goto restart;
++
++ /* Push a level higher */
++ ret = mptcp_queue_skb(sk);
++ if (ret < 0) {
++ if (ret == -1)
++ queued = ret;
++ goto restart;
++ } else if (ret == 0) {
++ continue;
++ } else { /* ret == 1 */
++ break;
++ }
++ }
++
++exit:
++ if (tcp_sk(sk)->close_it) {
++ tcp_send_ack(sk);
++ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
++ }
++
++ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
++ meta_sk->sk_data_ready(meta_sk);
++}
++
++
++int mptcp_check_req(struct sk_buff *skb, struct net *net)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct sock *meta_sk = NULL;
++
++ /* MPTCP structures not initialized */
++ if (mptcp_init_failed)
++ return 0;
++
++ if (skb->protocol == htons(ETH_P_IP))
++ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr, net);
++#if IS_ENABLED(CONFIG_IPV6)
++ else /* IPv6 */
++ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, net);
++#endif /* CONFIG_IPV6 */
++
++ if (!meta_sk)
++ return 0;
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_search_req */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
++ return 1;
++}
++
++struct mp_join *mptcp_find_join(const struct sk_buff *skb)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether JOIN is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return NULL;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return NULL;
++ if (opsize > length)
++ return NULL; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
++ return (struct mp_join *)(ptr - 2);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return NULL;
++}
++
++int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
++{
++ const struct mptcp_cb *mpcb;
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++ struct mp_join *join_opt = mptcp_find_join(skb);
++ if (!join_opt)
++ return 0;
++
++ /* MPTCP structures were not initialized, so return error */
++ if (mptcp_init_failed)
++ return -1;
++
++ token = join_opt->u.syn.token;
++ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ mpcb = tcp_sk(meta_sk)->mpcb;
++ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
++ /* We are in fallback-mode on the reception-side -
++ * no new subflows!
++ */
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ /* Coming from time-wait-sock processing in tcp_v4_rcv.
++ * We have to deschedule it before continuing, because otherwise
++ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
++ */
++ if (tw) {
++ inet_twsk_deschedule(tw, &tcp_death_row);
++ inet_twsk_put(tw);
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock_nested(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
++ bh_unlock_sock(meta_sk);
++ NET_INC_STATS_BH(sock_net(meta_sk),
++ LINUX_MIB_TCPBACKLOGDROP);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ kfree_skb(skb);
++ return 1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else {
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 1;
++}
++
++int mptcp_do_join_short(struct sk_buff *skb,
++ const struct mptcp_options_received *mopt,
++ struct net *net)
++{
++ struct sock *meta_sk;
++ u32 token;
++ bool meta_v4;
++
++ token = mopt->mptcp_rem_token;
++ meta_sk = mptcp_hash_find(net, token);
++ if (!meta_sk) {
++ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
++ return -1;
++ }
++
++ meta_v4 = meta_sk->sk_family == AF_INET;
++ if (meta_v4) {
++ if (skb->protocol == htons(ETH_P_IPV6)) {
++ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++ } else if (skb->protocol == htons(ETH_P_IP) &&
++ inet6_sk(meta_sk)->ipv6only) {
++ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
++
++ /* OK, this is a new syn/join, let's create a new open request and
++ * send syn+ack
++ */
++ bh_lock_sock(meta_sk);
++
++ /* This check is also done in mptcp_vX_do_rcv. But, there we cannot
++ * call tcp_vX_send_reset, because we hold already two socket-locks.
++ * (the listener and the meta from above)
++ *
++ * And the send-reset will try to take yet another one (ip_send_reply).
++ * Thus, we propagate the reset up to tcp_rcv_state_process.
++ */
++ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
++ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
++ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return -1;
++ }
++
++ if (sock_owned_by_user(meta_sk)) {
++ skb->sk = meta_sk;
++ if (unlikely(sk_add_backlog(meta_sk, skb,
++ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
++ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
++ else
++ /* Must make sure that upper layers won't free the
++ * skb if it is added to the backlog-queue.
++ */
++ skb_get(skb);
++ } else {
++ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
++ * the skb will finally be freed by tcp_v4_do_rcv (where we are
++ * coming from)
++ */
++ skb_get(skb);
++ if (skb->protocol == htons(ETH_P_IP)) {
++ tcp_v4_do_rcv(meta_sk, skb);
++#if IS_ENABLED(CONFIG_IPV6)
++ } else { /* IPv6 */
++ tcp_v6_do_rcv(meta_sk, skb);
++#endif /* CONFIG_IPV6 */
++ }
++ }
++
++ bh_unlock_sock(meta_sk);
++ sock_put(meta_sk); /* Taken by mptcp_hash_find */
++ return 0;
++}
++
++/**
++ * Equivalent of tcp_fin() for MPTCP.
++ * Must only be called once the FIN is validly part of the data
++ * sequence-number space, i.e. not while there are still holes before it.
++ */
++void mptcp_fin(struct sock *meta_sk)
++{
++ struct sock *sk = NULL, *sk_it;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
++ sk = sk_it;
++ break;
++ }
++ }
++
++ if (!sk || sk->sk_state == TCP_CLOSE)
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ inet_csk_schedule_ack(sk);
++
++ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
++ sock_set_flag(meta_sk, SOCK_DONE);
++
++ switch (meta_sk->sk_state) {
++ case TCP_SYN_RECV:
++ case TCP_ESTABLISHED:
++ /* Move to CLOSE_WAIT */
++ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
++ inet_csk(sk)->icsk_ack.pingpong = 1;
++ break;
++
++ case TCP_CLOSE_WAIT:
++ case TCP_CLOSING:
++ /* Received a retransmission of the FIN, do
++ * nothing.
++ */
++ break;
++ case TCP_LAST_ACK:
++ /* RFC793: Remain in the LAST-ACK state. */
++ break;
++
++ case TCP_FIN_WAIT1:
++ /* This case occurs when a simultaneous close
++ * happens, we must ack the received FIN and
++ * enter the CLOSING state.
++ */
++ tcp_send_ack(sk);
++ tcp_set_state(meta_sk, TCP_CLOSING);
++ break;
++ case TCP_FIN_WAIT2:
++ /* Received a FIN -- send ACK and enter TIME_WAIT. */
++ tcp_send_ack(sk);
++ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
++ break;
++ default:
++ /* Only TCP_LISTEN and TCP_CLOSE are left; in these
++ * cases we should never reach this piece of code.
++ */
++ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
++ meta_sk->sk_state);
++ break;
++ }
++
++ /* It _is_ possible, that we have something out-of-order _after_ FIN.
++ * Probably, we should reset in this case. For now drop them.
++ */
++ mptcp_purge_ofo_queue(meta_tp);
++ sk_mem_reclaim(meta_sk);
++
++ if (!sock_flag(meta_sk, SOCK_DEAD)) {
++ meta_sk->sk_state_change(meta_sk);
++
++ /* Do not send POLL_HUP for half duplex close. */
++ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
++ meta_sk->sk_state == TCP_CLOSE)
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
++ else
++ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
++ }
++
++ return;
++}
++
++static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ if (!meta_tp->packets_out)
++ return;
++
++ tcp_for_write_queue(skb, meta_sk) {
++ if (skb == tcp_send_head(meta_sk))
++ break;
++
++ if (mptcp_retransmit_skb(meta_sk, skb))
++ return;
++
++ if (skb == tcp_write_queue_head(meta_sk))
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ inet_csk(meta_sk)->icsk_rto,
++ TCP_RTO_MAX);
++ }
++}
++
++/* Handle the DATA_ACK */
++static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
++{
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ u32 prior_snd_una = meta_tp->snd_una;
++ int prior_packets;
++ u32 nwin, data_ack, data_seq;
++ u16 data_len = 0;
++
++ /* A valid packet came in - subflow is operational again */
++ tp->pf = 0;
++
++ /* Even if there is no data-ack, we stop retransmitting.
++ * The exception is a SYN/ACK, which is just a retransmission.
++ */
++ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ }
++
++ /* If we are in infinite mapping mode, rx_opt.data_ack has been
++ * set by mptcp_clean_rtx_infinite.
++ */
++ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
++ goto exit;
++
++ data_ack = tp->mptcp->rx_opt.data_ack;
++
++ if (unlikely(!tp->mptcp->fully_established) &&
++ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
++ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
++ * includes a data-ack, we are fully established
++ */
++ mptcp_become_fully_estab(sk);
++
++ /* Get the data_seq */
++ if (mptcp_is_data_seq(skb)) {
++ data_seq = tp->mptcp->rx_opt.data_seq;
++ data_len = tp->mptcp->rx_opt.data_len;
++ } else {
++ data_seq = meta_tp->snd_wl1;
++ }
++
++ /* If the ack is older than previous acks
++ * then we can probably ignore it.
++ */
++ if (before(data_ack, prior_snd_una))
++ goto exit;
++
++ /* If the ack includes data we haven't sent yet, discard
++ * this segment (RFC793 Section 3.9).
++ */
++ if (after(data_ack, meta_tp->snd_nxt))
++ goto exit;
++
++ /*** Now, update the window - inspired by tcp_ack_update_window ***/
++ nwin = ntohs(tcp_hdr(skb)->window);
++
++ if (likely(!tcp_hdr(skb)->syn))
++ nwin <<= tp->rx_opt.snd_wscale;
++
++ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
++ tcp_update_wl(meta_tp, data_seq);
++
++ /* Draft v09, Section 3.3.5:
++ * [...] It should only update its local receive window values
++ * when the largest sequence number allowed (i.e. DATA_ACK +
++ * receive window) increases. [...]
++ */
++ if (meta_tp->snd_wnd != nwin &&
++ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
++ meta_tp->snd_wnd = nwin;
++
++ if (nwin > meta_tp->max_window)
++ meta_tp->max_window = nwin;
++ }
++ }
++ /*** Done, update the window ***/
++
++ /* We passed data and got it acked, remove any soft error
++ * log. Something worked...
++ */
++ sk->sk_err_soft = 0;
++ inet_csk(meta_sk)->icsk_probes_out = 0;
++ meta_tp->rcv_tstamp = tcp_time_stamp;
++ prior_packets = meta_tp->packets_out;
++ if (!prior_packets)
++ goto no_queue;
++
++ meta_tp->snd_una = data_ack;
++
++ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
++
++ /* We are in loss-state, and something got acked, retransmit the whole
++ * queue now!
++ */
++ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
++ after(data_ack, prior_snd_una)) {
++ mptcp_xmit_retransmit_queue(meta_sk);
++ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
++ }
++
++ /* Simplified version of tcp_new_space, because the snd-buffer
++ * is handled by all the subflows.
++ */
++ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
++ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
++ if (meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ meta_sk->sk_write_space(meta_sk);
++ }
++
++ if (meta_sk->sk_state != TCP_ESTABLISHED &&
++ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
++ return;
++
++exit:
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++
++no_queue:
++ if (tcp_send_head(meta_sk))
++ tcp_ack_probe(meta_sk);
++
++ mptcp_push_pending_frames(meta_sk);
++
++ return;
++}
++
++void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
++
++ if (!tp->mpcb->infinite_mapping_snd)
++ return;
++
++ /* The difference between both write_seq's represents the offset between
++ * data-sequence and subflow-sequence. As we are infinite, this must
++ * match.
++ *
++ * Thus, from this difference we can infer the meta snd_una.
++ */
++ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
++ tp->snd_una;
++
++ mptcp_data_ack(sk, skb);
++}
++
++/* Static functions used by mptcp_parse_options */
++
++static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
++{
++ struct sock *sk_it, *tmpsk;
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
++ mptcp_reinject_data(sk_it, 0);
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
++ GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++}
++
++void mptcp_parse_options(const uint8_t *ptr, int opsize,
++ struct mptcp_options_received *mopt,
++ const struct sk_buff *skb)
++{
++ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
++
++ /* If the socket is mp-capable we would have a mopt. */
++ if (!mopt)
++ return;
++
++ switch (mp_opt->sub) {
++ case MPTCP_SUB_CAPABLE:
++ {
++ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
++ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
++ mptcp_debug("%s: mp_capable: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (!sysctl_mptcp_enabled)
++ break;
++
++ /* We only support MPTCP version 0 */
++ if (mpcapable->ver != 0)
++ break;
++
++ /* MPTCP-RFC 6824:
++ * "If receiving a message with the 'B' flag set to 1, and this
++ * is not understood, then this SYN MUST be silently ignored;
++ */
++ if (mpcapable->b) {
++ mopt->drop_me = 1;
++ break;
++ }
++
++ /* MPTCP-RFC 6824:
++ * "An implementation that only supports this method MUST set
++ * bit "H" to 1, and bits "C" through "G" to 0."
++ */
++ if (!mpcapable->h)
++ break;
++
++ mopt->saw_mpc = 1;
++ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
++
++ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
++ mopt->mptcp_key = mpcapable->sender_key;
++
++ break;
++ }
++ case MPTCP_SUB_JOIN:
++ {
++ const struct mp_join *mpjoin = (struct mp_join *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
++ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
++ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
++ mptcp_debug("%s: mp_join: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* saw_mpc must be set, because in tcp_check_req we assume that
++ * it is set to support falling back to reg. TCP if a rexmitted
++ * SYN has no MP_CAPABLE or MP_JOIN
++ */
++ switch (opsize) {
++ case MPTCP_SUB_LEN_JOIN_SYN:
++ mopt->is_mp_join = 1;
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_rem_token = mpjoin->u.syn.token;
++ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_SYNACK:
++ mopt->saw_mpc = 1;
++ mopt->low_prio = mpjoin->b;
++ mopt->rem_id = mpjoin->addr_id;
++ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
++ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
++ break;
++ case MPTCP_SUB_LEN_JOIN_ACK:
++ mopt->saw_mpc = 1;
++ mopt->join_ack = 1;
++ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
++ break;
++ }
++ break;
++ }
++ case MPTCP_SUB_DSS:
++ {
++ const struct mp_dss *mdss = (struct mp_dss *)ptr;
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++
++ /* We check opsize for the csum and non-csum case. We do this,
++ * because the draft says that the csum SHOULD be ignored if
++ * it has not been negotiated in the MP_CAPABLE but still is
++ * present in the data.
++ *
++ * It will get ignored later in mptcp_queue_skb.
++ */
++ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
++ opsize != mptcp_sub_len_dss(mdss, 1)) {
++ mptcp_debug("%s: mp_dss: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ ptr += 4;
++
++ if (mdss->A) {
++ tcb->mptcp_flags |= MPTCPHDR_ACK;
++
++ if (mdss->a) {
++ mopt->data_ack = (u32) get_unaligned_be64(ptr);
++ ptr += MPTCP_SUB_LEN_ACK_64;
++ } else {
++ mopt->data_ack = get_unaligned_be32(ptr);
++ ptr += MPTCP_SUB_LEN_ACK;
++ }
++ }
++
++ tcb->dss_off = (ptr - skb_transport_header(skb));
++
++ if (mdss->M) {
++ if (mdss->m) {
++ u64 data_seq64 = get_unaligned_be64(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
++ mopt->data_seq = (u32) data_seq64;
++
++ ptr += 12; /* 64-bit dseq + subseq */
++ } else {
++ mopt->data_seq = get_unaligned_be32(ptr);
++ ptr += 8; /* 32-bit dseq + subseq */
++ }
++ mopt->data_len = get_unaligned_be16(ptr);
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ /* Is a check-sum present? */
++ if (opsize == mptcp_sub_len_dss(mdss, 1))
++ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
++
++ /* DATA_FIN only possible with DSS-mapping */
++ if (mdss->F)
++ tcb->mptcp_flags |= MPTCPHDR_FIN;
++ }
++
++ break;
++ }
++ case MPTCP_SUB_ADD_ADDR:
++ {
++#if IS_ENABLED(CONFIG_IPV6)
++ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
++#endif /* CONFIG_IPV6 */
++ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ /* We have to manually parse the options if we got two of them. */
++ if (mopt->saw_add_addr) {
++ mopt->more_add_addr = 1;
++ break;
++ }
++ mopt->saw_add_addr = 1;
++ mopt->add_addr_ptr = ptr;
++ break;
++ }
++ case MPTCP_SUB_REMOVE_ADDR:
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
++ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ if (mopt->saw_rem_addr) {
++ mopt->more_rem_addr = 1;
++ break;
++ }
++ mopt->saw_rem_addr = 1;
++ mopt->rem_addr_ptr = ptr;
++ break;
++ case MPTCP_SUB_PRIO:
++ {
++ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ if (opsize != MPTCP_SUB_LEN_PRIO &&
++ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
++ mptcp_debug("%s: mp_prio: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->saw_low_prio = 1;
++ mopt->low_prio = mpprio->b;
++
++ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
++ mopt->saw_low_prio = 2;
++ mopt->prio_addr_id = mpprio->addr_id;
++ }
++ break;
++ }
++ case MPTCP_SUB_FAIL:
++ if (opsize != MPTCP_SUB_LEN_FAIL) {
++ mptcp_debug("%s: mp_fail: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++ mopt->mp_fail = 1;
++ break;
++ case MPTCP_SUB_FCLOSE:
++ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
++ mptcp_debug("%s: mp_fclose: bad option size %d\n",
++ __func__, opsize);
++ break;
++ }
++
++ mopt->mp_fclose = 1;
++ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
++
++ break;
++ default:
++ mptcp_debug("%s: Received unknown subtype: %d\n",
++ __func__, mp_opt->sub);
++ break;
++ }
++}
++
++/** Parse only MPTCP options */
++void tcp_parse_mptcp_options(const struct sk_buff *skb,
++ struct mptcp_options_received *mopt)
++{
++ const struct tcphdr *th = tcp_hdr(skb);
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++ const unsigned char *ptr = (const unsigned char *)(th + 1);
++
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2) /* "silly options" */
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP)
++ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
++ }
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++}
++
++int mptcp_check_rtt(const struct tcp_sock *tp, int time)
++{
++ struct mptcp_cb *mpcb = tp->mpcb;
++ struct sock *sk;
++ u32 rtt_max = 0;
++
++ /* In MPTCP, we take the max delay across all flows,
++ * in order to take into account meta-reordering buffers.
++ */
++ mptcp_for_each_sk(mpcb, sk) {
++ if (!mptcp_sk_can_recv(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
++ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
++ }
++ if (time < (rtt_max >> 3) || !rtt_max)
++ return 1;
++
++ return 0;
++}
++
++static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ __be16 port = 0;
++ union inet_addr addr;
++ sa_family_t family;
++
++ if (mpadd->ipver == 4) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++ port = mpadd->u.v4.port;
++ family = AF_INET;
++ addr.in = mpadd->u.v4.addr;
++#if IS_ENABLED(CONFIG_IPV6)
++ } else if (mpadd->ipver == 6) {
++ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
++ port = mpadd->u.v6.port;
++ family = AF_INET6;
++ addr.in6 = mpadd->u.v6.addr;
++#endif /* CONFIG_IPV6 */
++ } else {
++ return;
++ }
++
++ if (mpcb->pm_ops->add_raddr)
++ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
++}
++
++static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
++{
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ int i;
++ u8 rem_id;
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
++ rem_id = (&mprem->addrs_id)[i];
++
++ if (mpcb->pm_ops->rem_raddr)
++ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
++ mptcp_send_reset_rem_id(mpcb, rem_id);
++ }
++}
++
++static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
++{
++ struct tcphdr *th = tcp_hdr(skb);
++ unsigned char *ptr;
++ int length = (th->doff * 4) - sizeof(struct tcphdr);
++
++ /* Jump through the options to check whether ADD_ADDR is there */
++ ptr = (unsigned char *)(th + 1);
++ while (length > 0) {
++ int opcode = *ptr++;
++ int opsize;
++
++ switch (opcode) {
++ case TCPOPT_EOL:
++ return;
++ case TCPOPT_NOP:
++ length--;
++ continue;
++ default:
++ opsize = *ptr++;
++ if (opsize < 2)
++ return;
++ if (opsize > length)
++ return; /* don't parse partial options */
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
++ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
++#else
++ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
++ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
++#endif /* CONFIG_IPV6 */
++ goto cont;
++
++ mptcp_handle_add_addr(ptr, sk);
++ }
++ if (opcode == TCPOPT_MPTCP &&
++ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
++ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
++ goto cont;
++
++ mptcp_handle_rem_addr(ptr, sk);
++ }
++cont:
++ ptr += opsize - 2;
++ length -= opsize;
++ }
++ }
++ return;
++}
++
++static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
++{
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ if (unlikely(mptcp->rx_opt.mp_fail)) {
++ mptcp->rx_opt.mp_fail = 0;
++
++ if (!th->rst && !mpcb->infinite_mapping_snd) {
++ struct sock *sk_it;
++
++ mpcb->send_infinite_mapping = 1;
++ /* We resend everything that has not been acknowledged */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++
++ /* We artificially restart the whole send-queue. Thus,
++ * it is as if no packets are in flight
++ */
++ tcp_sk(meta_sk)->packets_out = 0;
++
++ /* If the snd_nxt already wrapped around, we have to
++ * undo the wrapping, as we are restarting from snd_una
++ * on.
++ */
++ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
++ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
++ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
++ }
++ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
++
++ /* Trigger a sending on the meta. */
++ mptcp_push_pending_frames(meta_sk);
++
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (sk != sk_it)
++ mptcp_sub_force_close(sk_it);
++ }
++ }
++
++ return 0;
++ }
++
++ if (unlikely(mptcp->rx_opt.mp_fclose)) {
++ struct sock *sk_it, *tmpsk;
++
++ mptcp->rx_opt.mp_fclose = 0;
++ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
++ return 0;
++
++ if (tcp_need_reset(sk->sk_state))
++ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
++ mptcp_sub_force_close(sk_it);
++
++ tcp_reset(meta_sk);
++
++ return 1;
++ }
++
++ return 0;
++}
++
++static inline void mptcp_path_array_check(struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++
++ if (unlikely(mpcb->list_rcvd)) {
++ mpcb->list_rcvd = 0;
++ if (mpcb->pm_ops->new_remote_address)
++ mpcb->pm_ops->new_remote_address(meta_sk);
++ }
++}
++
++int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
++
++ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
++ return 0;
++
++ if (mptcp_mp_fail_rcvd(sk, th))
++ return 1;
++
++ /* RFC 6824, Section 3.3:
++ * If a checksum is not present when its use has been negotiated, the
++ * receiver MUST close the subflow with a RST as it is considered broken.
++ */
++ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
++ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
++ if (tcp_need_reset(sk->sk_state))
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* We have to acknowledge retransmissions of the third
++ * ack.
++ */
++ if (mopt->join_ack) {
++ tcp_send_delayed_ack(sk);
++ mopt->join_ack = 0;
++ }
++
++ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
++ if (mopt->more_add_addr || mopt->more_rem_addr) {
++ mptcp_parse_addropt(skb, sk);
++ } else {
++ if (mopt->saw_add_addr)
++ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
++ if (mopt->saw_rem_addr)
++ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
++ }
++
++ mopt->more_add_addr = 0;
++ mopt->saw_add_addr = 0;
++ mopt->more_rem_addr = 0;
++ mopt->saw_rem_addr = 0;
++ }
++ if (mopt->saw_low_prio) {
++ if (mopt->saw_low_prio == 1) {
++ tp->mptcp->rcv_low_prio = mopt->low_prio;
++ } else {
++ struct sock *sk_it;
++ mptcp_for_each_sk(tp->mpcb, sk_it) {
++ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
++ if (mptcp->rem_id == mopt->prio_addr_id)
++ mptcp->rcv_low_prio = mopt->low_prio;
++ }
++ }
++ mopt->saw_low_prio = 0;
++ }
++
++ mptcp_data_ack(sk, skb);
++
++ mptcp_path_array_check(mptcp_meta_sk(sk));
++ /* Socket may have been mp_killed by a REMOVE_ADDR */
++ if (tp->mp_killed)
++ return 1;
++
++ return 0;
++}
++
++/* In case of fastopen, some data can already be in the write queue.
++ * We need to update the sequence number of the segments as they
++ * were initially TCP sequence numbers.
++ */
++static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
++ struct sk_buff *skb;
++ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
++
++ /* There should only be one skb in the write queue: the data not
++ * acknowledged in the SYN+ACK. In this case, we need to map
++ * this data to data sequence numbers.
++ */
++ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
++ /* If the server only acknowledges partially the data sent in
++ * the SYN, we need to trim the acknowledged part because
++ * we don't want to retransmit this already received data.
++ * When we reach this point, tcp_ack() has already cleaned up
++ * fully acked segments. However, tcp trims partially acked
++ * segments only when retransmitting. Since MPTCP comes into
++ * play only now, we will fake an initial transmit, and
++ * retransmit_skb() will not be called. The following fragment
++ * comes from __tcp_retransmit_skb().
++ */
++ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
++ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
++ master_tp->snd_una));
++ /* tcp_trim_head() can only return -ENOMEM if the skb is
++ * cloned, which is not the case here (see
++ * tcp_send_syn_data()).
++ */
++ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
++ TCP_SKB_CB(skb)->seq));
++ }
++
++ TCP_SKB_CB(skb)->seq += new_mapping;
++ TCP_SKB_CB(skb)->end_seq += new_mapping;
++ }
++
++ /* We can advance write_seq by the number of bytes unacknowledged
++ * and that were mapped in the previous loop.
++ */
++ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
++
++ /* The packets from the master_sk will be handed over to it later.
++ * Until that time, its write queue is empty, and
++ * write_seq must align with snd_una.
++ */
++ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
++ master_tp->packets_out = 0;
++
++ /* Although this data has already been sent over the subsk,
++ * it has never been sent over the meta_sk, so we rewind
++ * the send_head so that tcp considers it an initial send
++ * (instead of a retransmit).
++ */
++ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
++}
++
++/* The skptr is needed, because if we become MPTCP-capable, we have to switch
++ * from meta-socket to master-socket.
++ *
++ * @return: 1 - we want to reset this connection
++ * 2 - we want to discard the received syn/ack
++ * 0 - everything is fine - continue
++ */
++int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
++ const struct sk_buff *skb,
++ const struct mptcp_options_received *mopt)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (mptcp(tp)) {
++ u8 hash_mac_check[20];
++ struct mptcp_cb *mpcb = tp->mpcb;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u32 *)hash_mac_check);
++ if (memcmp(hash_mac_check,
++ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
++ mptcp_sub_force_close(sk);
++ return 1;
++ }
++
++ /* Set this flag in order to postpone data sending
++ * until the 4th ack arrives.
++ */
++ tp->mptcp->pre_established = 1;
++ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
++
++ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
++ (u8 *)&mpcb->mptcp_rem_key,
++ (u8 *)&tp->mptcp->mptcp_loc_nonce,
++ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
++ (u32 *)&tp->mptcp->sender_mac[0]);
++
++ } else if (mopt->saw_mpc) {
++ struct sock *meta_sk = sk;
++
++ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
++ ntohs(tcp_hdr(skb)->window)))
++ return 2;
++
++ sk = tcp_sk(sk)->mpcb->master_sk;
++ *skptr = sk;
++ tp = tcp_sk(sk);
++
++ /* If fastopen was used data might be in the send queue. We
++ * need to update their sequence number to MPTCP-level seqno.
++ * Note that it can happen in rare cases that fastopen_req is
++ * NULL and syn_data is 0 but fastopen indeed occurred and
++ * data has been queued in the write queue (but not sent).
++ * Example of such rare cases: connect is non-blocking and
++ * TFO is configured to work without cookies.
++ */
++ if (!skb_queue_empty(&meta_sk->sk_write_queue))
++ mptcp_rcv_synsent_fastopen(meta_sk);
++
++ /* -1, because the SYN consumed 1 byte. In case of TFO, we
++ * start the subflow-sequence number as if the data of the SYN
++ * is not part of any mapping.
++ */
++ tp->mptcp->snt_isn = tp->snd_una - 1;
++ tp->mpcb->dss_csum = mopt->dss_csum;
++ tp->mptcp->include_mpc = 1;
++
++ /* Ensure that fastopen is handled at the meta-level. */
++ tp->fastopen_req = NULL;
++
++ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
++ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
++
++ /* hold in sk_clone_lock due to initialization to 2 */
++ sock_put(sk);
++ } else {
++ tp->request_mptcp = 0;
++
++ if (tp->inside_tk_table)
++ mptcp_hash_remove(tp);
++ }
++
++ if (mptcp(tp))
++ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++bool mptcp_should_expand_sndbuf(const struct sock *sk)
++{
++ const struct sock *sk_it;
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int cnt_backups = 0;
++ int backup_available = 0;
++
++ /* We circumvent this check in tcp_check_space, because we want to
++ * always call sk_write_space. So, we reproduce the check here.
++ */
++ if (!meta_sk->sk_socket ||
++ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
++ return false;
++
++ /* If the user specified a specific send buffer setting, do
++ * not modify it.
++ */
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return false;
++
++ /* If we are under global TCP memory pressure, do not expand. */
++ if (sk_under_memory_pressure(meta_sk))
++ return false;
++
++ /* If we are under soft global TCP memory pressure, do not expand. */
++ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
++ return false;
++
++
++ /* For MPTCP we look for a subsocket that could send data.
++ * If we found one, then we update the send-buffer.
++ */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ /* Backup-flows have to be counted - if there is no other
++ * subflow we take the backup-flow into account.
++ */
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
++ cnt_backups++;
++
++ if (tp_it->packets_out < tp_it->snd_cwnd) {
++ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
++ backup_available = 1;
++ continue;
++ }
++ return true;
++ }
++ }
++
++ /* Backup-flow is available for sending - update send-buffer */
++ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
++ return true;
++ return false;
++}
++
++void mptcp_init_buffer_space(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ int space;
++
++ tcp_init_buffer_space(sk);
++
++ if (is_master_tp(tp)) {
++ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
++ meta_tp->rcvq_space.time = tcp_time_stamp;
++ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
++
++ /* If there is only one subflow, we just use regular TCP
++ * autotuning. User-locks are handled already by
++ * tcp_init_buffer_space
++ */
++ meta_tp->window_clamp = tp->window_clamp;
++ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
++ meta_sk->sk_sndbuf = sk->sk_sndbuf;
++
++ return;
++ }
++
++ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
++ goto snd_buf;
++
++ /* Adding a new subflow to the rcv-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
++ if (space > meta_sk->sk_rcvbuf) {
++ meta_tp->window_clamp += tp->window_clamp;
++ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
++ meta_sk->sk_rcvbuf = space;
++ }
++
++snd_buf:
++ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
++ return;
++
++ /* Adding a new subflow to the send-buffer space. We make a simple
++ * addition, to give some space to allow traffic on the new subflow.
++ * Autotuning will increase it further later on.
++ */
++ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
++ if (space > meta_sk->sk_sndbuf) {
++ meta_sk->sk_sndbuf = space;
++ meta_sk->sk_write_space(meta_sk);
++ }
++}
++
++void mptcp_tcp_set_rto(struct sock *sk)
++{
++ tcp_set_rto(sk);
++ mptcp_set_rto(sk);
++}
+diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
+new file mode 100644
+index 000000000000..1183d1305d35
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv4.c
+@@ -0,0 +1,483 @@
++/*
++ * MPTCP implementation - IPv4-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/ip.h>
++#include <linux/list.h>
++#include <linux/skbuff.h>
++#include <linux/spinlock.h>
++#include <linux/tcp.h>
++
++#include <net/inet_common.h>
++#include <net/inet_connection_sock.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/request_sock.h>
++#include <net/tcp.h>
++
++u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
++{
++ u32 hash[MD5_DIGEST_WORDS];
++
++ hash[0] = (__force u32)saddr;
++ hash[1] = (__force u32)daddr;
++ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
++ hash[3] = mptcp_seed++;
++
++ md5_transform(hash, mptcp_secret);
++
++ return *((u64 *)hash);
++}
++
++
++static void mptcp_v4_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v4_reqsk_destructor(req);
++}
++
++static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
++ ip_hdr(skb)->daddr,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.ip = inet_rsk(req)->ir_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp_request_sock_ops */
++struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
++ .family = PF_INET,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_rtx_synack,
++ .send_ack = tcp_v4_reqsk_send_ack,
++ .destructor = mptcp_v4_reqsk_destructor,
++ .send_reset = tcp_v4_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++/* Similar to tcp_v4_conn_request */
++static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp_request_sock_ops,
++ &mptcp_join_request_sock_ipv4_ops,
++ meta_sk, skb);
++}
++
++/* We only process join requests here. (either the SYN or the final ACK) */
++int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct sock *sk;
++
++ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
++ iph->saddr, th->source, iph->daddr,
++ th->dest, inet_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk - found the meta instead!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v4_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v4_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v4_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ const struct iphdr *iph = ip_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet_csk_search_req(meta_sk, &prev, th->source,
++ iph->saddr, iph->daddr);
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v4_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
++ const __be32 laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (ireq->ir_rmt_port == rport &&
++ ireq->ir_rmt_addr == raddr &&
++ ireq->ir_loc_addr == laddr &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv4 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
++ struct mptcp_rem4 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin_family = AF_INET;
++ rem_in.sin_family = AF_INET;
++ loc_in.sin_port = 0;
++ if (rem->port)
++ rem_in.sin_port = rem->port;
++ else
++ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin_addr = loc->addr;
++ rem_in.sin_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin_addr,
++ ntohs(loc_in.sin_port), &rem_in.sin_addr,
++ ntohs(rem_in.sin_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init4_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v4_specific = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v4_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ip_setsockopt,
++ .getsockopt = ip_getsockopt,
++ .addr2sockaddr = inet_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in),
++ .bind_conflict = inet_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ip_setsockopt,
++ .compat_getsockopt = compat_ip_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
++
++/* General initialization of IPv4 for MPTCP */
++int mptcp_pm_v4_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp_request_sock_ops;
++
++ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
++
++ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
++ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
++ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v4_undo(void)
++{
++ kmem_cache_destroy(mptcp_request_sock_ops.slab);
++ kfree(mptcp_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
+new file mode 100644
+index 000000000000..1036973aa855
+--- /dev/null
++++ b/net/mptcp/mptcp_ipv6.c
+@@ -0,0 +1,518 @@
++/*
++ * MPTCP implementation - IPv6-specific functions
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/export.h>
++#include <linux/in6.h>
++#include <linux/kernel.h>
++
++#include <net/addrconf.h>
++#include <net/flow.h>
++#include <net/inet6_connection_sock.h>
++#include <net/inet6_hashtables.h>
++#include <net/inet_common.h>
++#include <net/ipv6.h>
++#include <net/ip6_checksum.h>
++#include <net/ip6_route.h>
++#include <net/mptcp.h>
++#include <net/mptcp_v6.h>
++#include <net/tcp.h>
++#include <net/transp_v6.h>
++
++__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return hash[0];
++}
++
++u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
++ __be16 sport, __be16 dport)
++{
++ u32 secret[MD5_MESSAGE_BYTES / 4];
++ u32 hash[MD5_DIGEST_WORDS];
++ u32 i;
++
++ memcpy(hash, saddr, 16);
++ for (i = 0; i < 4; i++)
++ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
++ secret[4] = mptcp_secret[4] +
++ (((__force u16)sport << 16) + (__force u16)dport);
++ secret[5] = mptcp_seed++;
++ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
++ secret[i] = mptcp_secret[i];
++
++ md5_transform(hash, secret);
++
++ return *((u64 *)hash);
++}
++
++static void mptcp_v6_reqsk_destructor(struct request_sock *req)
++{
++ mptcp_reqsk_destructor(req);
++
++ tcp_v6_reqsk_destructor(req);
++}
++
++static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++ mptcp_reqsk_init(req, skb);
++
++ return 0;
++}
++
++static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
++ struct sk_buff *skb)
++{
++ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++ union inet_addr addr;
++ int loc_id;
++ bool low_prio = false;
++
++ /* We need to do this as early as possible. Because, if we fail later
++ * (e.g., get_local_id), then reqsk_free tries to remove the
++ * request-socket from the htb in mptcp_hash_request_remove as pprev
++ * may be different from NULL.
++ */
++ mtreq->hash_entry.pprev = NULL;
++
++ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
++
++ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
++ ipv6_hdr(skb)->daddr.s6_addr32,
++ tcp_hdr(skb)->source,
++ tcp_hdr(skb)->dest);
++ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
++ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
++ if (loc_id == -1)
++ return -1;
++ mtreq->loc_id = loc_id;
++ mtreq->low_prio = low_prio;
++
++ mptcp_join_reqsk_init(mpcb, req, skb);
++
++ return 0;
++}
++
++/* Similar to tcp6_request_sock_ops */
++struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
++ .family = AF_INET6,
++ .obj_size = sizeof(struct mptcp_request_sock),
++ .rtx_syn_ack = tcp_v6_rtx_synack,
++ .send_ack = tcp_v6_reqsk_send_ack,
++ .destructor = mptcp_v6_reqsk_destructor,
++ .send_reset = tcp_v6_send_reset,
++ .syn_ack_timeout = tcp_syn_ack_timeout,
++};
++
++static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
++ struct request_sock *req,
++ const unsigned long timeout)
++{
++ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ 0, MPTCP_HASH_SIZE);
++ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
++ * want to reset the keepalive-timer (responsible for retransmitting
++ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
++ * overload the keepalive timer. Also, it's not a big deal, because the
++ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
++ * if the third ACK gets lost, the client will handle the retransmission
++ * anyways. If our SYN/ACK gets lost, the client will retransmit the
++ * SYN.
++ */
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
++ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
++ inet_rsk(req)->ir_rmt_port,
++ lopt->hash_rnd, lopt->nr_table_entries);
++
++ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
++ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
++ mptcp_reset_synack_timer(meta_sk, timeout);
++
++ rcu_read_lock();
++ spin_lock(&mptcp_reqsk_hlock);
++ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
++ spin_unlock(&mptcp_reqsk_hlock);
++ rcu_read_unlock();
++}
++
++static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
++{
++ return tcp_conn_request(&mptcp6_request_sock_ops,
++ &mptcp_join_request_sock_ipv6_ops,
++ meta_sk, skb);
++}
++
++int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *child, *rsk = NULL;
++ int ret;
++
++ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
++ struct tcphdr *th = tcp_hdr(skb);
++ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
++ struct sock *sk;
++
++ sk = __inet6_lookup_established(sock_net(meta_sk),
++ &tcp_hashinfo,
++ &ip6h->saddr, th->source,
++ &ip6h->daddr, ntohs(th->dest),
++ inet6_iif(skb));
++
++ if (!sk) {
++ kfree_skb(skb);
++ return 0;
++ }
++ if (is_meta_sk(sk)) {
++ WARN("%s Did not find a sub-sk!\n", __func__);
++ kfree_skb(skb);
++ sock_put(sk);
++ return 0;
++ }
++
++ if (sk->sk_state == TCP_TIME_WAIT) {
++ inet_twsk_put(inet_twsk(sk));
++ kfree_skb(skb);
++ return 0;
++ }
++
++ ret = tcp_v6_do_rcv(sk, skb);
++ sock_put(sk);
++
++ return ret;
++ }
++ TCP_SKB_CB(skb)->mptcp_flags = 0;
++
++ /* Has been removed from the tk-table. Thus, no new subflows.
++ *
++ * Check for close-state is necessary, because we may have been closed
++ * without passing by mptcp_close().
++ *
++ * When falling back, no new subflows are allowed either.
++ */
++ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
++ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
++ goto reset_and_discard;
++
++ child = tcp_v6_hnd_req(meta_sk, skb);
++
++ if (!child)
++ goto discard;
++
++ if (child != meta_sk) {
++ sock_rps_save_rxhash(child, skb);
++ /* We don't call tcp_child_process here, because we already
++ * hold the meta-sk-lock and are sure that it is not owned
++ * by the user.
++ */
++ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
++ bh_unlock_sock(child);
++ sock_put(child);
++ if (ret) {
++ rsk = child;
++ goto reset_and_discard;
++ }
++ } else {
++ if (tcp_hdr(skb)->syn) {
++ mptcp_v6_join_request(meta_sk, skb);
++ goto discard;
++ }
++ goto reset_and_discard;
++ }
++ return 0;
++
++reset_and_discard:
++ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
++ const struct tcphdr *th = tcp_hdr(skb);
++ struct request_sock **prev, *req;
++ /* If we end up here, it means we should not have matched on the
++ * request-socket. But, because the request-sock queue is only
++ * destroyed in mptcp_close, the socket may actually already be
++ * in close-state (e.g., through shutdown()) while still having
++ * pending request sockets.
++ */
++ req = inet6_csk_search_req(meta_sk, &prev, th->source,
++ &ipv6_hdr(skb)->saddr,
++ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
++ if (req) {
++ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
++ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
++ req);
++ reqsk_free(req);
++ }
++ }
++
++ tcp_v6_send_reset(rsk, skb);
++discard:
++ kfree_skb(skb);
++ return 0;
++}
++
++/* After this, the ref count of the meta_sk associated with the request_sock
++ * is incremented. Thus it is the responsibility of the caller
++ * to call sock_put() when the reference is not needed anymore.
++ */
++struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
++ const struct in6_addr *laddr, const struct net *net)
++{
++ const struct mptcp_request_sock *mtreq;
++ struct sock *meta_sk = NULL;
++ const struct hlist_nulls_node *node;
++ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
++
++ rcu_read_lock();
++begin:
++ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
++ hash_entry) {
++ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
++ meta_sk = mtreq->mptcp_mpcb->meta_sk;
++
++ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
++ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
++ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
++ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
++ net_eq(net, sock_net(meta_sk)))
++ goto found;
++ meta_sk = NULL;
++ }
++ /* A request-socket is destroyed by RCU. So, it might have been recycled
++ * and put into another hash-table list. So, after the lookup we may
++ * end up in a different list. So, we may need to restart.
++ *
++ * See also the comment in __inet_lookup_established.
++ */
++ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
++ goto begin;
++
++found:
++ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
++ meta_sk = NULL;
++ rcu_read_unlock();
++
++ return meta_sk;
++}
++
++/* Create a new IPv6 subflow.
++ *
++ * We are in user-context and the meta-sock-lock is held.
++ */
++int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
++ struct mptcp_rem6 *rem)
++{
++ struct tcp_sock *tp;
++ struct sock *sk;
++ struct sockaddr_in6 loc_in, rem_in;
++ struct socket sock;
++ int ret;
++
++ /** First, create and prepare the new socket */
++
++ sock.type = meta_sk->sk_socket->type;
++ sock.state = SS_UNCONNECTED;
++ sock.wq = meta_sk->sk_socket->wq;
++ sock.file = meta_sk->sk_socket->file;
++ sock.ops = NULL;
++
++ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
++ if (unlikely(ret < 0)) {
++ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
++ return ret;
++ }
++
++ sk = sock.sk;
++ tp = tcp_sk(sk);
++
++ /* All subsockets need the MPTCP-lock-class */
++ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
++ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
++
++ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
++ goto error;
++
++ tp->mptcp->slave_sk = 1;
++ tp->mptcp->low_prio = loc->low_prio;
++
++ /* Initializing the timer for an MPTCP subflow */
++ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
++
++ /** Then, connect the socket to the peer */
++ loc_in.sin6_family = AF_INET6;
++ rem_in.sin6_family = AF_INET6;
++ loc_in.sin6_port = 0;
++ if (rem->port)
++ rem_in.sin6_port = rem->port;
++ else
++ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
++ loc_in.sin6_addr = loc->addr;
++ rem_in.sin6_addr = rem->addr;
++
++ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
++ if (ret < 0) {
++ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
++ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
++ tp->mptcp->path_index, &loc_in.sin6_addr,
++ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
++ ntohs(rem_in.sin6_port));
++
++ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
++ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
++
++ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
++ sizeof(struct sockaddr_in6), O_NONBLOCK);
++ if (ret < 0 && ret != -EINPROGRESS) {
++ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
++ __func__, ret);
++ goto error;
++ }
++
++ sk_set_socket(sk, meta_sk->sk_socket);
++ sk->sk_wq = meta_sk->sk_wq;
++
++ return 0;
++
++error:
++ /* May happen if mptcp_add_sock fails first */
++ if (!mptcp(tp)) {
++ tcp_close(sk, 0);
++ } else {
++ local_bh_disable();
++ mptcp_sub_force_close(sk);
++ local_bh_enable();
++ }
++ return ret;
++}
++EXPORT_SYMBOL(mptcp_init6_subsockets);
++
++const struct inet_connection_sock_af_ops mptcp_v6_specific = {
++ .queue_xmit = inet6_csk_xmit,
++ .send_check = tcp_v6_send_check,
++ .rebuild_header = inet6_sk_rebuild_header,
++ .sk_rx_dst_set = inet6_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct ipv6hdr),
++ .net_frag_header_len = sizeof(struct frag_hdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
++ .queue_xmit = ip_queue_xmit,
++ .send_check = tcp_v4_send_check,
++ .rebuild_header = inet_sk_rebuild_header,
++ .sk_rx_dst_set = inet_sk_rx_dst_set,
++ .conn_request = mptcp_conn_request,
++ .syn_recv_sock = tcp_v6_syn_recv_sock,
++ .net_header_len = sizeof(struct iphdr),
++ .setsockopt = ipv6_setsockopt,
++ .getsockopt = ipv6_getsockopt,
++ .addr2sockaddr = inet6_csk_addr2sockaddr,
++ .sockaddr_len = sizeof(struct sockaddr_in6),
++ .bind_conflict = inet6_csk_bind_conflict,
++#ifdef CONFIG_COMPAT
++ .compat_setsockopt = compat_ipv6_setsockopt,
++ .compat_getsockopt = compat_ipv6_getsockopt,
++#endif
++};
++
++struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
++struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
++
++int mptcp_pm_v6_init(void)
++{
++ int ret = 0;
++ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
++
++ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
++
++ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
++ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
++ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
++
++ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
++ if (ops->slab_name == NULL) {
++ ret = -ENOMEM;
++ goto out;
++ }
++
++ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
++ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
++ NULL);
++
++ if (ops->slab == NULL) {
++ ret = -ENOMEM;
++ goto err_reqsk_create;
++ }
++
++out:
++ return ret;
++
++err_reqsk_create:
++ kfree(ops->slab_name);
++ ops->slab_name = NULL;
++ goto out;
++}
++
++void mptcp_pm_v6_undo(void)
++{
++ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
++ kfree(mptcp6_request_sock_ops.slab_name);
++}
+diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
+new file mode 100644
+index 000000000000..6f5087983175
+--- /dev/null
++++ b/net/mptcp/mptcp_ndiffports.c
+@@ -0,0 +1,161 @@
++#include <linux/module.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++
++#if IS_ENABLED(CONFIG_IPV6)
++#include <net/mptcp_v6.h>
++#endif
++
++struct ndiffports_priv {
++ /* Worker struct for subflow establishment */
++ struct work_struct subflow_work;
++
++ struct mptcp_cb *mpcb;
++};
++
++static int num_subflows __read_mostly = 2;
++module_param(num_subflows, int, 0644);
++MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
++
++/**
++ * Create all new subflows by calling mptcp_initX_subsockets().
++ *
++ * This function uses a goto next_subflow to allow releasing the lock between
++ * the creation of new subflows, giving other processes a chance to do some
++ * work on the socket and potentially finish the communication.
++ **/
++static void create_subflow_worker(struct work_struct *work)
++{
++ const struct ndiffports_priv *pm_priv = container_of(work,
++ struct ndiffports_priv,
++ subflow_work);
++ struct mptcp_cb *mpcb = pm_priv->mpcb;
++ struct sock *meta_sk = mpcb->meta_sk;
++ int iter = 0;
++
++next_subflow:
++ if (iter) {
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++
++ cond_resched();
++ }
++ mutex_lock(&mpcb->mpcb_mutex);
++ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
++
++ iter++;
++
++ if (sock_flag(meta_sk, SOCK_DEAD))
++ goto exit;
++
++ if (mpcb->master_sk &&
++ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
++ goto exit;
++
++ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
++ if (meta_sk->sk_family == AF_INET ||
++ mptcp_v6_is_v4_mapped(meta_sk)) {
++ struct mptcp_loc4 loc;
++ struct mptcp_rem4 rem;
++
++ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
++ loc.loc4_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem4_id = 0; /* Default 0 */
++
++ mptcp_init4_subsockets(meta_sk, &loc, &rem);
++ } else {
++#if IS_ENABLED(CONFIG_IPV6)
++ struct mptcp_loc6 loc;
++ struct mptcp_rem6 rem;
++
++ loc.addr = inet6_sk(meta_sk)->saddr;
++ loc.loc6_id = 0;
++ loc.low_prio = 0;
++
++ rem.addr = meta_sk->sk_v6_daddr;
++ rem.port = inet_sk(meta_sk)->inet_dport;
++ rem.rem6_id = 0; /* Default 0 */
++
++ mptcp_init6_subsockets(meta_sk, &loc, &rem);
++#endif
++ }
++ goto next_subflow;
++ }
++
++exit:
++ release_sock(meta_sk);
++ mutex_unlock(&mpcb->mpcb_mutex);
++ sock_put(meta_sk);
++}
++
++static void ndiffports_new_session(const struct sock *meta_sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ /* Initialize workqueue-struct */
++ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
++ fmp->mpcb = mpcb;
++}
++
++static void ndiffports_create_subflows(struct sock *meta_sk)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
++
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
++ mpcb->send_infinite_mapping ||
++ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
++ return;
++
++ if (!work_pending(&pm_priv->subflow_work)) {
++ sock_hold(meta_sk);
++ queue_work(mptcp_wq, &pm_priv->subflow_work);
++ }
++}
++
++static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++static struct mptcp_pm_ops ndiffports __read_mostly = {
++ .new_session = ndiffports_new_session,
++ .fully_established = ndiffports_create_subflows,
++ .get_local_id = ndiffports_get_local_id,
++ .name = "ndiffports",
++ .owner = THIS_MODULE,
++};
++
++/* General initialization of MPTCP_PM */
++static int __init ndiffports_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
++
++ if (mptcp_register_path_manager(&ndiffports))
++ goto exit;
++
++ return 0;
++
++exit:
++ return -1;
++}
++
++static void ndiffports_unregister(void)
++{
++ mptcp_unregister_path_manager(&ndiffports);
++}
++
++module_init(ndiffports_register);
++module_exit(ndiffports_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
++MODULE_VERSION("0.88");
+diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
+new file mode 100644
+index 000000000000..ec4e98622637
+--- /dev/null
++++ b/net/mptcp/mptcp_ofo_queue.c
+@@ -0,0 +1,295 @@
++/*
++ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <linux/slab.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
++ const struct sk_buff *skb)
++{
++ struct tcp_sock *tp;
++
++ mptcp_for_each_tp(mpcb, tp) {
++ if (tp->mptcp->shortcut_ofoqueue == skb) {
++ tp->mptcp->shortcut_ofoqueue = NULL;
++ return;
++ }
++ }
++}
++
++/* Does 'skb' fit after 'here' in the queue 'head'?
++ * If yes, we queue it and return 1.
++ */
++static int mptcp_ofo_queue_after(struct sk_buff_head *head,
++ struct sk_buff *skb, struct sk_buff *here,
++ const struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We want to queue skb after here, thus seq >= here->end_seq */
++ if (before(seq, TCP_SKB_CB(here)->end_seq))
++ return 0;
++
++ if (seq == TCP_SKB_CB(here)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
++ return 1;
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ return -1;
++ }
++ }
++
++ /* If here is the last one, we can always queue it */
++ if (skb_queue_is_last(head, here)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ } else {
++ struct sk_buff *skb1 = skb_queue_next(head, here);
++ /* It's not the last one, but does it fit between 'here' and
++ * the one after 'here'? That is, does end_seq <= after_here->seq?
++ */
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
++ __skb_queue_after(head, here, skb);
++ return 1;
++ }
++ }
++
++ return 0;
++}
++
++static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
++ struct sk_buff_head *head, struct tcp_sock *tp)
++{
++ struct sock *meta_sk = tp->meta_sk;
++ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb1, *best_shortcut = NULL;
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++ u32 distance = 0xffffffff;
++
++ /* First, check the tp's shortcut */
++ if (!shortcut) {
++ if (skb_queue_empty(head)) {
++ __skb_queue_head(head, skb);
++ goto end;
++ }
++ } else {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++ /* Is the tp's shortcut a hit? If yes, we insert. */
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Check the shortcuts of the other subsockets. */
++ mptcp_for_each_tp(mpcb, tp_it) {
++ shortcut = tp_it->mptcp->shortcut_ofoqueue;
++ /* Can we queue it here? If yes, do so! */
++ if (shortcut) {
++ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
++
++ if (ret) {
++ skb = (ret > 0) ? skb : NULL;
++ goto end;
++ }
++ }
++
++ /* Could not queue it, check if we are close.
++ * We are looking for a shortcut, close enough to seq to
++ * set skb1 prematurely and thus improve the subsequent lookup,
++ * which tries to find a skb1 so that skb1->seq <= seq.
++ *
++ * So, here we only take shortcuts whose shortcut->seq > seq,
++ * minimize the distance between shortcut->seq and seq, and
++ * set best_shortcut to the one with the minimal distance.
++ *
++ * That way, the subsequent while-loop is as short as possible.
++ */
++ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
++ /* Are we closer than the current best shortcut? */
++ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
++ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
++ best_shortcut = shortcut;
++ }
++ }
++ }
++
++ if (best_shortcut)
++ skb1 = best_shortcut;
++ else
++ skb1 = skb_peek_tail(head);
++
++ if (seq == TCP_SKB_CB(skb1)->end_seq) {
++ bool fragstolen = false;
++
++ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
++ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
++ } else {
++ kfree_skb_partial(skb, fragstolen);
++ skb = NULL;
++ }
++
++ goto end;
++ }
++
++ /* Find the insertion point, starting from best_shortcut if available.
++ *
++ * Inspired from tcp_data_queue_ofo.
++ */
++ while (1) {
++ /* skb1->seq <= seq */
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(head, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(head, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. */
++ __kfree_skb(skb);
++ skb = NULL;
++ goto end;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(head, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(head, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(head, skb);
++ else
++ __skb_queue_after(head, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(head, skb)) {
++ skb1 = skb_queue_next(head, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, head);
++ mptcp_remove_shortcuts(mpcb, skb1);
++ __kfree_skb(skb1);
++ }
++
++end:
++ if (skb) {
++ skb_set_owner_r(skb, meta_sk);
++ tp->mptcp->shortcut_ofoqueue = skb;
++ }
++
++ return;
++}
++
++/**
++ * mptcp_add_meta_ofo_queue - add skb to the meta out-of-order queue
++ * @sk: the subflow that received this skb.
++ */
++void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
++ struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
++ &tcp_sk(meta_sk)->out_of_order_queue, tp);
++}
++
++bool mptcp_prune_ofo_queue(struct sock *sk)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ bool res = false;
++
++ if (!skb_queue_empty(&tp->out_of_order_queue)) {
++ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
++ mptcp_purge_ofo_queue(tp);
++
++ /* No sack at the mptcp-level */
++ sk_mem_reclaim(sk);
++ res = true;
++ }
++
++ return res;
++}
++
++void mptcp_ofo_queue(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++
++ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
++ u32 old_rcv_nxt = meta_tp->rcv_nxt;
++ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
++ break;
++
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ __kfree_skb(skb);
++ continue;
++ }
++
++ __skb_unlink(skb, &meta_tp->out_of_order_queue);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++
++ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
++ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
++ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
++
++ if (tcp_hdr(skb)->fin)
++ mptcp_fin(meta_sk);
++ }
++}
++
++void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
++{
++ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
++ struct sk_buff *skb, *tmp;
++
++ skb_queue_walk_safe(head, skb, tmp) {
++ __skb_unlink(skb, head);
++ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
++ kfree_skb(skb);
++ }
++}
+diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
+new file mode 100644
+index 000000000000..53f5c43bb488
+--- /dev/null
++++ b/net/mptcp/mptcp_olia.c
+@@ -0,0 +1,311 @@
++/*
++ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
++ *
++ * Algorithm design:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ * Nicolas Gast <nicolas.gast@epfl.ch>
++ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
++ *
++ * Implementation:
++ * Ramin Khalili <ramin.khalili@epfl.ch>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++static int scale = 10;
++
++struct mptcp_olia {
++ u32 mptcp_loss1;
++ u32 mptcp_loss2;
++ u32 mptcp_loss3;
++ int epsilon_num;
++ u32 epsilon_den;
++ int mptcp_snd_cwnd_cnt;
++};
++
++static inline int mptcp_olia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_olia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++/* Take care of the artificial inflation of cwnd (see RFC 5681)
++ * during the fast-retransmit phase.
++ */
++static u32 mptcp_get_crt_cwnd(struct sock *sk)
++{
++ const struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (icsk->icsk_ca_state == TCP_CA_Recovery)
++ return tcp_sk(sk)->snd_ssthresh;
++ else
++ return tcp_sk(sk)->snd_cwnd;
++}
++
++/* Return the denominator of the first term of the increase formula. */
++static u64 mptcp_get_rate(const struct mptcp_cb *mpcb, u32 path_rtt)
++{
++ struct sock *sk;
++ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
++
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ u64 scaled_num;
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
++ rate += div_u64(scaled_num, tp->srtt_us);
++ }
++ rate *= rate;
++ return rate;
++}
++
++/* find the maximum cwnd, used to find set M */
++static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
++{
++ struct sock *sk;
++ u32 best_cwnd = 0;
++
++ mptcp_for_each_sk(mpcb, sk) {
++ u32 tmp_cwnd;
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd > best_cwnd)
++ best_cwnd = tmp_cwnd;
++ }
++ return best_cwnd;
++}
++
++static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
++{
++ struct mptcp_olia *ca;
++ struct tcp_sock *tp;
++ struct sock *sk;
++ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
++ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
++ u8 M = 0, B_not_M = 0;
++
++ /* TODO - integrate this in the following loop - we just want to iterate once */
++
++ max_cwnd = mptcp_get_max_cwnd(mpcb);
++
++ /* find the best path */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ /* TODO - check here and rename variables */
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
++ best_rtt = tmp_rtt;
++ best_int = tmp_int;
++ best_cwnd = tmp_cwnd;
++ }
++ }
++
++ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
++ /* find the size of M and B_not_M */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++ if (tmp_cwnd == max_cwnd) {
++ M++;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++
++ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
++ B_not_M++;
++ }
++ }
++
++ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
++ mptcp_for_each_sk(mpcb, sk) {
++ tp = tcp_sk(sk);
++ ca = inet_csk_ca(sk);
++
++ if (!mptcp_olia_sk_can_send(sk))
++ continue;
++
++ if (B_not_M == 0) {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ } else {
++ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
++ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
++ ca->mptcp_loss2 - ca->mptcp_loss1);
++ tmp_cwnd = mptcp_get_crt_cwnd(sk);
++
++ if (tmp_cwnd < max_cwnd &&
++ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
++ ca->epsilon_num = 1;
++ ca->epsilon_den = mpcb->cnt_established * B_not_M;
++ } else if (tmp_cwnd == max_cwnd) {
++ ca->epsilon_num = -1;
++ ca->epsilon_den = mpcb->cnt_established * M;
++ } else {
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++ }
++ }
++}
++
++/* setting the initial values */
++static void mptcp_olia_init(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (mptcp(tp)) {
++ ca->mptcp_loss1 = tp->snd_una;
++ ca->mptcp_loss2 = tp->snd_una;
++ ca->mptcp_loss3 = tp->snd_una;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ ca->epsilon_num = 0;
++ ca->epsilon_den = 1;
++ }
++}
++
++/* updating inter-loss distance and ssthresh */
++static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ if (new_state == TCP_CA_Loss ||
++ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++
++ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
++ !inet_csk(sk)->icsk_retransmits) {
++ ca->mptcp_loss1 = ca->mptcp_loss2;
++ ca->mptcp_loss2 = ca->mptcp_loss3;
++ }
++ }
++}
++
++/* main algorithm */
++static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_olia *ca = inet_csk_ca(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ u64 inc_num, inc_den, rate, cwnd_scaled;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ ca->mptcp_loss3 = tp->snd_una;
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ /* slow start if it is in the safe area */
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ return;
++ }
++
++ mptcp_get_epsilon(mpcb);
++ rate = mptcp_get_rate(mpcb, tp->srtt_us);
++ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
++ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
++
++ /* calculate the increasing term, scaling is used to reduce the rounding effect */
++ if (ca->epsilon_num == -1) {
++ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
++ inc_num = rate - ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt -= div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ } else {
++ inc_num = ca->epsilon_den *
++ cwnd_scaled * cwnd_scaled - rate;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++ } else {
++ inc_num = ca->epsilon_num * rate +
++ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
++ ca->mptcp_snd_cwnd_cnt += div64_u64(
++ mptcp_olia_scale(inc_num, scale), inc_den);
++ }
++
++
++ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
++ tp->snd_cwnd++;
++ ca->mptcp_snd_cwnd_cnt = 0;
++ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
++ tp->snd_cwnd = max((int)1, (int)tp->snd_cwnd - 1);
++ ca->mptcp_snd_cwnd_cnt = 0;
++ }
++}
++
++static struct tcp_congestion_ops mptcp_olia = {
++ .init = mptcp_olia_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_olia_cong_avoid,
++ .set_state = mptcp_olia_set_state,
++ .owner = THIS_MODULE,
++ .name = "olia",
++};
++
++static int __init mptcp_olia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_olia);
++}
++
++static void __exit mptcp_olia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_olia);
++}
++
++module_init(mptcp_olia_register);
++module_exit(mptcp_olia_unregister);
++
++MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
++MODULE_VERSION("0.1");
+diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
+new file mode 100644
+index 000000000000..400ea254c078
+--- /dev/null
++++ b/net/mptcp/mptcp_output.c
+@@ -0,0 +1,1743 @@
++/*
++ * MPTCP implementation - Sending side
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/kconfig.h>
++#include <linux/skbuff.h>
++#include <linux/tcp.h>
++
++#include <net/mptcp.h>
++#include <net/mptcp_v4.h>
++#include <net/mptcp_v6.h>
++#include <net/sock.h>
++
++static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
++ MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++
++static inline int mptcp_sub_len_remove_addr(u16 bitfield)
++{
++ unsigned int c;
++ for (c = 0; bitfield; c++)
++ bitfield &= bitfield - 1;
++ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
++}
++
++int mptcp_sub_len_remove_addr_align(u16 bitfield)
++{
++ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
++}
++EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
++
++/* get the data-seq and end-data-seq and store them again in the
++ * tcp_skb_cb
++ */
++static int mptcp_reconstruct_mapping(struct sk_buff *skb)
++{
++ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
++ u32 *p32;
++ u16 *p16;
++
++ if (!mpdss->M)
++ return 1;
++
++ /* Move the pointer to the data-seq */
++ p32 = (u32 *)mpdss;
++ p32++;
++ if (mpdss->A) {
++ p32++;
++ if (mpdss->a)
++ p32++;
++ }
++
++ TCP_SKB_CB(skb)->seq = ntohl(*p32);
++
++ /* Get the data_len to calculate the end_data_seq */
++ p32++;
++ p32++;
++ p16 = (u16 *)p32;
++ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
++
++ return 0;
++}
++
++static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct sk_buff *skb_it;
++
++ skb_it = tcp_write_queue_head(meta_sk);
++
++ tcp_for_write_queue_from(skb_it, meta_sk) {
++ if (skb_it == tcp_send_head(meta_sk))
++ break;
++
++ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
++ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
++ break;
++ }
++ }
++}
++
++/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
++ * coming from the meta-retransmit-timer
++ */
++static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
++ struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb, *skb1;
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ u32 seq, end_seq;
++
++ if (clone_it) {
++ /* pskb_copy is necessary here, because the TCP/IP headers
++ * will be changed when the skb is reinjected on another
++ * subflow.
++ */
++ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
++ } else {
++ __skb_unlink(orig_skb, &sk->sk_write_queue);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++ sk->sk_wmem_queued -= orig_skb->truesize;
++ sk_mem_uncharge(sk, orig_skb->truesize);
++ skb = orig_skb;
++ }
++ if (unlikely(!skb))
++ return;
++
++ if (sk && mptcp_reconstruct_mapping(skb)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ skb->sk = meta_sk;
++
++ /* If it reached already the destination, we don't have to reinject it */
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ __kfree_skb(skb);
++ return;
++ }
++
++ /* Only reinject segments that are fully covered by the mapping */
++ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
++ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
++ u32 seq = TCP_SKB_CB(skb)->seq;
++ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
++
++ __kfree_skb(skb);
++
++ /* OK, now we have to look for the full mapping in the meta
++ * send-queue.
++ */
++ tcp_for_write_queue(skb, meta_sk) {
++ /* Not yet at the mapping? */
++ if (before(TCP_SKB_CB(skb)->seq, seq))
++ continue;
++ /* We have passed by the mapping */
++ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
++ return;
++
++ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
++ }
++ return;
++ }
++
++ /* Segment goes back to the MPTCP-layer. So, we need to zero the
++ * path_mask/dss.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ /* We need to find out the path-mask from the meta-write-queue
++ * to properly select a subflow.
++ */
++ mptcp_find_and_set_pathmask(meta_sk, skb);
++
++ /* If it's empty, just add */
++ if (skb_queue_empty(&mpcb->reinject_queue)) {
++ skb_queue_head(&mpcb->reinject_queue, skb);
++ return;
++ }
++
++ /* Find the place to insert skb, or even 'drop' it if the
++ * data is already covered by other skbs in the reinject-queue.
++ *
++ * This is inspired by code from tcp_data_queue.
++ */
++
++ skb1 = skb_peek_tail(&mpcb->reinject_queue);
++ seq = TCP_SKB_CB(skb)->seq;
++ while (1) {
++ if (!after(TCP_SKB_CB(skb1)->seq, seq))
++ break;
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
++ skb1 = NULL;
++ break;
++ }
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++
++ /* Does skb overlap the previous one? */
++ end_seq = TCP_SKB_CB(skb)->end_seq;
++ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
++ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
++ /* All the bits are present. Don't reinject */
++ __kfree_skb(skb);
++ return;
++ }
++ if (seq == TCP_SKB_CB(skb1)->seq) {
++ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
++ skb1 = NULL;
++ else
++ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
++ }
++ }
++ if (!skb1)
++ __skb_queue_head(&mpcb->reinject_queue, skb);
++ else
++ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
++
++ /* And clean segments covered by new one as whole. */
++ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
++ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
++
++ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
++ break;
++
++ __skb_unlink(skb1, &mpcb->reinject_queue);
++ __kfree_skb(skb1);
++ }
++ return;
++}
++
++/* Inserts data into the reinject queue */
++void mptcp_reinject_data(struct sock *sk, int clone_it)
++{
++ struct sk_buff *skb_it, *tmp;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct sock *meta_sk = tp->meta_sk;
++
++ /* It has already been closed; there is really no point in reinjecting */
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return;
++
++ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
++ /* Subflow SYNs and FINs are not reinjected.
++ *
++ * Neither are empty subflow-FINs with a data-fin;
++ * they are reinjected below (without the subflow-FIN flag).
++ */
++ if (tcb->tcp_flags & TCPHDR_SYN ||
++ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
++ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
++ continue;
++
++ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
++ }
++
++ skb_it = tcp_write_queue_tail(meta_sk);
++ /* If sk has sent the empty data-fin, we have to reinject it too. */
++ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
++ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
++ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
++ }
++
++ mptcp_push_pending_frames(meta_sk);
++
++ tp->pf = 1;
++}
++EXPORT_SYMBOL(mptcp_reinject_data);
++
++static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
++ struct sock *subsk)
++{
++ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk_it;
++ int all_empty = 1, all_acked;
++
++ /* In infinite mapping we always try to combine */
++ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ return;
++ }
++
++ /* Don't combine if they didn't combine; otherwise we end up in
++ * TIME_WAIT, even if our app is smart enough to avoid it.
++ */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
++ if (!mpcb->dfin_combined)
++ return;
++ }
++
++ /* If no other subflow has data to send, we can combine */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ if (!mptcp_sk_can_send(sk_it))
++ continue;
++
++ if (!tcp_write_queue_empty(sk_it))
++ all_empty = 0;
++ }
++
++ /* If all data has been DATA_ACKed, we can combine.
++ * -1, because the data_fin consumed one byte
++ */
++ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
++
++ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
++ subsk->sk_shutdown |= SEND_SHUTDOWN;
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
++ }
++}
++
++static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *start = ptr;
++ __u16 data_len;
++
++ *ptr++ = htonl(tcb->seq); /* data_seq */
++
++ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
++ if (mptcp_is_data_fin(skb) && skb->len == 0)
++ *ptr++ = 0; /* subseq */
++ else
++ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
++
++ if (tcb->mptcp_flags & MPTCPHDR_INF)
++ data_len = 0;
++ else
++ data_len = tcb->end_seq - tcb->seq;
++
++ if (tp->mpcb->dss_csum && data_len) {
++ __be16 *p16 = (__be16 *)ptr;
++ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
++ __wsum csum;
++
++ *ptr = htonl(((data_len) << 16) |
++ (TCPOPT_EOL << 8) |
++ (TCPOPT_EOL));
++ csum = csum_partial(ptr - 2, 12, skb->csum);
++ p16++;
++ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
++ } else {
++ *ptr++ = htonl(((data_len) << 16) |
++ (TCPOPT_NOP << 8) |
++ (TCPOPT_NOP));
++ }
++
++ return ptr - start;
++}
++
++static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
++ __be32 *ptr)
++{
++ struct mp_dss *mdss = (struct mp_dss *)ptr;
++ __be32 *start = ptr;
++
++ mdss->kind = TCPOPT_MPTCP;
++ mdss->sub = MPTCP_SUB_DSS;
++ mdss->rsv1 = 0;
++ mdss->rsv2 = 0;
++ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
++ mdss->m = 0;
++ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
++ mdss->a = 0;
++ mdss->A = 1;
++ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
++ ptr++;
++
++ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ return ptr - start;
++}
++
++/* RFC6824 states that once a particular subflow mapping has been sent
++ * out it must never be changed. However, packets may be split while
++ * they are in the retransmission queue (due to SACK or ACKs) and that
++ * arguably means that we would change the mapping (e.g. it splits it,
++ * or sends out a subset of the initial mapping).
++ *
++ * Furthermore, the skb checksum is not always preserved across splits
++ * (e.g. mptcp_fragment) which would mean that we need to recompute
++ * the DSS checksum in this case.
++ *
++ * To avoid this we save the initial DSS mapping which allows us to
++ * send the same DSS mapping even for fragmented retransmits.
++ */
++static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
++{
++ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
++ __be32 *ptr = (__be32 *)tcb->dss;
++
++ tcb->mptcp_flags |= MPTCPHDR_SEQ;
++
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
++}
++
++/* Write the saved DSS mapping to the header */
++static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
++ __be32 *ptr)
++{
++ __be32 *start = ptr;
++
++ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
++
++ /* update the data_ack */
++ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
++
++ /* dss is in a union with inet_skb_parm and
++ * the IP layer expects zeroed IPCB fields.
++ */
++ memset(TCP_SKB_CB(skb)->dss, 0, mptcp_dss_len);
++
++ return mptcp_dss_len / sizeof(*ptr);
++}
++
++static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct sock *meta_sk = mptcp_meta_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ struct tcp_skb_cb *tcb;
++ struct sk_buff *subskb = NULL;
++
++ if (!reinject)
++ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
++ MPTCPHDR_SEQ64_INDEX : 0);
++
++ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
++ if (!subskb)
++ return false;
++
++ /* At the subflow-level we need to call tcp_init_tso_segs again. We
++ * force this by setting gso_segs to 0. It has been set to 1 prior to
++ * the call to mptcp_skb_entail.
++ */
++ skb_shinfo(subskb)->gso_segs = 0;
++
++ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
++
++ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
++ skb->ip_summed == CHECKSUM_PARTIAL) {
++ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
++ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
++ }
++
++ tcb = TCP_SKB_CB(subskb);
++
++ if (tp->mpcb->send_infinite_mapping &&
++ !tp->mpcb->infinite_mapping_snd &&
++ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
++ tp->mptcp->fully_established = 1;
++ tp->mpcb->infinite_mapping_snd = 1;
++ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
++ tcb->mptcp_flags |= MPTCPHDR_INF;
++ }
++
++ if (mptcp_is_data_fin(subskb))
++ mptcp_combine_dfin(subskb, meta_sk, sk);
++
++ mptcp_save_dss_data_seq(tp, subskb);
++
++ tcb->seq = tp->write_seq;
++ tcb->sacked = 0; /* reset the sacked field: from the point of view
++ * of this subflow, we are sending a brand new
++ * segment
++ */
++ /* Take into account seg len */
++ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
++ tcb->end_seq = tp->write_seq;
++
++ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
++ * segment is not part of the subflow but on a meta-only-level.
++ */
++ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
++ tcp_add_write_queue_tail(sk, subskb);
++ sk->sk_wmem_queued += subskb->truesize;
++ sk_mem_charge(sk, subskb->truesize);
++ } else {
++ int err;
++
++ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
++ * skb->len = 0 will force tso_segs to 1.
++ */
++ tcp_init_tso_segs(sk, subskb, 1);
++ /* Empty data-fins are sent immediately on the subflow */
++ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
++ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
++
++ /* It has not been queued, we can free it now. */
++ kfree_skb(subskb);
++
++ if (err)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ tp->mptcp->second_packet = 1;
++ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
++ }
++
++ return true;
++}
++
++/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
++ * might need to undo some operations done by tcp_fragment.
++ */
++static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
++ gfp_t gfp, int reinject)
++{
++ int ret, diff, old_factor;
++ struct sk_buff *buff;
++ u8 flags;
++
++ if (skb_headlen(skb) < len)
++ diff = skb->len - len;
++ else
++ diff = skb->data_len;
++ old_factor = tcp_skb_pcount(skb);
++
++ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
++ * At the MPTCP-level we do not care about the absolute value. All we
++ * care about is that it is set to 1 for accurate packets_out
++ * accounting.
++ */
++ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
++ if (ret)
++ return ret;
++
++ buff = skb->next;
++
++ flags = TCP_SKB_CB(skb)->mptcp_flags;
++ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
++ TCP_SKB_CB(buff)->mptcp_flags = flags;
++ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
++
++ /* If reinject == 1, the buff will be added to the reinject
++ * queue, which is currently not part of memory accounting. So
++ * undo the changes done by tcp_fragment and update the
++ * reinject queue. Also, undo changes to the packet counters.
++ */
++ if (reinject == 1) {
++ int undo = buff->truesize - diff;
++ meta_sk->sk_wmem_queued -= undo;
++ sk_mem_uncharge(meta_sk, undo);
++
++ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
++ meta_sk->sk_write_queue.qlen--;
++
++ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
++ undo = old_factor - tcp_skb_pcount(skb) -
++ tcp_skb_pcount(buff);
++ if (undo)
++ tcp_adjust_pcount(meta_sk, skb, -undo);
++ }
++ }
++
++ return 0;
++}
++
++/* Inspired by tcp_write_wakeup */
++int mptcp_write_wakeup(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb;
++ struct sock *sk_it;
++ int ans = 0;
++
++ if (meta_sk->sk_state == TCP_CLOSE)
++ return -1;
++
++ skb = tcp_send_head(meta_sk);
++ if (skb &&
++ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
++ unsigned int mss;
++ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
++ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
++ struct tcp_sock *subtp;
++ if (!subsk)
++ goto window_probe;
++ subtp = tcp_sk(subsk);
++ mss = tcp_current_mss(subsk);
++
++ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
++ tcp_wnd_end(subtp) - subtp->write_seq);
++
++ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
++ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
++
++ /* We are probing the opening of a window
++ * but the window size is != 0 - this must
++ * have been the result of SWS avoidance (sender)
++ */
++ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
++ skb->len > mss) {
++ seg_size = min(seg_size, mss);
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (mptcp_fragment(meta_sk, skb, seg_size,
++ GFP_ATOMIC, 0))
++ return -1;
++ } else if (!tcp_skb_pcount(skb)) {
++ /* see mptcp_write_xmit on why we use UINT_MAX */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++ }
++
++ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
++ if (!mptcp_skb_entail(subsk, skb, 0))
++ return -1;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++
++ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
++
++ return 0;
++ } else {
++window_probe:
++ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
++ meta_tp->snd_una + 0xFFFF)) {
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ if (mptcp_sk_can_send_ack(sk_it))
++ tcp_xmit_probe_skb(sk_it, 1);
++ }
++ }
++
++ /* At least one of the tcp_xmit_probe_skb's has to succeed */
++ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
++ int ret;
++
++ if (!mptcp_sk_can_send_ack(sk_it))
++ continue;
++
++ ret = tcp_xmit_probe_skb(sk_it, 0);
++ if (unlikely(ret > 0))
++ ans = ret;
++ }
++ return ans;
++ }
++}
++
++bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
++ int push_one, gfp_t gfp)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
++ struct sock *subsk = NULL;
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sk_buff *skb;
++ unsigned int sent_pkts;
++ int reinject = 0;
++ unsigned int sublimit;
++
++ sent_pkts = 0;
++
++ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
++ &sublimit))) {
++ unsigned int limit;
++
++ subtp = tcp_sk(subsk);
++ mss_now = tcp_current_mss(subsk);
++
++ if (reinject == 1) {
++ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
++ /* Segment already reached the peer, take the next one */
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ __kfree_skb(skb);
++ continue;
++ }
++ }
++
++ /* If the segment was cloned (e.g. a meta retransmission),
++ * the header must be expanded/copied so that there is no
++ * corruption of TSO information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC))
++ break;
++
++ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
++ break;
++
++ /* Force tso_segs to 1 by using UINT_MAX.
++ * We actually don't care about the exact number of segments
++ * emitted on the subflow. We need just to set tso_segs, because
++ * we still need an accurate packets_out count in
++ * tcp_event_new_data_sent.
++ */
++ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
++
++ /* Check for nagle, regardless of tso_segs. If the segment is
++ * actually larger than mss_now (TSO segment), then
++ * tcp_nagle_check will have partial == false and always trigger
++ * the transmission.
++ * tcp_write_xmit has a TSO-level nagle check which is not
++ * subject to the MPTCP-level. It is based on the properties of
++ * the subflow, not the MPTCP-level.
++ */
++ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
++ (tcp_skb_is_last(meta_sk, skb) ?
++ nonagle : TCP_NAGLE_PUSH))))
++ break;
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ /* We limit the size of the skb so that it fits into the
++ * window. Call tcp_mss_split_point to avoid duplicating
++ * code.
++ * We really only care about fitting the skb into the
++ * window. That's why we use UINT_MAX. If the skb does
++ * not fit into the cwnd_quota or the NIC's max-segs
++ * limitation, it will be split by the subflow's
++ * tcp_write_xmit which does the appropriate call to
++ * tcp_mss_split_point.
++ */
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ nonagle);
++
++ if (sublimit)
++ limit = min(limit, sublimit);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
++ break;
++
++ if (!mptcp_skb_entail(subsk, skb, reinject))
++ break;
++ /* Nagle is handled at the MPTCP-layer, so
++ * always push on the subflow
++ */
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ if (!reinject) {
++ mptcp_check_sndseq_wrap(meta_tp,
++ TCP_SKB_CB(skb)->end_seq -
++ TCP_SKB_CB(skb)->seq);
++ tcp_event_new_data_sent(meta_sk, skb);
++ }
++
++ tcp_minshall_update(meta_tp, mss_now, skb);
++ sent_pkts += tcp_skb_pcount(skb);
++
++ if (reinject > 0) {
++ __skb_unlink(skb, &mpcb->reinject_queue);
++ kfree_skb(skb);
++ }
++
++ if (push_one)
++ break;
++ }
++
++ return !meta_tp->packets_out && tcp_send_head(meta_sk);
++}
++
++void mptcp_write_space(struct sock *sk)
++{
++ mptcp_push_pending_frames(mptcp_meta_sk(sk));
++}
++
++u32 __mptcp_select_window(struct sock *sk)
++{
++ struct inet_connection_sock *icsk = inet_csk(sk);
++ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ int mss, free_space, full_space, window;
++
++ /* MSS for the peer's data. Previous versions used mss_clamp
++ * here. I don't know if the value based on our guesses
++ * of peer's MSS is better for the performance. It's more correct
++ * but may be worse for the performance because of rcv_mss
++ * fluctuations. --SAW 1998/11/1
++ */
++ mss = icsk->icsk_ack.rcv_mss;
++ free_space = tcp_space(sk);
++ full_space = min_t(int, meta_tp->window_clamp,
++ tcp_full_space(sk));
++
++ if (mss > full_space)
++ mss = full_space;
++
++ if (free_space < (full_space >> 1)) {
++ icsk->icsk_ack.quick = 0;
++
++ if (tcp_memory_pressure)
++ /* TODO this has to be adapted when we support different
++ * MSS's among the subflows.
++ */
++ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
++ 4U * meta_tp->advmss);
++
++ if (free_space < mss)
++ return 0;
++ }
++
++ if (free_space > meta_tp->rcv_ssthresh)
++ free_space = meta_tp->rcv_ssthresh;
++
++ /* Don't do rounding if we are using window scaling, since the
++ * scaled window will not line up with the MSS boundary anyway.
++ */
++ window = meta_tp->rcv_wnd;
++ if (tp->rx_opt.rcv_wscale) {
++ window = free_space;
++
++ /* Advertise enough space so that it won't get scaled away.
++ * Important case: prevent zero window announcement if
++ * 1<<rcv_wscale > mss.
++ */
++ if (((window >> tp->rx_opt.rcv_wscale) <<
++ tp->rx_opt.rcv_wscale) != window)
++ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
++ << tp->rx_opt.rcv_wscale);
++ } else {
++ /* Get the largest window that is a nice multiple of mss.
++ * Window clamp already applied above.
++ * If our current window offering is within 1 mss of the
++ * free space we just keep it. This prevents the divide
++ * and multiply from happening most of the time.
++ * We also don't do any window rounding when the free space
++ * is too small.
++ */
++ if (window <= free_space - mss || window > free_space)
++ window = (free_space / mss) * mss;
++ else if (mss == full_space &&
++ free_space > window + (full_space >> 1))
++ window = free_space;
++ }
++
++ return window;
++}
++
++void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
++ unsigned *remaining)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++
++ opts->options |= OPTION_MPTCP;
++ if (is_master_tp(tp)) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ opts->mp_capable.sender_key = tp->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum;
++ } else {
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
++ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
++ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
++ opts->addr_id = tp->mptcp->loc_id;
++ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
++ }
++}
++
++void mptcp_synack_options(struct request_sock *req,
++ struct tcp_out_options *opts, unsigned *remaining)
++{
++ struct mptcp_request_sock *mtreq;
++ mtreq = mptcp_rsk(req);
++
++ opts->options |= OPTION_MPTCP;
++ /* MPCB not yet set - thus it's a new MPTCP-session */
++ if (!mtreq->is_sub) {
++ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
++ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
++ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
++ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
++ } else {
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
++ opts->mp_join_syns.sender_truncated_mac =
++ mtreq->mptcp_hash_tmac;
++ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
++ opts->mp_join_syns.low_prio = mtreq->low_prio;
++ opts->addr_id = mtreq->loc_id;
++ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
++ }
++}
++
++void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
++ struct tcp_out_options *opts, unsigned *size)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct mptcp_cb *mpcb = tp->mpcb;
++ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
++
++ /* We are coming from tcp_current_mss with the meta_sk as an argument.
++ * It does not make sense to check for the options, because when the
++ * segment gets sent, another subflow will be chosen.
++ */
++ if (!skb && is_meta_sk(sk))
++ return;
++
++ /* In fallback mp_fail-mode, we have to repeat it until the fallback
++ * has been done by the sender
++ */
++ if (unlikely(tp->mptcp->send_mp_fail)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FAIL;
++ *size += MPTCP_SUB_LEN_FAIL;
++ return;
++ }
++
++ if (unlikely(tp->send_mp_fclose)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_FCLOSE;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
++ return;
++ }
++
++ /* 1. If we are the sender of the infinite-mapping, we need the
++ * MPTCPHDR_INF-flag, because a retransmission of the
++ * infinite-announcement still needs the mptcp-option.
++ *
++ * We need infinite_cutoff_seq, because retransmissions from before
++ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
++ * consistent.
++ *
++ * 2. If we are the receiver of the infinite-mapping, we always skip
++ * mptcp-options, because acknowledgments from before the
++ * infinite-mapping point have already been sent out.
++ *
++ * I know, the whole infinite-mapping stuff is ugly...
++ *
++ * TODO: Handle wrapped data-sequence numbers
++ * (even if it's very unlikely)
++ */
++ if (unlikely(mpcb->infinite_mapping_snd) &&
++ ((mpcb->send_infinite_mapping && tcb &&
++ mptcp_is_data_seq(skb) &&
++ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
++ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
++ !mpcb->send_infinite_mapping))
++ return;
++
++ if (unlikely(tp->mptcp->include_mpc)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_CAPABLE |
++ OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
++ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
++ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
++ opts->dss_csum = mpcb->dss_csum;
++
++ if (skb)
++ tp->mptcp->include_mpc = 0;
++ }
++ if (unlikely(tp->mptcp->pre_established)) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
++ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
++ }
++
++ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_DATA_ACK;
++ /* If !skb, we come from tcp_current_mss and thus we always
++ * assume that the DSS-option will be set for the data-packet.
++ */
++ if (skb && !mptcp_is_data_seq(skb)) {
++ *size += MPTCP_SUB_LEN_ACK_ALIGN;
++ } else {
++ /* Doesn't matter, if csum included or not. It will be
++ * either 10 or 12, and thus aligned = 12
++ */
++ *size += MPTCP_SUB_LEN_ACK_ALIGN +
++ MPTCP_SUB_LEN_SEQ_ALIGN;
++ }
++
++ *size += MPTCP_SUB_LEN_DSS_ALIGN;
++ }
++
++ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
++ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
++
++ if (unlikely(tp->mptcp->send_mp_prio) &&
++ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
++ opts->options |= OPTION_MPTCP;
++ opts->mptcp_options |= OPTION_MP_PRIO;
++ if (skb)
++ tp->mptcp->send_mp_prio = 0;
++ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
++ }
++
++ return;
++}
++
++u16 mptcp_select_window(struct sock *sk)
++{
++ u16 new_win = tcp_select_window(sk);
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
++
++ meta_tp->rcv_wnd = tp->rcv_wnd;
++ meta_tp->rcv_wup = meta_tp->rcv_nxt;
++
++ return new_win;
++}
++
++void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
++ const struct tcp_out_options *opts,
++ struct sk_buff *skb)
++{
++ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
++ struct mp_capable *mpc = (struct mp_capable *)ptr;
++
++ mpc->kind = TCPOPT_MPTCP;
++
++ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
++ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
++ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpc->sender_key = opts->mp_capable.sender_key;
++ mpc->receiver_key = opts->mp_capable.receiver_key;
++ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
++ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
++ }
++
++ mpc->sub = MPTCP_SUB_CAPABLE;
++ mpc->ver = 0;
++ mpc->a = opts->dss_csum;
++ mpc->b = 0;
++ mpc->rsv = 0;
++ mpc->h = 1;
++ }
++
++ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
++ struct mp_join *mpj = (struct mp_join *)ptr;
++
++ mpj->kind = TCPOPT_MPTCP;
++ mpj->sub = MPTCP_SUB_JOIN;
++ mpj->rsv = 0;
++
++ if (OPTION_TYPE_SYN & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
++ mpj->u.syn.token = opts->mp_join_syns.token;
++ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
++ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
++ mpj->u.synack.mac =
++ opts->mp_join_syns.sender_truncated_mac;
++ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
++ mpj->b = opts->mp_join_syns.low_prio;
++ mpj->addr_id = opts->addr_id;
++ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
++ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
++ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
++ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
++ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
++ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
++ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
++
++ mpadd->kind = TCPOPT_MPTCP;
++ if (opts->add_addr_v4) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 4;
++ mpadd->addr_id = opts->add_addr4.addr_id;
++ mpadd->u.v4.addr = opts->add_addr4.addr;
++ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
++ } else if (opts->add_addr_v6) {
++ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
++ mpadd->sub = MPTCP_SUB_ADD_ADDR;
++ mpadd->ipver = 6;
++ mpadd->addr_id = opts->add_addr6.addr_id;
++ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
++ sizeof(mpadd->u.v6.addr));
++ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
++ }
++ }
++ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
++ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
++ u8 *addrs_id;
++ int id, len, len_align;
++
++ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
++ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
++
++ mprem->kind = TCPOPT_MPTCP;
++ mprem->len = len;
++ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
++ mprem->rsv = 0;
++ addrs_id = &mprem->addrs_id;
++
++ mptcp_for_each_bit_set(opts->remove_addrs, id)
++ *(addrs_id++) = id;
++
++ /* Fill the rest with NOP's */
++ if (len_align > len) {
++ int i;
++ for (i = 0; i < len_align - len; i++)
++ *(addrs_id++) = TCPOPT_NOP;
++ }
++
++ ptr += len_align >> 2;
++ }
++ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
++ struct mp_fail *mpfail = (struct mp_fail *)ptr;
++
++ mpfail->kind = TCPOPT_MPTCP;
++ mpfail->len = MPTCP_SUB_LEN_FAIL;
++ mpfail->sub = MPTCP_SUB_FAIL;
++ mpfail->rsv1 = 0;
++ mpfail->rsv2 = 0;
++ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
++
++ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
++ }
++ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
++ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
++
++ mpfclose->kind = TCPOPT_MPTCP;
++ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
++ mpfclose->sub = MPTCP_SUB_FCLOSE;
++ mpfclose->rsv1 = 0;
++ mpfclose->rsv2 = 0;
++ mpfclose->key = opts->mp_capable.receiver_key;
++
++ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
++ }
++
++ if (OPTION_DATA_ACK & opts->mptcp_options) {
++ if (!mptcp_is_data_seq(skb))
++ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
++ else
++ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
++ }
++ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
++ struct mp_prio *mpprio = (struct mp_prio *)ptr;
++
++ mpprio->kind = TCPOPT_MPTCP;
++ mpprio->len = MPTCP_SUB_LEN_PRIO;
++ mpprio->sub = MPTCP_SUB_PRIO;
++ mpprio->rsv = 0;
++ mpprio->b = tp->mptcp->low_prio;
++ mpprio->addr_id = TCPOPT_NOP;
++
++ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
++ }
++}
++
++/* Sends the datafin */
++void mptcp_send_fin(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
++ int mss_now;
++
++ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
++ meta_tp->mpcb->passive_close = 1;
++
++ /* Optimization, tack on the FIN if we have a queue of
++ * unsent frames. But be careful about outgoing SACKS
++ * and IP options.
++ */
++ mss_now = mptcp_current_mss(meta_sk);
++
++ if (tcp_send_head(meta_sk) != NULL) {
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ TCP_SKB_CB(skb)->end_seq++;
++ meta_tp->write_seq++;
++ } else {
++ /* Socket is locked, keep trying until memory is available. */
++ for (;;) {
++ skb = alloc_skb_fclone(MAX_TCP_HEADER,
++ meta_sk->sk_allocation);
++ if (skb)
++ break;
++ yield();
++ }
++ /* Reserve space for headers and prepare control bits. */
++ skb_reserve(skb, MAX_TCP_HEADER);
++
++ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
++ TCP_SKB_CB(skb)->end_seq++;
++ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
++ tcp_queue_skb(meta_sk, skb);
++ }
++ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
++}
++
++void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
++
++ if (!mpcb->cnt_subflows)
++ return;
++
++ WARN_ON(meta_tp->send_mp_fclose);
++
++ /* First - select a socket */
++ sk = mptcp_select_ack_sock(meta_sk);
++
++ /* May happen if no subflow is in an appropriate state */
++ if (!sk)
++ return;
++
++ /* We are in infinite mode - just send a reset */
++ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
++ sk->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk->sk_state))
++ tcp_send_active_reset(sk, priority);
++ mptcp_sub_force_close(sk);
++ return;
++ }
++
++ tcp_sk(sk)->send_mp_fclose = 1;
++ /* Reset all other subflows */
++
++ /* tcp_done must be handled with bh disabled */
++ if (!in_serving_softirq())
++ local_bh_disable();
++
++ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
++ if (tcp_sk(sk_it)->send_mp_fclose)
++ continue;
++
++ sk_it->sk_err = ECONNRESET;
++ if (tcp_need_reset(sk_it->sk_state))
++ tcp_send_active_reset(sk_it, GFP_ATOMIC);
++ mptcp_sub_force_close(sk_it);
++ }
++
++ if (!in_serving_softirq())
++ local_bh_enable();
++
++ tcp_send_ack(sk);
++ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
++
++ meta_tp->send_mp_fclose = 1;
++}
++
++static void mptcp_ack_retransmit_timer(struct sock *sk)
++{
++ struct sk_buff *skb;
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct inet_connection_sock *icsk = inet_csk(sk);
++
++ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
++ goto out; /* Routing failure or similar */
++
++ if (!tp->retrans_stamp)
++ tp->retrans_stamp = tcp_time_stamp ? : 1;
++
++ if (tcp_write_timeout(sk)) {
++ tp->mptcp->pre_established = 0;
++ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
++ tp->ops->send_active_reset(sk, GFP_ATOMIC);
++ goto out;
++ }
++
++ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
++ if (skb == NULL) {
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ /* Reserve space for headers and prepare control bits */
++ skb_reserve(skb, MAX_TCP_HEADER);
++ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
++
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!icsk->icsk_retransmits)
++ icsk->icsk_retransmits = 1;
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ return;
++ }
++
++ icsk->icsk_retransmits++;
++ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
++ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
++ jiffies + icsk->icsk_rto);
++ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
++ __sk_dst_reset(sk);
++
++out:;
++}
++
++void mptcp_ack_handler(unsigned long data)
++{
++ struct sock *sk = (struct sock *)data;
++ struct sock *meta_sk = mptcp_meta_sk(sk);
++
++ bh_lock_sock(meta_sk);
++ if (sock_owned_by_user(meta_sk)) {
++ /* Try again later */
++ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
++ jiffies + (HZ / 20));
++ goto out_unlock;
++ }
++
++ if (sk->sk_state == TCP_CLOSE)
++ goto out_unlock;
++ if (!tcp_sk(sk)->mptcp->pre_established)
++ goto out_unlock;
++
++ mptcp_ack_retransmit_timer(sk);
++
++ sk_mem_reclaim(sk);
++
++out_unlock:
++ bh_unlock_sock(meta_sk);
++ sock_put(sk);
++}
++
++/* Similar to tcp_retransmit_skb
++ *
++ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
++ * meta-level.
++ */
++int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct sock *subsk;
++ unsigned int limit, mss_now;
++ int err = -1;
++
++ /* Do not send more than we queued. 1/4 is reserved for possible
++ * copying overhead: fragmentation, tunneling, mangling etc.
++ *
++ * This is a meta-retransmission thus we check on the meta-socket.
++ */
++ if (atomic_read(&meta_sk->sk_wmem_alloc) >
++ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
++ return -EAGAIN;
++ }
++
++ /* We need to make sure that the retransmitted segment can be sent on a
++ * subflow right now. If it is too big, it needs to be fragmented.
++ */
++ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
++ if (!subsk) {
++ /* We want to increase icsk_retransmits, thus return 0, so that
++ * mptcp_retransmit_timer enters the desired branch.
++ */
++ err = 0;
++ goto failed;
++ }
++ mss_now = tcp_current_mss(subsk);
++
++ /* If the segment was cloned (e.g. a meta retransmission), the header
++ * must be expanded/copied so that there is no corruption of TSO
++ * information.
++ */
++ if (skb_unclone(skb, GFP_ATOMIC)) {
++ err = -ENOMEM;
++ goto failed;
++ }
++
++ /* Must have been set by mptcp_write_xmit before */
++ BUG_ON(!tcp_skb_pcount(skb));
++
++ limit = mss_now;
++ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
++ * tcp_write_xmit. Otherwise split-point would return 0.
++ */
++ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
++ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
++ UINT_MAX / mss_now,
++ TCP_NAGLE_OFF);
++
++ if (skb->len > limit &&
++ unlikely(mptcp_fragment(meta_sk, skb, limit,
++ GFP_ATOMIC, 0)))
++ goto failed;
++
++ if (!mptcp_skb_entail(subsk, skb, -1))
++ goto failed;
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
++
++ /* Update global TCP statistics. */
++ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
++
++ /* Diff to tcp_retransmit_skb */
++
++ /* Save stamp of the first retransmit. */
++ if (!meta_tp->retrans_stamp)
++ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
++
++ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++
++ return 0;
++
++failed:
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
++ return err;
++}
++
++/* Similar to tcp_retransmit_timer
++ *
++ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
++ * and that we don't have an srtt estimation at the meta-level.
++ */
++void mptcp_retransmit_timer(struct sock *meta_sk)
++{
++ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
++ struct mptcp_cb *mpcb = meta_tp->mpcb;
++ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
++ int err;
++
++ /* In fallback, retransmission is handled at the subflow-level */
++ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
++ mpcb->send_infinite_mapping)
++ return;
++
++ WARN_ON(tcp_write_queue_empty(meta_sk));
++
++ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
++ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
++ /* Receiver dastardly shrinks window. Our retransmits
++ * become zero probes, but we should not timeout this
++ * connection. If the socket is an orphan, time it out,
++ * we cannot allow such beasts to hang infinitely.
++ */
++ struct inet_sock *meta_inet = inet_sk(meta_sk);
++ if (meta_sk->sk_family == AF_INET) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_inet->inet_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#if IS_ENABLED(CONFIG_IPV6)
++ else if (meta_sk->sk_family == AF_INET6) {
++ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
++ &meta_sk->sk_v6_daddr,
++ ntohs(meta_inet->inet_dport),
++ meta_inet->inet_num, meta_tp->snd_una,
++ meta_tp->snd_nxt);
++ }
++#endif
++ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
++ tcp_write_err(meta_sk);
++ return;
++ }
++
++ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ goto out_reset_timer;
++ }
++
++ if (tcp_write_timeout(meta_sk))
++ return;
++
++ if (meta_icsk->icsk_retransmits == 0)
++ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
++
++ meta_icsk->icsk_ca_state = TCP_CA_Loss;
++
++ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
++ if (err > 0) {
++ /* Retransmission failed because of local congestion,
++ * do not backoff.
++ */
++ if (!meta_icsk->icsk_retransmits)
++ meta_icsk->icsk_retransmits = 1;
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
++ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
++ TCP_RTO_MAX);
++ return;
++ }
++
++ /* Increase the timeout each time we retransmit. Note that
++ * we do not increase the rtt estimate. rto is initialized
++ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
++ * that doubling rto each time is the least we can get away with.
++ * In KA9Q, Karn uses this for the first few times, and then
++ * goes to quadratic. netBSD doubles, but only goes up to *64,
++ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
++ * defined in the protocol as the maximum possible RTT. I guess
++ * we'll have to use something other than TCP to talk to the
++ * University of Mars.
++ *
++ * PAWS allows us longer timeouts and large windows, so once
++ * implemented ftp to mars will work nicely. We will have to fix
++ * the 120 second clamps though!
++ */
++ meta_icsk->icsk_backoff++;
++ meta_icsk->icsk_retransmits++;
++
++out_reset_timer:
++ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
++ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
++ * might be increased if the stream oscillates between thin and thick,
++ * thus the old value might already be too high compared to the value
++ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
++ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
++ * exponential backoff behaviour to avoid continuing to hammer
++ * linear-timeout retransmissions into a black hole.
++ */
++ if (meta_sk->sk_state == TCP_ESTABLISHED &&
++ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
++ tcp_stream_is_thin(meta_tp) &&
++ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
++ meta_icsk->icsk_backoff = 0;
++ /* We cannot do the same as in tcp_write_timer because the
++ * srtt is not set here.
++ */
++ mptcp_set_rto(meta_sk);
++ } else {
++ /* Use normal (exponential) backoff */
++ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
++ }
++ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
++
++ return;
++}
++
++/* Modify values to an mptcp-level for the initial window of new subflows */
++void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
++ __u32 *window_clamp, int wscale_ok,
++ __u8 *rcv_wscale, __u32 init_rcv_wnd,
++ const struct sock *sk)
++{
++ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
++
++ *window_clamp = mpcb->orig_window_clamp;
++ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
++
++ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
++ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
++}
++
++static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ struct sock *sk;
++ u64 rate = 0;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ /* Do not consider subflows without a RTT estimation yet
++ * otherwise this_rate >>> rate.
++ */
++ if (unlikely(!tp->srtt_us))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* If this_mss is smaller than mss, it means that a segment will
++ * be split in two (or more) when pushed on this subflow. If
++ * you consider that mss = 1428 and this_mss = 1420, then two
++ * segments will be generated: a 1420-byte and an 8-byte segment.
++ * The latter introduces a large overhead, as two slots of the
++ * congestion window are used for a single data segment, roughly
++ * halving the potential throughput of this subflow. Indeed,
++ * 1428 bytes will be sent while 2840 could have been sent if
++ * mss == 1420, reducing the throughput by 2840 / 1428.
++ *
++ * The following algorithm takes this overhead into account
++ * when computing the potential throughput that MPTCP can
++ * achieve when generating mss-byte segments.
++ *
++ * The formula is the following:
++ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
++ * Where ratio is computed as follows:
++ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
++ *
++ * ratio gives the reduction factor of the theoretical
++ * throughput a subflow can achieve if MPTCP uses a specific
++ * MSS value.
++ */
++ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
++ max(tp->snd_cwnd, tp->packets_out),
++ (u64)tp->srtt_us *
++ DIV_ROUND_UP(mss, this_mss) * this_mss);
++ rate += this_rate;
++ }
++
++ return rate;
++}
++
++static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
++ unsigned int (*mss_cb)(struct sock *sk))
++{
++ unsigned int mss = 0;
++ u64 rate = 0;
++ struct sock *sk;
++
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_mss;
++ u64 this_rate;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_mss = mss_cb(sk);
++
++ /* Same mss values will produce the same throughput. */
++ if (this_mss == mss)
++ continue;
++
++ /* See whether using this mss value can theoretically improve
++ * the performances.
++ */
++ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
++ if (this_rate >= rate) {
++ mss = this_mss;
++ rate = this_rate;
++ }
++ }
++
++ return mss;
++}
++
++unsigned int mptcp_current_mss(struct sock *meta_sk)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
++
++ /* If no subflow is available, we take a default-mss from the
++ * meta-socket.
++ */
++ return !mss ? tcp_current_mss(meta_sk) : mss;
++}
++
++static unsigned int mptcp_select_size_mss(struct sock *sk)
++{
++ return tcp_sk(sk)->mss_cache;
++}
++
++int mptcp_select_size(const struct sock *meta_sk, bool sg)
++{
++ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
++
++ if (sg) {
++ if (mptcp_sk_can_gso(meta_sk)) {
++ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
++ } else {
++ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
++
++ if (mss >= pgbreak &&
++ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
++ mss = pgbreak;
++ }
++ }
++
++ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
++}
++
++int mptcp_check_snd_buf(const struct tcp_sock *tp)
++{
++ const struct sock *sk;
++ u32 rtt_max = tp->srtt_us;
++ u64 bw_est;
++
++ if (!tp->srtt_us)
++ return tp->reordering + 1;
++
++ mptcp_for_each_sk(tp->mpcb, sk) {
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ if (rtt_max < tcp_sk(sk)->srtt_us)
++ rtt_max = tcp_sk(sk)->srtt_us;
++ }
++
++ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
++ (u64)tp->srtt_us);
++
++ return max_t(unsigned int, (u32)(bw_est >> 16),
++ tp->reordering + 1);
++}
++
++unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
++ int large_allowed)
++{
++ struct sock *sk;
++ u32 xmit_size_goal = 0;
++
++ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
++ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
++ int this_size_goal;
++
++ if (!mptcp_sk_can_send(sk))
++ continue;
++
++ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
++ if (this_size_goal > xmit_size_goal)
++ xmit_size_goal = this_size_goal;
++ }
++ }
++
++ return max(xmit_size_goal, mss_now);
++}
++
++/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
++int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
++{
++ if (skb_cloned(skb)) {
++ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
++ return -ENOMEM;
++ }
++
++ __pskb_trim_head(skb, len);
++
++ TCP_SKB_CB(skb)->seq += len;
++ skb->ip_summed = CHECKSUM_PARTIAL;
++
++ skb->truesize -= len;
++ sk->sk_wmem_queued -= len;
++ sk_mem_uncharge(sk, len);
++ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
++
++ /* Any change of skb->len requires recalculation of tso factor. */
++ if (tcp_skb_pcount(skb) > 1)
++ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
++
++ return 0;
++}
+diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
+new file mode 100644
+index 000000000000..9542f950729f
+--- /dev/null
++++ b/net/mptcp/mptcp_pm.c
+@@ -0,0 +1,169 @@
++/*
++ * MPTCP implementation - MPTCP-subflow-management
++ *
++ * Initial Design & Implementation:
++ * Sébastien Barré <sebastien.barre@uclouvain.be>
++ *
++ * Current Maintainer & Author:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * Additional authors:
++ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
++ * Gregory Detal <gregory.detal@uclouvain.be>
++ * Fabien Duchêne <fabien.duchene@uclouvain.be>
++ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
++ * Lavkesh Lahngir <lavkesh51@gmail.com>
++ * Andreas Ripke <ripke@neclab.eu>
++ * Vlad Dogaru <vlad.dogaru@intel.com>
++ * Octavian Purdila <octavian.purdila@intel.com>
++ * John Ronan <jronan@tssg.org>
++ * Catalin Nicutar <catalin.nicutar@gmail.com>
++ * Brandon Heller <brandonh@stanford.edu>
++ *
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_pm_list_lock);
++static LIST_HEAD(mptcp_pm_list);
++
++static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
++ struct net *net, bool *low_prio)
++{
++ return 0;
++}
++
++struct mptcp_pm_ops mptcp_pm_default = {
++ .get_local_id = mptcp_default_id, /* We do not care */
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
++{
++ struct mptcp_pm_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
++{
++ int ret = 0;
++
++ if (!pm->get_local_id)
++ return -EINVAL;
++
++ spin_lock(&mptcp_pm_list_lock);
++ if (mptcp_pm_find(pm->name)) {
++ pr_notice("%s already registered\n", pm->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
++ pr_info("%s registered\n", pm->name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
++
++void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
++{
++ spin_lock(&mptcp_pm_list_lock);
++ list_del_rcu(&pm->list);
++ spin_unlock(&mptcp_pm_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
++
++void mptcp_get_default_path_manager(char *name)
++{
++ struct mptcp_pm_ops *pm;
++
++ BUG_ON(list_empty(&mptcp_pm_list));
++
++ rcu_read_lock();
++ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
++ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_path_manager(const char *name)
++{
++ struct mptcp_pm_ops *pm;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++#ifdef CONFIG_MODULES
++ if (!pm && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_pm_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_pm_list_lock);
++ pm = mptcp_pm_find(name);
++ }
++#endif
++
++ if (pm) {
++ list_move(&pm->list, &mptcp_pm_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_pm_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_path_manager(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
++ if (try_module_get(pm->owner)) {
++ mpcb->pm_ops = pm;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->pm_ops->owner);
++}
++
++/* Fallback to the default path-manager. */
++void mptcp_fallback_default(struct mptcp_cb *mpcb)
++{
++ struct mptcp_pm_ops *pm;
++
++ mptcp_cleanup_path_manager(mpcb);
++ pm = mptcp_pm_find("default");
++
++ /* Cannot fail - it's the default module */
++ try_module_get(pm->owner);
++ mpcb->pm_ops = pm;
++}
++EXPORT_SYMBOL_GPL(mptcp_fallback_default);
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_path_manager_default(void)
++{
++ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
++}
++late_initcall(mptcp_path_manager_default);
+diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
+new file mode 100644
+index 000000000000..93278f684069
+--- /dev/null
++++ b/net/mptcp/mptcp_rr.c
+@@ -0,0 +1,301 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static unsigned char num_segments __read_mostly = 1;
++module_param(num_segments, byte, 0644);
++MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
++
++static bool cwnd_limited __read_mostly = 1;
++module_param(cwnd_limited, bool, 0644);
++MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
++
++struct rrsched_priv {
++ unsigned char quota;
++};
++
++static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test, bool cwnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ if (!cwnd_test)
++ goto zero_wnd_test;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++zero_wnd_test:
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* We just look for any subflow that is available */
++static struct sock *rr_get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
++ continue;
++
++ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ bestsk = sk;
++ }
++
++ if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue.
++ * (chooses the reinject queue if any segment is waiting in it, otherwise,
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb)
++ *reinject = 1;
++ else
++ skb = tcp_send_head(meta_sk);
++ return skb;
++}
++
++static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk_it, *choose_sk = NULL;
++ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
++ unsigned char split = num_segments;
++ unsigned char iter = 0, full_subs = 0;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ if (*reinject) {
++ *subsk = rr_get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ return skb;
++ }
++
++retry:
++
++ /* First, we look for a subflow that is currently being used */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ iter++;
++
++ /* Is this subflow currently being used? */
++ if (rsp->quota > 0 && rsp->quota < num_segments) {
++ split = num_segments - rsp->quota;
++ choose_sk = sk_it;
++ goto found;
++ }
++
++ /* Or, it's totally unused */
++ if (!rsp->quota) {
++ split = num_segments;
++ choose_sk = sk_it;
++ }
++
++ /* Or, it must then be fully used */
++ if (rsp->quota == num_segments)
++ full_subs++;
++ }
++
++ /* All considered subflows have a full quota, and we considered at
++ * least one.
++ */
++ if (iter && iter == full_subs) {
++ /* So, we restart this round by setting quota to 0 and retry
++ * to find a subflow.
++ */
++ mptcp_for_each_sk(mpcb, sk_it) {
++ struct tcp_sock *tp_it = tcp_sk(sk_it);
++ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
++
++ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
++ continue;
++
++ rsp->quota = 0;
++ }
++
++ goto retry;
++ }
++
++found:
++ if (choose_sk) {
++ unsigned int mss_now;
++ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
++ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
++
++ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
++ return NULL;
++
++ *subsk = choose_sk;
++ mss_now = tcp_current_mss(*subsk);
++ *limit = split * mss_now;
++
++ if (skb->len > mss_now)
++ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
++ else
++ rsp->quota++;
++
++ return skb;
++ }
++
++ return NULL;
++}
++
++static struct mptcp_sched_ops mptcp_sched_rr = {
++ .get_subflow = rr_get_available_subflow,
++ .next_segment = mptcp_rr_next_segment,
++ .name = "roundrobin",
++ .owner = THIS_MODULE,
++};
++
++static int __init rr_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
++
++ if (mptcp_register_scheduler(&mptcp_sched_rr))
++ return -1;
++
++ return 0;
++}
++
++static void rr_unregister(void)
++{
++ mptcp_unregister_scheduler(&mptcp_sched_rr);
++}
++
++module_init(rr_register);
++module_exit(rr_unregister);
++
++MODULE_AUTHOR("Christoph Paasch");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
++MODULE_VERSION("0.89");
+diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
+new file mode 100644
+index 000000000000..6c7ff4eceac1
+--- /dev/null
++++ b/net/mptcp/mptcp_sched.c
+@@ -0,0 +1,493 @@
++/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
++
++#include <linux/module.h>
++#include <net/mptcp.h>
++
++static DEFINE_SPINLOCK(mptcp_sched_list_lock);
++static LIST_HEAD(mptcp_sched_list);
++
++struct defsched_priv {
++ u32 last_rbuf_opti;
++};
++
++static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
++{
++ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
++}
++
++/* Is the sub-socket sk available to send the skb? */
++static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ unsigned int mss_now, space, in_flight;
++
++ /* Set of states for which we are allowed to send data */
++ if (!mptcp_sk_can_send(sk))
++ return false;
++
++ /* We do not send data on this subflow unless it is
++ * fully established, i.e. the 4th ack has been received.
++ */
++ if (tp->mptcp->pre_established)
++ return false;
++
++ if (tp->pf)
++ return false;
++
++ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
++ /* If SACK is disabled, and we got a loss, TCP does not exit
++ * the loss-state until something above high_seq has been acked.
++ * (see tcp_try_undo_recovery)
++ *
++ * high_seq is the snd_nxt at the moment of the RTO. As soon
++ * as we have an RTO, we won't push data on the subflow.
++ * Thus, snd_una can never go beyond high_seq.
++ */
++ if (!tcp_is_reno(tp))
++ return false;
++ else if (tp->snd_una != tp->high_seq)
++ return false;
++ }
++
++ if (!tp->mptcp->fully_established) {
++ /* Make sure that we send in-order data */
++ if (skb && tp->mptcp->second_packet &&
++ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
++ return false;
++ }
++
++ /* If TSQ is already throttling us, do not send on this subflow. When
++ * TSQ gets cleared the subflow becomes eligible again.
++ */
++ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
++ return false;
++
++ in_flight = tcp_packets_in_flight(tp);
++ /* Not even a single spot in the cwnd */
++ if (in_flight >= tp->snd_cwnd)
++ return false;
++
++ /* Now, check if what is queued in the subflow's send-queue
++ * already fills the cwnd.
++ */
++ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
++
++ if (tp->write_seq - tp->snd_nxt > space)
++ return false;
++
++ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
++ return false;
++
++ mss_now = tcp_current_mss(sk);
++
++ /* Don't send on this subflow if we bypass the allowed send-window at
++ * the per-subflow level. Similar to tcp_snd_wnd_test, but with a
++ * manually calculated end_seq (because at this point end_seq is
++ * still at the meta-level).
++ */
++ if (skb && !zero_wnd_test &&
++ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
++ return false;
++
++ return true;
++}
++
++/* Are we not allowed to reinject this skb on tp? */
++static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
++{
++ /* If the skb has already been enqueued in this sk, try to find
++ * another one.
++ */
++ return skb &&
++ /* Has the skb already been enqueued into this subsocket? */
++ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
++}
++
++/* This is the scheduler. This function decides on which flow to send
++ * a given MSS. If all subflows are found to be busy, NULL is returned
++ * The flow is selected based on the shortest RTT.
++ * If all paths have full cong windows, we simply return NULL.
++ *
++ * Additionally, this function is aware of the backup-subflows.
++ */
++static struct sock *get_available_subflow(struct sock *meta_sk,
++ struct sk_buff *skb,
++ bool zero_wnd_test)
++{
++ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
++ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
++ int cnt_backups = 0;
++
++ /* if there is only one subflow, bypass the scheduling function */
++ if (mpcb->cnt_subflows == 1) {
++ bestsk = (struct sock *)mpcb->connection_list;
++ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
++ bestsk = NULL;
++ return bestsk;
++ }
++
++ /* Answer data_fin on same subflow!!! */
++ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
++ skb && mptcp_is_data_fin(skb)) {
++ mptcp_for_each_sk(mpcb, sk) {
++ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
++ mptcp_is_available(sk, skb, zero_wnd_test))
++ return sk;
++ }
++ }
++
++ /* First, find the best subflow */
++ mptcp_for_each_sk(mpcb, sk) {
++ struct tcp_sock *tp = tcp_sk(sk);
++
++ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
++ cnt_backups++;
++
++ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < lowprio_min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ lowprio_min_time_to_peer = tp->srtt_us;
++ lowpriosk = sk;
++ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
++ tp->srtt_us < min_time_to_peer) {
++ if (!mptcp_is_available(sk, skb, zero_wnd_test))
++ continue;
++
++ if (mptcp_dont_reinject_skb(tp, skb)) {
++ backupsk = sk;
++ continue;
++ }
++
++ min_time_to_peer = tp->srtt_us;
++ bestsk = sk;
++ }
++ }
++
++ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
++ sk = lowpriosk;
++ } else if (bestsk) {
++ sk = bestsk;
++ } else if (backupsk) {
++ /* It has been sent on all subflows once - let's give it a
++ * chance again by restarting its pathmask.
++ */
++ if (skb)
++ TCP_SKB_CB(skb)->path_mask = 0;
++ sk = backupsk;
++ }
++
++ return sk;
++}
++
++static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
++{
++ struct sock *meta_sk;
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct tcp_sock *tp_it;
++ struct sk_buff *skb_head;
++ struct defsched_priv *dsp = defsched_get_priv(tp);
++
++ if (tp->mpcb->cnt_subflows == 1)
++ return NULL;
++
++ meta_sk = mptcp_meta_sk(sk);
++ skb_head = tcp_write_queue_head(meta_sk);
++
++ if (!skb_head || skb_head == tcp_send_head(meta_sk))
++ return NULL;
++
++ /* If penalization is optional (coming from mptcp_next_segment())
++ * and we are not send-buffer-limited, we do not penalize. The
++ * retransmission is just an optimization to fix the idle time due
++ * to the delay before we wake up the application.
++ */
++ if (!penal && sk_stream_memory_free(meta_sk))
++ goto retrans;
++
++ /* Only penalize again after an RTT has elapsed */
++ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
++ goto retrans;
++
++ /* Half the cwnd of the slow flow */
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
++ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++ }
++ break;
++ }
++ }
++
++retrans:
++
++ /* Segment not yet injected into this path? Take it!!! */
++ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
++ bool do_retrans = false;
++ mptcp_for_each_tp(tp->mpcb, tp_it) {
++ if (tp_it != tp &&
++ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
++ if (tp_it->snd_cwnd <= 4) {
++ do_retrans = true;
++ break;
++ }
++
++ if (4 * tp->srtt_us >= tp_it->srtt_us) {
++ do_retrans = false;
++ break;
++ } else {
++ do_retrans = true;
++ }
++ }
++ }
++
++ if (do_retrans && mptcp_is_available(sk, skb_head, false))
++ return skb_head;
++ }
++ return NULL;
++}
++
++/* Returns the next segment to be sent from the mptcp meta-queue
++ * (chooses the reinject queue if any segment is waiting in it; otherwise
++ * chooses the normal write queue).
++ * Sets *@reinject to 1 if the returned segment comes from the
++ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
++ * and sets it to -1 if it is a meta-level retransmission to optimize the
++ * receive-buffer.
++ */
++static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
++{
++ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
++ struct sk_buff *skb = NULL;
++
++ *reinject = 0;
++
++ /* If we are in fallback-mode, just take from the meta-send-queue */
++ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
++ return tcp_send_head(meta_sk);
++
++ skb = skb_peek(&mpcb->reinject_queue);
++
++ if (skb) {
++ *reinject = 1;
++ } else {
++ skb = tcp_send_head(meta_sk);
++
++ if (!skb && meta_sk->sk_socket &&
++ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
++ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
++ struct sock *subsk = get_available_subflow(meta_sk, NULL,
++ false);
++ if (!subsk)
++ return NULL;
++
++ skb = mptcp_rcv_buf_optimization(subsk, 0);
++ if (skb)
++ *reinject = -1;
++ }
++ }
++ return skb;
++}
++
++static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
++ int *reinject,
++ struct sock **subsk,
++ unsigned int *limit)
++{
++ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
++ unsigned int mss_now;
++ struct tcp_sock *subtp;
++ u16 gso_max_segs;
++ u32 max_len, max_segs, window, needed;
++
++ /* As we set it, we have to reset it as well. */
++ *limit = 0;
++
++ if (!skb)
++ return NULL;
++
++ *subsk = get_available_subflow(meta_sk, skb, false);
++ if (!*subsk)
++ return NULL;
++
++ subtp = tcp_sk(*subsk);
++ mss_now = tcp_current_mss(*subsk);
++
++ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
++ skb = mptcp_rcv_buf_optimization(*subsk, 1);
++ if (skb)
++ *reinject = -1;
++ else
++ return NULL;
++ }
++
++ /* No splitting required, as we will only send one single segment */
++ if (skb->len <= mss_now)
++ return skb;
++
++ /* The following is similar to tcp_mss_split_point, but
++ * we do not care about Nagle here, because we will anyway
++ * use TCP_NAGLE_PUSH, which overrides this.
++ *
++ * So, we first limit according to the cwnd/gso-size and then according
++ * to the subflow's window.
++ */
++
++ gso_max_segs = (*subsk)->sk_gso_max_segs;
++ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
++ gso_max_segs = 1;
++ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
++ if (!max_segs)
++ return NULL;
++
++ max_len = mss_now * max_segs;
++ window = tcp_wnd_end(subtp) - subtp->write_seq;
++
++ needed = min(skb->len, window);
++ if (max_len <= skb->len)
++ /* Take max_len, which is actually the cwnd/gso-size limit */
++ *limit = max_len;
++ else
++ /* Or, take the window */
++ *limit = needed;
++
++ return skb;
++}
++
++static void defsched_init(struct sock *sk)
++{
++ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
++
++ dsp->last_rbuf_opti = tcp_time_stamp;
++}
++
++struct mptcp_sched_ops mptcp_sched_default = {
++ .get_subflow = get_available_subflow,
++ .next_segment = mptcp_next_segment,
++ .init = defsched_init,
++ .name = "default",
++ .owner = THIS_MODULE,
++};
++
++static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
++{
++ struct mptcp_sched_ops *e;
++
++ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
++ if (strcmp(e->name, name) == 0)
++ return e;
++ }
++
++ return NULL;
++}
++
++int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
++{
++ int ret = 0;
++
++ if (!sched->get_subflow || !sched->next_segment)
++ return -EINVAL;
++
++ spin_lock(&mptcp_sched_list_lock);
++ if (mptcp_sched_find(sched->name)) {
++ pr_notice("%s already registered\n", sched->name);
++ ret = -EEXIST;
++ } else {
++ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
++ pr_info("%s registered\n", sched->name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
++
++void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
++{
++ spin_lock(&mptcp_sched_list_lock);
++ list_del_rcu(&sched->list);
++ spin_unlock(&mptcp_sched_list_lock);
++}
++EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
++
++void mptcp_get_default_scheduler(char *name)
++{
++ struct mptcp_sched_ops *sched;
++
++ BUG_ON(list_empty(&mptcp_sched_list));
++
++ rcu_read_lock();
++ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
++ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
++ rcu_read_unlock();
++}
++
++int mptcp_set_default_scheduler(const char *name)
++{
++ struct mptcp_sched_ops *sched;
++ int ret = -ENOENT;
++
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++#ifdef CONFIG_MODULES
++ if (!sched && capable(CAP_NET_ADMIN)) {
++ spin_unlock(&mptcp_sched_list_lock);
++
++ request_module("mptcp_%s", name);
++ spin_lock(&mptcp_sched_list_lock);
++ sched = mptcp_sched_find(name);
++ }
++#endif
++
++ if (sched) {
++ list_move(&sched->list, &mptcp_sched_list);
++ ret = 0;
++ } else {
++ pr_info("%s is not available\n", name);
++ }
++ spin_unlock(&mptcp_sched_list_lock);
++
++ return ret;
++}
++
++void mptcp_init_scheduler(struct mptcp_cb *mpcb)
++{
++ struct mptcp_sched_ops *sched;
++
++ rcu_read_lock();
++ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
++ if (try_module_get(sched->owner)) {
++ mpcb->sched_ops = sched;
++ break;
++ }
++ }
++ rcu_read_unlock();
++}
++
++/* Manage refcounts on socket close. */
++void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
++{
++ module_put(mpcb->sched_ops->owner);
++}
++
++/* Set default value from kernel configuration at bootup */
++static int __init mptcp_scheduler_default(void)
++{
++ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
++
++ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
++}
++late_initcall(mptcp_scheduler_default);
+diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
+new file mode 100644
+index 000000000000..29ca1d868d17
+--- /dev/null
++++ b/net/mptcp/mptcp_wvegas.c
+@@ -0,0 +1,268 @@
++/*
++ * MPTCP implementation - WEIGHTED VEGAS
++ *
++ * Algorithm design:
++ * Yu Cao <cyAnalyst@126.com>
++ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
++ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
++ *
++ * Implementation:
++ * Yu Cao <cyAnalyst@126.com>
++ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
++ *
++ * Ported to the official MPTCP-kernel:
++ * Christoph Paasch <christoph.paasch@uclouvain.be>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <linux/skbuff.h>
++#include <net/tcp.h>
++#include <net/mptcp.h>
++#include <linux/module.h>
++#include <linux/tcp.h>
++
++static int initial_alpha = 2;
++static int total_alpha = 10;
++static int gamma = 1;
++
++module_param(initial_alpha, int, 0644);
++MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
++module_param(total_alpha, int, 0644);
++MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
++module_param(gamma, int, 0644);
++MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
++
++#define MPTCP_WVEGAS_SCALE 16
++
++/* wVegas variables */
++struct wvegas {
++ u32 beg_snd_nxt; /* right edge during last RTT */
++ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
++
++ u16 cnt_rtt; /* # of RTTs measured within last RTT */
++ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
++ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
++
++ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
++ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
++ int alpha; /* alpha for each subflows */
++
++ u32 queue_delay; /* queue delay */
++};
++
++
++static inline u64 mptcp_wvegas_scale(u32 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static void wvegas_enable(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 1;
++
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++
++ wvegas->instant_rate = 0;
++ wvegas->alpha = initial_alpha;
++ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
++
++ wvegas->queue_delay = 0;
++}
++
++static inline void wvegas_disable(const struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->doing_wvegas_now = 0;
++}
++
++static void mptcp_wvegas_init(struct sock *sk)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ wvegas->base_rtt = 0x7fffffff;
++ wvegas_enable(sk);
++}
++
++static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
++{
++ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
++}
++
++static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
++{
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ u32 vrtt;
++
++ if (rtt_us < 0)
++ return;
++
++ vrtt = rtt_us + 1;
++
++ if (vrtt < wvegas->base_rtt)
++ wvegas->base_rtt = vrtt;
++
++ wvegas->sampled_rtt += vrtt;
++ wvegas->cnt_rtt++;
++}
++
++static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
++{
++ if (ca_state == TCP_CA_Open)
++ wvegas_enable(sk);
++ else
++ wvegas_disable(sk);
++}
++
++static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_CWND_RESTART) {
++ mptcp_wvegas_init(sk);
++ } else if (event == CA_EVENT_LOSS) {
++ struct wvegas *wvegas = inet_csk_ca(sk);
++ wvegas->instant_rate = 0;
++ }
++}
++
++static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
++{
++ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
++}
++
++static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
++{
++ u64 total_rate = 0;
++ struct sock *sub_sk;
++ const struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!mpcb)
++ return wvegas->weight;
++
++
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
++
++ /* sampled_rtt is initialized to 0 */
++ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
++ total_rate += sub_wvegas->instant_rate;
++ }
++
++ if (total_rate && wvegas->instant_rate)
++ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
++ else
++ return wvegas->weight;
++}
++
++static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ struct wvegas *wvegas = inet_csk_ca(sk);
++
++ if (!wvegas->doing_wvegas_now) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (after(ack, wvegas->beg_snd_nxt)) {
++ wvegas->beg_snd_nxt = tp->snd_nxt;
++
++ if (wvegas->cnt_rtt <= 2) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ } else {
++ u32 rtt, diff, q_delay;
++ u64 target_cwnd;
++
++ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
++ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
++
++ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
++
++ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
++ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++
++ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ tcp_slow_start(tp, acked);
++ } else {
++ if (diff >= wvegas->alpha) {
++ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
++ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
++ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
++ }
++ if (diff > wvegas->alpha) {
++ tp->snd_cwnd--;
++ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
++ } else if (diff < wvegas->alpha) {
++ tp->snd_cwnd++;
++ }
++
++ /* Try to drain the link queue if needed */
++ q_delay = rtt - wvegas->base_rtt;
++ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
++ wvegas->queue_delay = q_delay;
++
++ if (q_delay >= 2 * wvegas->queue_delay) {
++ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
++ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
++ wvegas->queue_delay = 0;
++ }
++ }
++
++ if (tp->snd_cwnd < 2)
++ tp->snd_cwnd = 2;
++ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
++ tp->snd_cwnd = tp->snd_cwnd_clamp;
++
++ tp->snd_ssthresh = tcp_current_ssthresh(sk);
++ }
++
++ wvegas->cnt_rtt = 0;
++ wvegas->sampled_rtt = 0;
++ }
++ /* Use normal slow start */
++ else if (tp->snd_cwnd <= tp->snd_ssthresh)
++ tcp_slow_start(tp, acked);
++}
++
++
++static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
++ .init = mptcp_wvegas_init,
++ .ssthresh = tcp_reno_ssthresh,
++ .cong_avoid = mptcp_wvegas_cong_avoid,
++ .pkts_acked = mptcp_wvegas_pkts_acked,
++ .set_state = mptcp_wvegas_state,
++ .cwnd_event = mptcp_wvegas_cwnd_event,
++
++ .owner = THIS_MODULE,
++ .name = "wvegas",
++};
++
++static int __init mptcp_wvegas_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
++ tcp_register_congestion_control(&mptcp_wvegas);
++ return 0;
++}
++
++static void __exit mptcp_wvegas_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_wvegas);
++}
++
++module_init(mptcp_wvegas_register);
++module_exit(mptcp_wvegas_unregister);
++
++MODULE_AUTHOR("Yu Cao, Enhuan Dong");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP wVegas");
++MODULE_VERSION("0.1");
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-06 11:39 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-06 11:39 UTC (permalink / raw
To: gentoo-commits
commit: f2f011b9a8a9057b75a30940d240fd4aaeb7d9e3
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Mon Oct 6 11:39:51 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Mon Oct 6 11:39:51 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=f2f011b9
Remove dup.
---
2500_multipath-tcp-v3.16-872d7f6c6f4e.patch | 19230 --------------------------
1 file changed, 19230 deletions(-)
diff --git a/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch b/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
deleted file mode 100644
index 3000da3..0000000
--- a/2500_multipath-tcp-v3.16-872d7f6c6f4e.patch
+++ /dev/null
@@ -1,19230 +0,0 @@
-diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
-index 768a0fb67dd6..5a46d91a8df9 100644
---- a/drivers/infiniband/hw/cxgb4/cm.c
-+++ b/drivers/infiniband/hw/cxgb4/cm.c
-@@ -3432,7 +3432,7 @@ static void build_cpl_pass_accept_req(struct sk_buff *skb, int stid , u8 tos)
- */
- memset(&tmp_opt, 0, sizeof(tmp_opt));
- tcp_clear_options(&tmp_opt);
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, NULL);
-
- req = (struct cpl_pass_accept_req *)__skb_push(skb, sizeof(*req));
- memset(req, 0, sizeof(*req));
-diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
-index 2faef339d8f2..d86c853ffaad 100644
---- a/include/linux/ipv6.h
-+++ b/include/linux/ipv6.h
-@@ -256,16 +256,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return inet_sk(__sk)->pinet6;
- }
-
--static inline struct request_sock *inet6_reqsk_alloc(struct request_sock_ops *ops)
--{
-- struct request_sock *req = reqsk_alloc(ops);
--
-- if (req)
-- inet_rsk(req)->pktopts = NULL;
--
-- return req;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return (struct raw6_sock *)sk;
-@@ -309,12 +299,6 @@ static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
- return NULL;
- }
-
--static inline struct inet6_request_sock *
-- inet6_rsk(const struct request_sock *rsk)
--{
-- return NULL;
--}
--
- static inline struct raw6_sock *raw6_sk(const struct sock *sk)
- {
- return NULL;
-diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
-index ec89301ada41..99ea4b0e3693 100644
---- a/include/linux/skbuff.h
-+++ b/include/linux/skbuff.h
-@@ -2784,8 +2784,10 @@ static inline bool __skb_checksum_validate_needed(struct sk_buff *skb,
- bool zero_okay,
- __sum16 check)
- {
-- if (skb_csum_unnecessary(skb) || (zero_okay && !check)) {
-- skb->csum_valid = 1;
-+ if (skb_csum_unnecessary(skb)) {
-+ return false;
-+ } else if (zero_okay && !check) {
-+ skb->ip_summed = CHECKSUM_UNNECESSARY;
- return false;
- }
-
-diff --git a/include/linux/tcp.h b/include/linux/tcp.h
-index a0513210798f..7bc2e078d6ca 100644
---- a/include/linux/tcp.h
-+++ b/include/linux/tcp.h
-@@ -53,7 +53,7 @@ static inline unsigned int tcp_optlen(const struct sk_buff *skb)
- /* TCP Fast Open */
- #define TCP_FASTOPEN_COOKIE_MIN 4 /* Min Fast Open Cookie size in bytes */
- #define TCP_FASTOPEN_COOKIE_MAX 16 /* Max Fast Open Cookie size in bytes */
--#define TCP_FASTOPEN_COOKIE_SIZE 8 /* the size employed by this impl. */
-+#define TCP_FASTOPEN_COOKIE_SIZE 4 /* the size employed by this impl. */
-
- /* TCP Fast Open Cookie as stored in memory */
- struct tcp_fastopen_cookie {
-@@ -72,6 +72,51 @@ struct tcp_sack_block {
- u32 end_seq;
- };
-
-+struct tcp_out_options {
-+ u16 options; /* bit field of OPTION_* */
-+ u8 ws; /* window scale, 0 to disable */
-+ u8 num_sack_blocks;/* number of SACK blocks to include */
-+ u8 hash_size; /* bytes in hash_location */
-+ u16 mss; /* 0 to disable */
-+ __u8 *hash_location; /* temporary pointer, overloaded */
-+ __u32 tsval, tsecr; /* need to include OPTION_TS */
-+ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
-+#ifdef CONFIG_MPTCP
-+ u16 mptcp_options; /* bit field of MPTCP related OPTION_* */
-+ u8 dss_csum:1,
-+ add_addr_v4:1,
-+ add_addr_v6:1; /* dss-checksum required? */
-+
-+ union {
-+ struct {
-+ __u64 sender_key; /* sender's key for mptcp */
-+ __u64 receiver_key; /* receiver's key for mptcp */
-+ } mp_capable;
-+
-+ struct {
-+ __u64 sender_truncated_mac;
-+ __u32 sender_nonce;
-+ /* random number of the sender */
-+ __u32 token; /* token for mptcp */
-+ u8 low_prio:1;
-+ } mp_join_syns;
-+ };
-+
-+ struct {
-+ struct in_addr addr;
-+ u8 addr_id;
-+ } add_addr4;
-+
-+ struct {
-+ struct in6_addr addr;
-+ u8 addr_id;
-+ } add_addr6;
-+
-+ u16 remove_addrs; /* list of address id */
-+ u8 addr_id; /* address id (mp_join or add_address) */
-+#endif /* CONFIG_MPTCP */
-+};
-+
- /*These are used to set the sack_ok field in struct tcp_options_received */
- #define TCP_SACK_SEEN (1 << 0) /*1 = peer is SACK capable, */
- #define TCP_FACK_ENABLED (1 << 1) /*1 = FACK is enabled locally*/
-@@ -95,6 +140,9 @@ struct tcp_options_received {
- u16 mss_clamp; /* Maximal mss, negotiated at connection setup */
- };
-
-+struct mptcp_cb;
-+struct mptcp_tcp_sock;
-+
- static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
- {
- rx_opt->tstamp_ok = rx_opt->sack_ok = 0;
-@@ -111,10 +159,7 @@ struct tcp_request_sock_ops;
-
- struct tcp_request_sock {
- struct inet_request_sock req;
--#ifdef CONFIG_TCP_MD5SIG
-- /* Only used by TCP MD5 Signature so far. */
- const struct tcp_request_sock_ops *af_specific;
--#endif
- struct sock *listener; /* needed for TFO */
- u32 rcv_isn;
- u32 snt_isn;
-@@ -130,6 +175,8 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
- return (struct tcp_request_sock *)req;
- }
-
-+struct tcp_md5sig_key;
-+
- struct tcp_sock {
- /* inet_connection_sock has to be the first member of tcp_sock */
- struct inet_connection_sock inet_conn;
-@@ -326,6 +373,37 @@ struct tcp_sock {
- * socket. Used to retransmit SYNACKs etc.
- */
- struct request_sock *fastopen_rsk;
-+
-+ /* MPTCP/TCP-specific callbacks */
-+ const struct tcp_sock_ops *ops;
-+
-+ struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ /* We keep these flags even if CONFIG_MPTCP is not checked, because
-+ * it allows checking MPTCP capability just by checking the mpc flag,
-+ * rather than adding ifdefs everywhere.
-+ */
-+ u16 mpc:1, /* Other end is multipath capable */
-+ inside_tk_table:1, /* Is the tcp_sock inside the token-table? */
-+ send_mp_fclose:1,
-+ request_mptcp:1, /* Did we send out an MP_CAPABLE?
-+ * (this speeds up mptcp_doit() in tcp_recvmsg)
-+ */
-+ mptcp_enabled:1, /* Is MPTCP enabled from the application ? */
-+ pf:1, /* Potentially Failed state: when this flag is set, we
-+ * stop using the subflow
-+ */
-+ mp_killed:1, /* Killed with a tcp_done in mptcp? */
-+ was_meta_sk:1, /* This was a meta sk (in case of reuse) */
-+ is_master_sk,
-+ close_it:1, /* Must close socket in mptcp_data_ready? */
-+ closing:1;
-+ struct mptcp_tcp_sock *mptcp;
-+#ifdef CONFIG_MPTCP
-+ struct hlist_nulls_node tk_table;
-+ u32 mptcp_loc_token;
-+ u64 mptcp_loc_key;
-+#endif /* CONFIG_MPTCP */
- };
-
- enum tsq_flags {
-@@ -337,6 +415,8 @@ enum tsq_flags {
- TCP_MTU_REDUCED_DEFERRED, /* tcp_v{4|6}_err() could not call
- * tcp_v{4|6}_mtu_reduced()
- */
-+ MPTCP_PATH_MANAGER, /* MPTCP deferred creation of new subflows */
-+ MPTCP_SUB_DEFERRED, /* A subflow got deferred - process them */
- };
-
- static inline struct tcp_sock *tcp_sk(const struct sock *sk)
-@@ -355,6 +435,7 @@ struct tcp_timewait_sock {
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *tw_md5_key;
- #endif
-+ struct mptcp_tw *mptcp_tw;
- };
-
- static inline struct tcp_timewait_sock *tcp_twsk(const struct sock *sk)
-diff --git a/include/net/inet6_connection_sock.h b/include/net/inet6_connection_sock.h
-index 74af137304be..83f63033897a 100644
---- a/include/net/inet6_connection_sock.h
-+++ b/include/net/inet6_connection_sock.h
-@@ -27,6 +27,8 @@ int inet6_csk_bind_conflict(const struct sock *sk,
-
- struct dst_entry *inet6_csk_route_req(struct sock *sk, struct flowi6 *fl6,
- const struct request_sock *req);
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize);
-
- struct request_sock *inet6_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
-diff --git a/include/net/inet_common.h b/include/net/inet_common.h
-index fe7994c48b75..780f229f46a8 100644
---- a/include/net/inet_common.h
-+++ b/include/net/inet_common.h
-@@ -1,6 +1,8 @@
- #ifndef _INET_COMMON_H
- #define _INET_COMMON_H
-
-+#include <net/sock.h>
-+
- extern const struct proto_ops inet_stream_ops;
- extern const struct proto_ops inet_dgram_ops;
-
-@@ -13,6 +15,8 @@ struct sock;
- struct sockaddr;
- struct socket;
-
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern);
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern);
- int inet_release(struct socket *sock);
- int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
- int addr_len, int flags);
-diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
-index 7a4313887568..f62159e39839 100644
---- a/include/net/inet_connection_sock.h
-+++ b/include/net/inet_connection_sock.h
-@@ -30,6 +30,7 @@
-
- struct inet_bind_bucket;
- struct tcp_congestion_ops;
-+struct tcp_options_received;
-
- /*
- * Pointers to address related TCP functions
-@@ -243,6 +244,9 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
-
- struct sock *inet_csk_accept(struct sock *sk, int flags, int *err);
-
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize);
-+
- struct request_sock *inet_csk_search_req(const struct sock *sk,
- struct request_sock ***prevp,
- const __be16 rport,
-diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
-index b1edf17bec01..6a32d8d6b85e 100644
---- a/include/net/inet_sock.h
-+++ b/include/net/inet_sock.h
-@@ -86,10 +86,14 @@ struct inet_request_sock {
- wscale_ok : 1,
- ecn_ok : 1,
- acked : 1,
-- no_srccheck: 1;
-+ no_srccheck: 1,
-+ mptcp_rqsk : 1,
-+ saw_mpc : 1;
- kmemcheck_bitfield_end(flags);
-- struct ip_options_rcu *opt;
-- struct sk_buff *pktopts;
-+ union {
-+ struct ip_options_rcu *opt;
-+ struct sk_buff *pktopts;
-+ };
- u32 ir_mark;
- };
-
-diff --git a/include/net/mptcp.h b/include/net/mptcp.h
-new file mode 100644
-index 000000000000..712780fc39e4
---- /dev/null
-+++ b/include/net/mptcp.h
-@@ -0,0 +1,1439 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_H
-+#define _MPTCP_H
-+
-+#include <linux/inetdevice.h>
-+#include <linux/ipv6.h>
-+#include <linux/list.h>
-+#include <linux/net.h>
-+#include <linux/netpoll.h>
-+#include <linux/skbuff.h>
-+#include <linux/socket.h>
-+#include <linux/tcp.h>
-+#include <linux/kernel.h>
-+
-+#include <asm/byteorder.h>
-+#include <asm/unaligned.h>
-+#include <crypto/hash.h>
-+#include <net/tcp.h>
-+
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ #define ntohll(x) be64_to_cpu(x)
-+ #define htonll(x) cpu_to_be64(x)
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ #define ntohll(x) (x)
-+ #define htonll(x) (x)
-+#endif
-+
-+struct mptcp_loc4 {
-+ u8 loc4_id;
-+ u8 low_prio:1;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_rem4 {
-+ u8 rem4_id;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct mptcp_loc6 {
-+ u8 loc6_id;
-+ u8 low_prio:1;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_rem6 {
-+ u8 rem6_id;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_request_sock {
-+ struct tcp_request_sock req;
-+ /* hlist-nulls entry to the hash-table. Depending on whether this is a
-+ * a new MPTCP connection or an additional subflow, the request-socket
-+ * is either in the mptcp_reqsk_tk_htb or mptcp_reqsk_htb.
-+ */
-+ struct hlist_nulls_node hash_entry;
-+
-+ union {
-+ struct {
-+ /* Only on initial subflows */
-+ u64 mptcp_loc_key;
-+ u64 mptcp_rem_key;
-+ u32 mptcp_loc_token;
-+ };
-+
-+ struct {
-+ /* Only on additional subflows */
-+ struct mptcp_cb *mptcp_mpcb;
-+ u32 mptcp_rem_nonce;
-+ u32 mptcp_loc_nonce;
-+ u64 mptcp_hash_tmac;
-+ };
-+ };
-+
-+ u8 loc_id;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 dss_csum:1,
-+ is_sub:1, /* Is this a new subflow? */
-+ low_prio:1, /* Interface set to low-prio? */
-+ rcv_low_prio:1;
-+};
-+
-+struct mptcp_options_received {
-+ u16 saw_mpc:1,
-+ dss_csum:1,
-+ drop_me:1,
-+
-+ is_mp_join:1,
-+ join_ack:1,
-+
-+ saw_low_prio:2, /* 0x1 - low-prio set for this subflow
-+ * 0x2 - low-prio set for another subflow
-+ */
-+ low_prio:1,
-+
-+ saw_add_addr:2, /* Saw at least one add_addr option:
-+ * 0x1: IPv4 - 0x2: IPv6
-+ */
-+ more_add_addr:1, /* Saw one more add-addr. */
-+
-+ saw_rem_addr:1, /* Saw at least one rem_addr option */
-+ more_rem_addr:1, /* Saw one more rem-addr. */
-+
-+ mp_fail:1,
-+ mp_fclose:1;
-+ u8 rem_id; /* Address-id in the MP_JOIN */
-+ u8 prio_addr_id; /* Address-id in the MP_PRIO */
-+
-+ const unsigned char *add_addr_ptr; /* Pointer to add-address option */
-+ const unsigned char *rem_addr_ptr; /* Pointer to rem-address option */
-+
-+ u32 data_ack;
-+ u32 data_seq;
-+ u16 data_len;
-+
-+ u32 mptcp_rem_token;/* Remote token */
-+
-+ /* Key inside the option (from mp_capable or fast_close) */
-+ u64 mptcp_key;
-+
-+ u32 mptcp_recv_nonce;
-+ u64 mptcp_recv_tmac;
-+ u8 mptcp_recv_mac[20];
-+};
-+
-+struct mptcp_tcp_sock {
-+ struct tcp_sock *next; /* Next subflow socket */
-+ struct hlist_node cb_list;
-+ struct mptcp_options_received rx_opt;
-+
-+ /* Those three fields record the current mapping */
-+ u64 map_data_seq;
-+ u32 map_subseq;
-+ u16 map_data_len;
-+ u16 slave_sk:1,
-+ fully_established:1,
-+ establish_increased:1,
-+ second_packet:1,
-+ attached:1,
-+ send_mp_fail:1,
-+ include_mpc:1,
-+ mapping_present:1,
-+ map_data_fin:1,
-+ low_prio:1, /* use this socket as backup */
-+ rcv_low_prio:1, /* Peer sent low-prio option to us */
-+ send_mp_prio:1, /* Trigger to send mp_prio on this socket */
-+ pre_established:1; /* State between sending 3rd ACK and
-+ * receiving the fourth ack of new subflows.
-+ */
-+
-+ /* isn: needed to translate abs to relative subflow seqnums */
-+ u32 snt_isn;
-+ u32 rcv_isn;
-+ u8 path_index;
-+ u8 loc_id;
-+ u8 rem_id;
-+
-+#define MPTCP_SCHED_SIZE 4
-+ u8 mptcp_sched[MPTCP_SCHED_SIZE] __aligned(8);
-+
-+ struct sk_buff *shortcut_ofoqueue; /* Shortcut to the current modified
-+ * skb in the ofo-queue.
-+ */
-+
-+ int init_rcv_wnd;
-+ u32 infinite_cutoff_seq;
-+ struct delayed_work work;
-+ u32 mptcp_loc_nonce;
-+ struct tcp_sock *tp; /* Where is my daddy? */
-+ u32 last_end_data_seq;
-+
-+ /* MP_JOIN subflow: timer for retransmitting the 3rd ack */
-+ struct timer_list mptcp_ack_timer;
-+
-+ /* HMAC of the third ack */
-+ char sender_mac[20];
-+};
-+
-+struct mptcp_tw {
-+ struct list_head list;
-+ u64 loc_key;
-+ u64 rcv_nxt;
-+ struct mptcp_cb __rcu *mpcb;
-+ u8 meta_tw:1,
-+ in_list:1;
-+};
-+
-+#define MPTCP_PM_NAME_MAX 16
-+struct mptcp_pm_ops {
-+ struct list_head list;
-+
-+ /* Signal the creation of a new MPTCP-session. */
-+ void (*new_session)(const struct sock *meta_sk);
-+ void (*release_sock)(struct sock *meta_sk);
-+ void (*fully_established)(struct sock *meta_sk);
-+ void (*new_remote_address)(struct sock *meta_sk);
-+ int (*get_local_id)(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio);
-+ void (*addr_signal)(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts, struct sk_buff *skb);
-+ void (*add_raddr)(struct mptcp_cb *mpcb, const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id);
-+ void (*rem_raddr)(struct mptcp_cb *mpcb, u8 rem_id);
-+ void (*init_subsocket_v4)(struct sock *sk, struct in_addr addr);
-+ void (*init_subsocket_v6)(struct sock *sk, struct in6_addr addr);
-+
-+ char name[MPTCP_PM_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+#define MPTCP_SCHED_NAME_MAX 16
-+struct mptcp_sched_ops {
-+ struct list_head list;
-+
-+ struct sock * (*get_subflow)(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test);
-+ struct sk_buff * (*next_segment)(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit);
-+ void (*init)(struct sock *sk);
-+
-+ char name[MPTCP_SCHED_NAME_MAX];
-+ struct module *owner;
-+};
-+
-+struct mptcp_cb {
-+ /* list of sockets in this multipath connection */
-+ struct tcp_sock *connection_list;
-+ /* list of sockets that need a call to release_cb */
-+ struct hlist_head callback_list;
-+
-+ /* High-order bits of 64-bit sequence numbers */
-+ u32 snd_high_order[2];
-+ u32 rcv_high_order[2];
-+
-+ u16 send_infinite_mapping:1,
-+ in_time_wait:1,
-+ list_rcvd:1, /* XXX TO REMOVE */
-+ addr_signal:1, /* Path-manager wants us to call addr_signal */
-+ dss_csum:1,
-+ server_side:1,
-+ infinite_mapping_rcv:1,
-+ infinite_mapping_snd:1,
-+ dfin_combined:1, /* Was the DFIN combined with subflow-fin? */
-+ passive_close:1,
-+ snd_hiseq_index:1, /* Index in snd_high_order of snd_nxt */
-+ rcv_hiseq_index:1; /* Index in rcv_high_order of rcv_nxt */
-+
-+ /* socket count in this connection */
-+ u8 cnt_subflows;
-+ u8 cnt_established;
-+
-+ struct mptcp_sched_ops *sched_ops;
-+
-+ struct sk_buff_head reinject_queue;
-+ /* First cache-line boundary is here minus 8 bytes. But from the
-+ * reinject-queue only the next and prev pointers are regularly
-+ * accessed. Thus, the whole data-path is on a single cache-line.
-+ */
-+
-+ u64 csum_cutoff_seq;
-+
-+ /***** Start of fields, used for connection closure */
-+ spinlock_t tw_lock;
-+ unsigned char mptw_state;
-+ u8 dfin_path_index;
-+
-+ struct list_head tw_list;
-+
-+ /***** Start of fields, used for subflow establishment and closure */
-+ atomic_t mpcb_refcnt;
-+
-+ /* Mutex needed, because otherwise mptcp_close will complain that the
-+ * socket is owned by the user.
-+ * E.g., mptcp_sub_close_wq is taking the meta-lock.
-+ */
-+ struct mutex mpcb_mutex;
-+
-+ /***** Start of fields, used for subflow establishment */
-+ struct sock *meta_sk;
-+
-+ /* Master socket, also part of the connection_list, this
-+ * socket is the one that the application sees.
-+ */
-+ struct sock *master_sk;
-+
-+ __u64 mptcp_loc_key;
-+ __u64 mptcp_rem_key;
-+ __u32 mptcp_loc_token;
-+ __u32 mptcp_rem_token;
-+
-+#define MPTCP_PM_SIZE 608
-+ u8 mptcp_pm[MPTCP_PM_SIZE] __aligned(8);
-+ struct mptcp_pm_ops *pm_ops;
-+
-+ u32 path_index_bits;
-+ /* Next pi to pick up in case a new path becomes available */
-+ u8 next_path_index;
-+
-+ /* Original snd/rcvbuf of the initial subflow.
-+ * Used for the new subflows on the server-side to allow correct
-+ * autotuning
-+ */
-+ int orig_sk_rcvbuf;
-+ int orig_sk_sndbuf;
-+ u32 orig_window_clamp;
-+
-+ /* Timer for retransmitting SYN/ACK+MP_JOIN */
-+ struct timer_list synack_timer;
-+};
-+
-+#define MPTCP_SUB_CAPABLE 0
-+#define MPTCP_SUB_LEN_CAPABLE_SYN 12
-+#define MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_CAPABLE_ACK 20
-+#define MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN 20
-+
-+#define MPTCP_SUB_JOIN 1
-+#define MPTCP_SUB_LEN_JOIN_SYN 12
-+#define MPTCP_SUB_LEN_JOIN_SYN_ALIGN 12
-+#define MPTCP_SUB_LEN_JOIN_SYNACK 16
-+#define MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN 16
-+#define MPTCP_SUB_LEN_JOIN_ACK 24
-+#define MPTCP_SUB_LEN_JOIN_ACK_ALIGN 24
-+
-+#define MPTCP_SUB_DSS 2
-+#define MPTCP_SUB_LEN_DSS 4
-+#define MPTCP_SUB_LEN_DSS_ALIGN 4
-+
-+/* Lengths for seq and ack are the ones without the generic MPTCP-option header,
-+ * as they are part of the DSS-option.
-+ * To get the total length, just add the different options together.
-+ */
-+#define MPTCP_SUB_LEN_SEQ 10
-+#define MPTCP_SUB_LEN_SEQ_CSUM 12
-+#define MPTCP_SUB_LEN_SEQ_ALIGN 12
-+
-+#define MPTCP_SUB_LEN_SEQ_64 14
-+#define MPTCP_SUB_LEN_SEQ_CSUM_64 16
-+#define MPTCP_SUB_LEN_SEQ_64_ALIGN 16
-+
-+#define MPTCP_SUB_LEN_ACK 4
-+#define MPTCP_SUB_LEN_ACK_ALIGN 4
-+
-+#define MPTCP_SUB_LEN_ACK_64 8
-+#define MPTCP_SUB_LEN_ACK_64_ALIGN 8
-+
-+/* This is the "default" option-length we will send out most often.
-+ * MPTCP DSS-header
-+ * 32-bit data sequence number
-+ * 32-bit data ack
-+ *
-+ * It is necessary to calculate the effective MSS we will be using when
-+ * sending data.
-+ */
-+#define MPTCP_SUB_LEN_DSM_ALIGN (MPTCP_SUB_LEN_DSS_ALIGN + \
-+ MPTCP_SUB_LEN_SEQ_ALIGN + \
-+ MPTCP_SUB_LEN_ACK_ALIGN)
-+
-+#define MPTCP_SUB_ADD_ADDR 3
-+#define MPTCP_SUB_LEN_ADD_ADDR4 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6 20
-+#define MPTCP_SUB_LEN_ADD_ADDR4_ALIGN 8
-+#define MPTCP_SUB_LEN_ADD_ADDR6_ALIGN 20
-+
-+#define MPTCP_SUB_REMOVE_ADDR 4
-+#define MPTCP_SUB_LEN_REMOVE_ADDR 4
-+
-+#define MPTCP_SUB_PRIO 5
-+#define MPTCP_SUB_LEN_PRIO 3
-+#define MPTCP_SUB_LEN_PRIO_ADDR 4
-+#define MPTCP_SUB_LEN_PRIO_ALIGN 4
-+
-+#define MPTCP_SUB_FAIL 6
-+#define MPTCP_SUB_LEN_FAIL 12
-+#define MPTCP_SUB_LEN_FAIL_ALIGN 12
-+
-+#define MPTCP_SUB_FCLOSE 7
-+#define MPTCP_SUB_LEN_FCLOSE 12
-+#define MPTCP_SUB_LEN_FCLOSE_ALIGN 12
-+
-+
-+#define OPTION_MPTCP (1 << 5)
-+
-+#ifdef CONFIG_MPTCP
-+
-+/* Used for checking if the mptcp initialization has been successful */
-+extern bool mptcp_init_failed;
-+
-+/* MPTCP options */
-+#define OPTION_TYPE_SYN (1 << 0)
-+#define OPTION_TYPE_SYNACK (1 << 1)
-+#define OPTION_TYPE_ACK (1 << 2)
-+#define OPTION_MP_CAPABLE (1 << 3)
-+#define OPTION_DATA_ACK (1 << 4)
-+#define OPTION_ADD_ADDR (1 << 5)
-+#define OPTION_MP_JOIN (1 << 6)
-+#define OPTION_MP_FAIL (1 << 7)
-+#define OPTION_MP_FCLOSE (1 << 8)
-+#define OPTION_REMOVE_ADDR (1 << 9)
-+#define OPTION_MP_PRIO (1 << 10)
-+
-+/* MPTCP flags: both TX and RX */
-+#define MPTCPHDR_SEQ 0x01 /* DSS.M option is present */
-+#define MPTCPHDR_FIN 0x02 /* DSS.F option is present */
-+#define MPTCPHDR_SEQ64_INDEX 0x04 /* index of seq in mpcb->snd_high_order */
-+/* MPTCP flags: RX only */
-+#define MPTCPHDR_ACK 0x08
-+#define MPTCPHDR_SEQ64_SET 0x10 /* Did we receive a 64-bit seq number? */
-+#define MPTCPHDR_SEQ64_OFO 0x20 /* Is it not in our circular array? */
-+#define MPTCPHDR_DSS_CSUM 0x40
-+#define MPTCPHDR_JOIN 0x80
-+/* MPTCP flags: TX only */
-+#define MPTCPHDR_INF 0x08
-+
-+struct mptcp_option {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_capable {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ver:4,
-+ sub:4;
-+ __u8 h:1,
-+ rsv:5,
-+ b:1,
-+ a:1;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ver:4;
-+ __u8 a:1,
-+ b:1,
-+ rsv:5,
-+ h:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 sender_key;
-+ __u64 receiver_key;
-+} __attribute__((__packed__));
-+
-+struct mp_join {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ u32 token;
-+ u32 nonce;
-+ } syn;
-+ struct {
-+ __u64 mac;
-+ u32 nonce;
-+ } synack;
-+ struct {
-+ __u8 mac[20];
-+ } ack;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_dss {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ A:1,
-+ a:1,
-+ M:1,
-+ m:1,
-+ F:1,
-+ rsv2:3;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:3,
-+ F:1,
-+ m:1,
-+ M:1,
-+ a:1,
-+ A:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+};
-+
-+struct mp_add_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 ipver:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ ipver:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+ union {
-+ struct {
-+ struct in_addr addr;
-+ __be16 port;
-+ } v4;
-+ struct {
-+ struct in6_addr addr;
-+ __be16 port;
-+ } v6;
-+ } u;
-+} __attribute__((__packed__));
-+
-+struct mp_remove_addr {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 rsv:4,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:4;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ /* list of addr_id */
-+ __u8 addrs_id;
-+};
-+
-+struct mp_fail {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __be64 data_seq;
-+} __attribute__((__packed__));
-+
-+struct mp_fclose {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u16 rsv1:4,
-+ sub:4,
-+ rsv2:8;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u16 sub:4,
-+ rsv1:4,
-+ rsv2:8;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u64 key;
-+} __attribute__((__packed__));
-+
-+struct mp_prio {
-+ __u8 kind;
-+ __u8 len;
-+#if defined(__LITTLE_ENDIAN_BITFIELD)
-+ __u8 b:1,
-+ rsv:3,
-+ sub:4;
-+#elif defined(__BIG_ENDIAN_BITFIELD)
-+ __u8 sub:4,
-+ rsv:3,
-+ b:1;
-+#else
-+#error "Adjust your <asm/byteorder.h> defines"
-+#endif
-+ __u8 addr_id;
-+} __attribute__((__packed__));
-+
-+static inline int mptcp_sub_len_dss(const struct mp_dss *m, const int csum)
-+{
-+ return 4 + m->A * (4 + m->a * 4) + m->M * (10 + m->m * 4 + csum * 2);
-+}
-+
-+#define MPTCP_APP 2
-+
-+extern int sysctl_mptcp_enabled;
-+extern int sysctl_mptcp_checksum;
-+extern int sysctl_mptcp_debug;
-+extern int sysctl_mptcp_syn_retries;
-+
-+extern struct workqueue_struct *mptcp_wq;
-+
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ if (unlikely(sysctl_mptcp_debug)) \
-+ pr_err(__FILE__ ": " fmt, ##args); \
-+ } while (0)
-+
-+/* Iterates over all subflows */
-+#define mptcp_for_each_tp(mpcb, tp) \
-+ for ((tp) = (mpcb)->connection_list; (tp); (tp) = (tp)->mptcp->next)
-+
-+#define mptcp_for_each_sk(mpcb, sk) \
-+ for ((sk) = (struct sock *)(mpcb)->connection_list; \
-+ sk; \
-+ sk = (struct sock *)tcp_sk(sk)->mptcp->next)
-+
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp) \
-+ for (__sk = (struct sock *)(__mpcb)->connection_list, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL; \
-+ __sk; \
-+ __sk = __temp, \
-+ __temp = __sk ? (struct sock *)tcp_sk(__sk)->mptcp->next : NULL)
-+
-+/* Iterates over all bits set to 1 in a bitset */
-+#define mptcp_for_each_bit_set(b, i) \
-+ for (i = ffs(b) - 1; i >= 0; i = ffs(b >> (i + 1) << (i + 1)) - 1)
-+
-+#define mptcp_for_each_bit_unset(b, i) \
-+ mptcp_for_each_bit_set(~b, i)
-+
-+extern struct lock_class_key meta_key;
-+extern struct lock_class_key meta_slock_key;
-+extern u32 mptcp_secret[MD5_MESSAGE_BYTES / 4];
-+
-+/* This is needed to ensure that two subsequent key/nonce-generations result in
-+ * different keys/nonces if the IPs and ports are the same.
-+ */
-+extern u32 mptcp_seed;
-+
-+#define MPTCP_HASH_SIZE 1024
-+
-+extern struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+extern struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+extern spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* Lock, protecting the two hash-tables that hold the token. Namely,
-+ * mptcp_reqsk_tk_htb and tk_hashtable
-+ */
-+extern spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+/* Request-sockets can be hashed in the tk_htb for collision-detection or in
-+ * the regular htb for join-connections. We need to define different NULLS
-+ * values so that we can correctly detect a request-socket that has been
-+ * recycled. See also c25eb3bfb9729.
-+ */
-+#define MPTCP_REQSK_NULLS_BASE (1U << 29)
-+
-+
-+void mptcp_data_ready(struct sock *sk);
-+void mptcp_write_space(struct sock *sk);
-+
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk);
-+void mptcp_ofo_queue(struct sock *meta_sk);
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp);
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied);
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags);
-+void mptcp_del_sock(struct sock *sk);
-+void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk);
-+void mptcp_reinject_data(struct sock *orig_sk, int clone_it);
-+void mptcp_update_sndbuf(const struct tcp_sock *tp);
-+void mptcp_send_fin(struct sock *meta_sk);
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority);
-+bool mptcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt);
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining);
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size);
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb);
-+void mptcp_close(struct sock *meta_sk, long timeout);
-+int mptcp_doit(struct sock *sk);
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window);
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req);
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev);
-+struct sock *mptcp_check_req_child(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt);
-+u32 __mptcp_select_window(struct sock *sk);
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+unsigned int mptcp_current_mss(struct sock *meta_sk);
-+int mptcp_select_size(const struct sock *meta_sk, bool sg);
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn);
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out);
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk);
-+void mptcp_fin(struct sock *meta_sk);
-+void mptcp_retransmit_timer(struct sock *meta_sk);
-+int mptcp_write_wakeup(struct sock *meta_sk);
-+void mptcp_sub_close_wq(struct work_struct *work);
-+void mptcp_sub_close(struct sock *sk, unsigned long delay);
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk);
-+void mptcp_fallback_meta_sk(struct sock *meta_sk);
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_ack_handler(unsigned long);
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time);
-+int mptcp_check_snd_buf(const struct tcp_sock *tp);
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb);
-+void __init mptcp_init(void);
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len);
-+void mptcp_destroy_sock(struct sock *sk);
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt);
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed);
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw);
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw);
-+void mptcp_time_wait(struct sock *sk, int state, int timeo);
-+void mptcp_disconnect(struct sock *sk);
-+bool mptcp_should_expand_sndbuf(const struct sock *sk);
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb);
-+void mptcp_tsq_flags(struct sock *sk);
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk);
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb);
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp);
-+void mptcp_hash_remove(struct tcp_sock *meta_tp);
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token);
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw);
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net);
-+void mptcp_reqsk_destructor(struct request_sock *req);
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb);
-+int mptcp_check_req(struct sk_buff *skb, struct net *net);
-+void mptcp_connect_init(struct sock *sk);
-+void mptcp_sub_force_close(struct sock *sk);
-+int mptcp_sub_len_remove_addr_align(u16 bitfield);
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb);
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb);
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb);
-+void mptcp_init_congestion_control(struct sock *sk);
-+
-+/* MPTCP-path-manager registration/initialization functions */
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm);
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb);
-+void mptcp_fallback_default(struct mptcp_cb *mpcb);
-+void mptcp_get_default_path_manager(char *name);
-+int mptcp_set_default_path_manager(const char *name);
-+extern struct mptcp_pm_ops mptcp_pm_default;
-+
-+/* MPTCP-scheduler registration/initialization functions */
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched);
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb);
-+void mptcp_get_default_scheduler(char *name);
-+int mptcp_set_default_scheduler(const char *name);
-+extern struct mptcp_sched_ops mptcp_sched_default;
-+
-+static inline void mptcp_reset_synack_timer(struct sock *meta_sk,
-+ unsigned long len)
-+{
-+ sk_reset_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer,
-+ jiffies + len);
-+}
-+
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk)
-+{
-+ sk_stop_timer(meta_sk, &tcp_sk(meta_sk)->mpcb->synack_timer);
-+}
-+
-+static inline bool is_mptcp_enabled(const struct sock *sk)
-+{
-+ if (!sysctl_mptcp_enabled || mptcp_init_failed)
-+ return false;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return false;
-+
-+ return true;
-+}
-+
-+static inline int mptcp_pi_to_flag(int pi)
-+{
-+ return 1 << (pi - 1);
-+}
-+
-+static inline
-+struct mptcp_request_sock *mptcp_rsk(const struct request_sock *req)
-+{
-+ return (struct mptcp_request_sock *)req;
-+}
-+
-+static inline
-+struct request_sock *rev_mptcp_rsk(const struct mptcp_request_sock *req)
-+{
-+ return (struct request_sock *)req;
-+}
-+
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ struct sock *sk_it;
-+
-+ if (tcp_sk(sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it) {
-+ if (!(sk_it->sk_route_caps & NETIF_F_SG) ||
-+ !(sk_it->sk_route_caps & NETIF_F_ALL_CSUM))
-+ return false;
-+ }
-+
-+ return true;
-+}
-+
-+static inline void mptcp_push_pending_frames(struct sock *meta_sk)
-+{
-+ /* We check packets_out and send-head here. TCP only checks the
-+ * send-head. But MPTCP also checks packets_out, as this is an
-+ * indication that we might want to do opportunistic reinjection.
-+ */
-+ if (tcp_sk(meta_sk)->packets_out || tcp_send_head(meta_sk)) {
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
-+
-+ /* We don't care about the MSS, because it will be set in
-+ * mptcp_write_xmit.
-+ */
-+ __tcp_push_pending_frames(meta_sk, 0, tp->nonagle);
-+ }
-+}
-+
-+static inline void mptcp_send_reset(struct sock *sk)
-+{
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk);
-+}
-+
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ;
-+}
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_FIN;
-+}
-+
-+/* Is it a data-fin while in infinite mapping mode?
-+ * In infinite mode, a subflow-fin is in fact a data-fin.
-+ */
-+static inline bool mptcp_is_data_fin2(const struct sk_buff *skb,
-+ const struct tcp_sock *tp)
-+{
-+ return mptcp_is_data_fin(skb) ||
-+ (tp->mpcb->infinite_mapping_rcv && tcp_hdr(skb)->fin);
-+}
-+
-+static inline u8 mptcp_get_64_bit(u64 data_seq, struct mptcp_cb *mpcb)
-+{
-+ u64 data_seq_high = (u32)(data_seq >> 32);
-+
-+ if (mpcb->rcv_high_order[0] == data_seq_high)
-+ return 0;
-+ else if (mpcb->rcv_high_order[1] == data_seq_high)
-+ return MPTCPHDR_SEQ64_INDEX;
-+ else
-+ return MPTCPHDR_SEQ64_OFO;
-+}
-+
-+/* Sets the data_seq and returns pointer to the in-skb field of the data_seq.
-+ * If the packet has a 64-bit dseq, the pointer points to the last 32 bits.
-+ */
-+static inline __u32 *mptcp_skb_set_data_seq(const struct sk_buff *skb,
-+ u32 *data_seq,
-+ struct mptcp_cb *mpcb)
-+{
-+ __u32 *ptr = (__u32 *)(skb_transport_header(skb) + TCP_SKB_CB(skb)->dss_off);
-+
-+ if (TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_SEQ64_SET) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ if (mpcb)
-+ TCP_SKB_CB(skb)->mptcp_flags |= mptcp_get_64_bit(data_seq64, mpcb);
-+
-+ *data_seq = (u32)data_seq64;
-+ ptr++;
-+ } else {
-+ *data_seq = get_unaligned_be32(ptr);
-+ }
-+
-+ return ptr;
-+}
-+
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return tcp_sk(sk)->meta_sk;
-+}
-+
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tcp_sk(tp->meta_sk);
-+}
-+
-+static inline int is_meta_tp(const struct tcp_sock *tp)
-+{
-+ return tp->mpcb && mptcp_meta_tp(tp) == tp;
-+}
-+
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP &&
-+ mptcp(tcp_sk(sk)) && mptcp_meta_sk(sk) == sk;
-+}
-+
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return !mptcp(tp) || (!tp->mptcp->slave_sk && !is_meta_tp(tp));
-+}
-+
-+static inline void mptcp_hash_request_remove(struct request_sock *req)
-+{
-+ int in_softirq = 0;
-+
-+ if (hlist_nulls_unhashed(&mptcp_rsk(req)->hash_entry))
-+ return;
-+
-+ if (in_softirq()) {
-+ spin_lock(&mptcp_reqsk_hlock);
-+ in_softirq = 1;
-+ } else {
-+ spin_lock_bh(&mptcp_reqsk_hlock);
-+ }
-+
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+
-+ if (in_softirq)
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ else
-+ spin_unlock_bh(&mptcp_reqsk_hlock);
-+}
-+
-+static inline void mptcp_init_mp_opt(struct mptcp_options_received *mopt)
-+{
-+ mopt->saw_mpc = 0;
-+ mopt->dss_csum = 0;
-+ mopt->drop_me = 0;
-+
-+ mopt->is_mp_join = 0;
-+ mopt->join_ack = 0;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->low_prio = 0;
-+
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp)
-+{
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ mopt->saw_low_prio = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_add_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->join_ack = 0;
-+ mopt->mp_fail = 0;
-+ mopt->mp_fclose = 0;
-+}
-+
-+static inline __be32 mptcp_get_highorder_sndbits(const struct sk_buff *skb,
-+ const struct mptcp_cb *mpcb)
-+{
-+ return htonl(mpcb->snd_high_order[(TCP_SKB_CB(skb)->mptcp_flags &
-+ MPTCPHDR_SEQ64_INDEX) ? 1 : 0]);
-+}
-+
-+static inline u64 mptcp_get_data_seq_64(const struct mptcp_cb *mpcb, int index,
-+ u32 data_seq_32)
-+{
-+ return ((u64)mpcb->rcv_high_order[index] << 32) | data_seq_32;
-+}
-+
-+static inline u64 mptcp_get_rcv_nxt_64(const struct tcp_sock *meta_tp)
-+{
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ return mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_nxt);
-+}
-+
-+static inline void mptcp_check_sndseq_wrap(struct tcp_sock *meta_tp, int inc)
-+{
-+ if (unlikely(meta_tp->snd_nxt > meta_tp->snd_nxt + inc)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] += 2;
-+ }
-+}
-+
-+static inline void mptcp_check_rcvseq_wrap(struct tcp_sock *meta_tp,
-+ u32 old_rcv_nxt)
-+{
-+ if (unlikely(old_rcv_nxt > meta_tp->rcv_nxt)) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ mpcb->rcv_high_order[mpcb->rcv_hiseq_index] += 2;
-+ mpcb->rcv_hiseq_index = mpcb->rcv_hiseq_index ? 0 : 1;
-+ }
-+}
-+
-+static inline int mptcp_sk_can_send(const struct sock *sk)
-+{
-+ return tcp_passive_fastopen(sk) ||
-+ ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
-+ !tcp_sk(sk)->mptcp->pre_established);
-+}
-+
-+static inline int mptcp_sk_can_recv(const struct sock *sk)
-+{
-+ return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2);
-+}
-+
-+static inline int mptcp_sk_can_send_ack(const struct sock *sk)
-+{
-+ return !((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV |
-+ TCPF_CLOSE | TCPF_LISTEN)) &&
-+ !tcp_sk(sk)->mptcp->pre_established;
-+}
-+
-+/* Only support GSO if all subflows support it */
-+static inline bool mptcp_sk_can_gso(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!sk_can_gso(sk))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ struct sock *sk;
-+
-+ if (tcp_sk(meta_sk)->mpcb->dss_csum)
-+ return false;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+ if (!(sk->sk_route_caps & NETIF_F_SG))
-+ return false;
-+ }
-+ return true;
-+}
-+
-+static inline void mptcp_set_rto(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *sk_it;
-+ struct inet_connection_sock *micsk = inet_csk(mptcp_meta_sk(sk));
-+ __u32 max_rto = 0;
-+
-+ /* We are in recovery-phase on the MPTCP-level. Do not update the
-+ * RTO, because this would kill exponential backoff.
-+ */
-+ if (micsk->icsk_retransmits)
-+ return;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send(sk_it) &&
-+ inet_csk(sk_it)->icsk_rto > max_rto)
-+ max_rto = inet_csk(sk_it)->icsk_rto;
-+ }
-+ if (max_rto) {
-+ micsk->icsk_rto = max_rto << 1;
-+
-+ /* A successful rto-measurement - reset backoff counter */
-+ micsk->icsk_backoff = 0;
-+ }
-+}
-+
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return sysctl_mptcp_syn_retries;
-+}
-+
-+static inline void mptcp_sub_close_passive(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(meta_sk);
-+
-+ /* Only close, if the app did a send-shutdown (passive close), and we
-+ * received the data-ack of the data-fin.
-+ */
-+ if (tp->mpcb->passive_close && meta_tp->snd_una == meta_tp->write_seq)
-+ mptcp_sub_close(sk, 0);
-+}
-+
-+static inline bool mptcp_fallback_infinite(struct sock *sk, int flag)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If data has been acknowledged on the meta-level, fully_established
-+ * will have been set before and thus we will not fall back to infinite
-+ * mapping.
-+ */
-+ if (likely(tp->mptcp->fully_established))
-+ return false;
-+
-+ if (!(flag & MPTCP_FLAG_DATA_ACKED))
-+ return false;
-+
-+ /* Don't fall back twice ;) */
-+ if (tp->mpcb->infinite_mapping_snd)
-+ return false;
-+
-+ pr_err("%s %#x will fallback - pi %d, src %pI4 dst %pI4 from %pS\n",
-+ __func__, tp->mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ &inet_sk(sk)->inet_saddr, &inet_sk(sk)->inet_daddr,
-+ __builtin_return_address(0));
-+ if (!is_master_tp(tp))
-+ return true;
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+
-+ return false;
-+}
-+
-+/* Find the first index whose bit in the bit-field == 0 */
-+static inline u8 mptcp_set_new_pathindex(struct mptcp_cb *mpcb)
-+{
-+ u8 base = mpcb->next_path_index;
-+ int i;
-+
-+ /* Start at 1, because 0 is reserved for the meta-sk */
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits >> base, i) {
-+ if (i + base < 1)
-+ continue;
-+ if (i + base >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ i += base;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+ mptcp_for_each_bit_unset(mpcb->path_index_bits, i) {
-+ if (i >= sizeof(mpcb->path_index_bits) * 8)
-+ break;
-+ if (i < 1)
-+ continue;
-+ mpcb->path_index_bits |= (1 << i);
-+ mpcb->next_path_index = i + 1;
-+ return i;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline bool mptcp_v6_is_v4_mapped(const struct sock *sk)
-+{
-+ return sk->sk_family == AF_INET6 &&
-+ ipv6_addr_type(&inet6_sk(sk)->saddr) == IPV6_ADDR_MAPPED;
-+}
-+
-+/* TCP and MPTCP mpc flag-depending functions */
-+u16 mptcp_select_window(struct sock *sk);
-+void mptcp_init_buffer_space(struct sock *sk);
-+void mptcp_tcp_set_rto(struct sock *sk);
-+
-+/* TCP and MPTCP flag-depending functions */
-+bool mptcp_prune_ofo_queue(struct sock *sk);
-+
-+#else /* CONFIG_MPTCP */
-+#define mptcp_debug(fmt, args...) \
-+ do { \
-+ } while (0)
-+
-+/* Without MPTCP, we just do one iteration
-+ * over the only socket available. This assumes that
-+ * the sk/tp arg is the socket in that case.
-+ */
-+#define mptcp_for_each_sk(mpcb, sk)
-+#define mptcp_for_each_sk_safe(__mpcb, __sk, __temp)
-+
-+static inline bool mptcp_is_data_fin(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline bool mptcp_is_data_seq(const struct sk_buff *skb)
-+{
-+ return false;
-+}
-+static inline struct sock *mptcp_meta_sk(const struct sock *sk)
-+{
-+ return NULL;
-+}
-+static inline struct tcp_sock *mptcp_meta_tp(const struct tcp_sock *tp)
-+{
-+ return NULL;
-+}
-+static inline int is_meta_sk(const struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int is_master_tp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_del_sock(const struct sock *sk) {}
-+static inline void mptcp_update_metasocket(struct sock *sock, const struct sock *meta_sk) {}
-+static inline void mptcp_reinject_data(struct sock *orig_sk, int clone_it) {}
-+static inline void mptcp_update_sndbuf(const struct tcp_sock *tp) {}
-+static inline void mptcp_clean_rtx_infinite(const struct sk_buff *skb,
-+ const struct sock *sk) {}
-+static inline void mptcp_sub_close(struct sock *sk, unsigned long delay) {}
-+static inline void mptcp_set_rto(const struct sock *sk) {}
-+static inline void mptcp_send_fin(const struct sock *meta_sk) {}
-+static inline void mptcp_parse_options(const uint8_t *ptr, const int opsize,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_syn_options(const struct sock *sk,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+static inline void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts,
-+ unsigned *remaining) {}
-+
-+static inline void mptcp_established_options(struct sock *sk,
-+ struct sk_buff *skb,
-+ struct tcp_out_options *opts,
-+ unsigned *size) {}
-+static inline void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb) {}
-+static inline void mptcp_close(struct sock *meta_sk, long timeout) {}
-+static inline int mptcp_doit(struct sock *sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_req_fastopen(struct sock *child,
-+ struct request_sock *req)
-+{
-+ return 1;
-+}
-+static inline int mptcp_check_req_master(const struct sock *sk,
-+ const struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ return 1;
-+}
-+static inline struct sock *mptcp_check_req_child(struct sock *sk,
-+ struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return NULL;
-+}
-+static inline unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ return 0;
-+}
-+static inline int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ return 0;
-+}
-+static inline void mptcp_sub_close_passive(struct sock *sk) {}
-+static inline bool mptcp_fallback_infinite(const struct sock *sk, int flag)
-+{
-+ return false;
-+}
-+static inline void mptcp_init_mp_opt(const struct mptcp_options_received *mopt) {}
-+static inline int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ return 0;
-+}
-+static inline int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+static inline int mptcp_sysctl_syn_retries(void)
-+{
-+ return 0;
-+}
-+static inline void mptcp_send_reset(const struct sock *sk) {}
-+static inline int mptcp_handle_options(struct sock *sk,
-+ const struct tcphdr *th,
-+ struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+static inline void mptcp_reset_mopt(struct tcp_sock *tp) {}
-+static inline void __init mptcp_init(void) {}
-+static inline int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_sk_can_gso(const struct sock *sk)
-+{
-+ return false;
-+}
-+static inline bool mptcp_can_sg(const struct sock *meta_sk)
-+{
-+ return false;
-+}
-+static inline unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk,
-+ u32 mss_now, int large_allowed)
-+{
-+ return 0;
-+}
-+static inline void mptcp_destroy_sock(struct sock *sk) {}
-+static inline int mptcp_rcv_synsent_state_process(struct sock *sk,
-+ struct sock **skptr,
-+ struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ return 0;
-+}
-+static inline bool mptcp_can_sendpage(struct sock *sk)
-+{
-+ return false;
-+}
-+static inline int mptcp_init_tw_sock(struct sock *sk,
-+ struct tcp_timewait_sock *tw)
-+{
-+ return 0;
-+}
-+static inline void mptcp_twsk_destructor(struct tcp_timewait_sock *tw) {}
-+static inline void mptcp_disconnect(struct sock *sk) {}
-+static inline void mptcp_tsq_flags(struct sock *sk) {}
-+static inline void mptcp_tsq_sub_deferred(struct sock *meta_sk) {}
-+static inline void mptcp_hash_remove_bh(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_hash_remove(struct tcp_sock *meta_tp) {}
-+static inline void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct tcp_options_received *rx_opt,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb) {}
-+static inline void mptcp_delete_synack_timer(struct sock *meta_sk) {}
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_H */
-diff --git a/include/net/mptcp_v4.h b/include/net/mptcp_v4.h
-new file mode 100644
-index 000000000000..93ad97c77c5a
---- /dev/null
-+++ b/include/net/mptcp_v4.h
-@@ -0,0 +1,67 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef MPTCP_V4_H_
-+#define MPTCP_V4_H_
-+
-+
-+#include <linux/in.h>
-+#include <linux/skbuff.h>
-+#include <net/mptcp.h>
-+#include <net/request_sock.h>
-+#include <net/sock.h>
-+
-+extern struct request_sock_ops mptcp_request_sock_ops;
-+extern const struct inet_connection_sock_af_ops mptcp_v4_specific;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+#ifdef CONFIG_MPTCP
-+
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net);
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem);
-+int mptcp_pm_v4_init(void);
-+void mptcp_pm_v4_undo(void);
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport);
-+
-+#else
-+
-+static inline int mptcp_v4_do_rcv(const struct sock *meta_sk,
-+ const struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* MPTCP_V4_H_ */
-diff --git a/include/net/mptcp_v6.h b/include/net/mptcp_v6.h
-new file mode 100644
-index 000000000000..49a4f30ccd4d
---- /dev/null
-+++ b/include/net/mptcp_v6.h
-@@ -0,0 +1,69 @@
-+/*
-+ * MPTCP implementation
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef _MPTCP_V6_H
-+#define _MPTCP_V6_H
-+
-+#include <linux/in6.h>
-+#include <net/if_inet6.h>
-+
-+#include <net/mptcp.h>
-+
-+
-+#ifdef CONFIG_MPTCP
-+extern const struct inet_connection_sock_af_ops mptcp_v6_mapped;
-+extern const struct inet_connection_sock_af_ops mptcp_v6_specific;
-+extern struct request_sock_ops mptcp6_request_sock_ops;
-+extern struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+extern struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb);
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net);
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem);
-+int mptcp_pm_v6_init(void);
-+void mptcp_pm_v6_undo(void);
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport);
-+
-+#else /* CONFIG_MPTCP */
-+
-+#define mptcp_v6_mapped ipv6_mapped
-+
-+static inline int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return 0;
-+}
-+
-+#endif /* CONFIG_MPTCP */
-+
-+#endif /* _MPTCP_V6_H */
-diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
-index 361d26077196..bae95a11c531 100644
---- a/include/net/net_namespace.h
-+++ b/include/net/net_namespace.h
-@@ -16,6 +16,7 @@
- #include <net/netns/packet.h>
- #include <net/netns/ipv4.h>
- #include <net/netns/ipv6.h>
-+#include <net/netns/mptcp.h>
- #include <net/netns/ieee802154_6lowpan.h>
- #include <net/netns/sctp.h>
- #include <net/netns/dccp.h>
-@@ -92,6 +93,9 @@ struct net {
- #if IS_ENABLED(CONFIG_IPV6)
- struct netns_ipv6 ipv6;
- #endif
-+#if IS_ENABLED(CONFIG_MPTCP)
-+ struct netns_mptcp mptcp;
-+#endif
- #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
- struct netns_ieee802154_lowpan ieee802154_lowpan;
- #endif
-diff --git a/include/net/netns/mptcp.h b/include/net/netns/mptcp.h
-new file mode 100644
-index 000000000000..bad418b04cc8
---- /dev/null
-+++ b/include/net/netns/mptcp.h
-@@ -0,0 +1,44 @@
-+/*
-+ * MPTCP implementation - MPTCP namespace
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#ifndef __NETNS_MPTCP_H__
-+#define __NETNS_MPTCP_H__
-+
-+#include <linux/compiler.h>
-+
-+enum {
-+ MPTCP_PM_FULLMESH = 0,
-+ MPTCP_PM_MAX
-+};
-+
-+struct netns_mptcp {
-+ void *path_managers[MPTCP_PM_MAX];
-+};
-+
-+#endif /* __NETNS_MPTCP_H__ */
-diff --git a/include/net/request_sock.h b/include/net/request_sock.h
-index 7f830ff67f08..e79e87a8e1a6 100644
---- a/include/net/request_sock.h
-+++ b/include/net/request_sock.h
-@@ -164,7 +164,7 @@ struct request_sock_queue {
- };
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries);
-+ unsigned int nr_table_entries, gfp_t flags);
-
- void __reqsk_queue_destroy(struct request_sock_queue *queue);
- void reqsk_queue_destroy(struct request_sock_queue *queue);
-diff --git a/include/net/sock.h b/include/net/sock.h
-index 156350745700..0e23cae8861f 100644
---- a/include/net/sock.h
-+++ b/include/net/sock.h
-@@ -901,6 +901,16 @@ void sk_clear_memalloc(struct sock *sk);
-
- int sk_wait_data(struct sock *sk, long *timeo);
-
-+/* START - needed for MPTCP */
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority, int family);
-+void sock_lock_init(struct sock *sk);
-+
-+extern struct lock_class_key af_callback_keys[AF_MAX];
-+extern char *const af_family_clock_key_strings[AF_MAX+1];
-+
-+#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
-+/* END - needed for MPTCP */
-+
- struct request_sock_ops;
- struct timewait_sock_ops;
- struct inet_hashinfo;
-diff --git a/include/net/tcp.h b/include/net/tcp.h
-index 7286db80e8b8..ff92e74cd684 100644
---- a/include/net/tcp.h
-+++ b/include/net/tcp.h
-@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TCPOPT_SACK 5 /* SACK Block */
- #define TCPOPT_TIMESTAMP 8 /* Better RTT estimations/PAWS */
- #define TCPOPT_MD5SIG 19 /* MD5 Signature (RFC2385) */
-+#define TCPOPT_MPTCP 30
- #define TCPOPT_EXP 254 /* Experimental */
- /* Magic number to be after the option value for sharing TCP
- * experimental options. See draft-ietf-tcpm-experimental-options-00.txt
-@@ -229,6 +230,27 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
- #define TFO_SERVER_WO_SOCKOPT1 0x400
- #define TFO_SERVER_WO_SOCKOPT2 0x800
-
-+/* Flags from tcp_input.c for tcp_ack */
-+#define FLAG_DATA 0x01 /* Incoming frame contained data. */
-+#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
-+#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
-+#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
-+#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
-+#define FLAG_DATA_SACKED 0x20 /* New SACK. */
-+#define FLAG_ECE 0x40 /* ECE in this ACK */
-+#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
-+#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
-+#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
-+#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
-+#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
-+#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
-+#define MPTCP_FLAG_DATA_ACKED 0x8000
-+
-+#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
-+#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
-+#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
-+#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
-+
- extern struct inet_timewait_death_row tcp_death_row;
-
- /* sysctl variables for tcp */
-@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
- #define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
- #define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
-
-+/**** START - Exports needed for MPTCP ****/
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops;
-+extern const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops;
-+
-+struct mptcp_options_received;
-+
-+void tcp_enter_quickack_mode(struct sock *sk);
-+int tcp_close_state(struct sock *sk);
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb);
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent);
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb);
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask);
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle);
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle);
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss);
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp, const struct sk_buff *skb);
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+void __pskb_trim_head(struct sk_buff *skb, int len);
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb);
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags);
-+void tcp_reset(struct sock *sk);
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin);
-+bool tcp_urg_mode(const struct tcp_sock *tp);
-+void tcp_ack_probe(struct sock *sk);
-+void tcp_rearm_rto(struct sock *sk);
-+int tcp_write_timeout(struct sock *sk);
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set);
-+void tcp_write_err(struct sock *sk);
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr);
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now);
-+
-+int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb);
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc);
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb);
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb);
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb);
-+void tcp_v4_reqsk_destructor(struct request_sock *req);
-+
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req);
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req);
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb);
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
-+void tcp_v6_destroy_sock(struct sock *sk);
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
-+void tcp_v6_hash(struct sock *sk);
-+struct sock *tcp_v6_hnd_req(struct sock *sk,struct sk_buff *skb);
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst);
-+void tcp_v6_reqsk_destructor(struct request_sock *req);
-+
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-+ int large_allowed);
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb);
-+
-+void skb_clone_fraglist(struct sk_buff *skb);
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
-+
-+void inet_twsk_free(struct inet_timewait_sock *tw);
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb);
-+/* These states need RST on ABORT according to RFC793 */
-+static inline bool tcp_need_reset(int state)
-+{
-+ return (1 << state) &
-+ (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-+ TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
-+}
-+
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-+ int hlen);
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen);
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to,
-+ struct sk_buff *from, bool *fragstolen);
-+/**** END - Exports needed for MPTCP ****/
-+
- void tcp_tasklet_init(void);
-
- void tcp_v4_err(struct sk_buff *skb, u32);
-@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- size_t len, int nonblock, int flags, int *addr_len);
- void tcp_parse_options(const struct sk_buff *skb,
- struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt_rx,
- int estab, struct tcp_fastopen_cookie *foc);
- const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
-
-@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
-
- u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- u16 *mssp);
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mss);
--#else
--static inline __u32 cookie_v4_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
- #endif
-
- __u32 cookie_init_timestamp(struct request_sock *req);
-@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
- const struct tcphdr *th, u16 *mssp);
- __u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
- __u16 *mss);
--#else
--static inline __u32 cookie_v6_init_sequence(struct sock *sk,
-- struct sk_buff *skb,
-- __u16 *mss)
--{
-- return 0;
--}
- #endif
- /* tcp_output.c */
-
-@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
- void tcp_send_loss_probe(struct sock *sk);
- bool tcp_schedule_loss_probe(struct sock *sk);
-
-+u16 tcp_select_window(struct sock *sk);
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+
- /* tcp_input.c */
- void tcp_resume_early_retransmit(struct sock *sk);
- void tcp_rearm_rto(struct sock *sk);
- void tcp_reset(struct sock *sk);
-+void tcp_set_rto(struct sock *sk);
-+bool tcp_should_expand_sndbuf(const struct sock *sk);
-+bool tcp_prune_ofo_queue(struct sock *sk);
-
- /* tcp_timer.c */
- void tcp_init_xmit_timers(struct sock *);
-@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
- */
- struct tcp_skb_cb {
- union {
-- struct inet_skb_parm h4;
-+ union {
-+ struct inet_skb_parm h4;
- #if IS_ENABLED(CONFIG_IPV6)
-- struct inet6_skb_parm h6;
-+ struct inet6_skb_parm h6;
- #endif
-- } header; /* For incoming frames */
-+ } header; /* For incoming frames */
-+#ifdef CONFIG_MPTCP
-+ union { /* For MPTCP outgoing frames */
-+ __u32 path_mask; /* paths that tried to send this skb */
-+ __u32 dss[6]; /* DSS options */
-+ };
-+#endif
-+ };
- __u32 seq; /* Starting sequence number */
- __u32 end_seq; /* SEQ + FIN + SYN + datalen */
- __u32 when; /* used to compute rtt's */
-+#ifdef CONFIG_MPTCP
-+ __u8 mptcp_flags; /* flags for the MPTCP layer */
-+ __u8 dss_off; /* Number of 4-byte words until
-+ * seq-number */
-+#endif
- __u8 tcp_flags; /* TCP header flags. (tcp[13]) */
-
- __u8 sacked; /* State flags for SACK/FACK. */
-@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
- /* Determine a window scaling and initial window to offer. */
- void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
- __u32 *window_clamp, int wscale_ok,
-- __u8 *rcv_wscale, __u32 init_rcv_wnd);
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-
- static inline int tcp_win_from_space(int space)
- {
-@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
- space - (space>>sysctl_tcp_adv_win_scale);
- }
-
-+#ifdef CONFIG_MPTCP
-+extern struct static_key mptcp_static_key;
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return static_key_false(&mptcp_static_key) && tp->mpc;
-+}
-+#else
-+static inline bool mptcp(const struct tcp_sock *tp)
-+{
-+ return 0;
-+}
-+#endif
-+
- /* Note: caller must be prepared to deal with negative returns */
- static inline int tcp_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf -
- atomic_read(&sk->sk_rmem_alloc));
- }
-
- static inline int tcp_full_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf);
- }
-
-@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
- ireq->wscale_ok = rx_opt->wscale_ok;
- ireq->acked = 0;
- ireq->ecn_ok = 0;
-+ ireq->mptcp_rqsk = 0;
-+ ireq->saw_mpc = 0;
- ireq->ir_rmt_port = tcp_hdr(skb)->source;
- ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
- }
-@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
- void tcp4_proc_exit(void);
- #endif
-
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req);
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb);
-+
- /* TCP af-specific functions */
- struct tcp_sock_af_ops {
- #ifdef CONFIG_TCP_MD5SIG
-@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
- #endif
- };
-
-+/* TCP/MPTCP-specific functions */
-+struct tcp_sock_ops {
-+ u32 (*__select_window)(struct sock *sk);
-+ u16 (*select_window)(struct sock *sk);
-+ void (*select_initial_window)(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk);
-+ void (*init_buffer_space)(struct sock *sk);
-+ void (*set_rto)(struct sock *sk);
-+ bool (*should_expand_sndbuf)(const struct sock *sk);
-+ void (*send_fin)(struct sock *sk);
-+ bool (*write_xmit)(struct sock *sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp);
-+ void (*send_active_reset)(struct sock *sk, gfp_t priority);
-+ int (*write_wakeup)(struct sock *sk);
-+ bool (*prune_ofo_queue)(struct sock *sk);
-+ void (*retransmit_timer)(struct sock *sk);
-+ void (*time_wait)(struct sock *sk, int state, int timeo);
-+ void (*cleanup_rbuf)(struct sock *sk, int copied);
-+ void (*init_congestion_control)(struct sock *sk);
-+};
-+extern const struct tcp_sock_ops tcp_specific;
-+
- struct tcp_request_sock_ops {
-+ u16 mss_clamp;
- #ifdef CONFIG_TCP_MD5SIG
- struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
- struct request_sock *req);
-@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
- const struct request_sock *req,
- const struct sk_buff *skb);
- #endif
-+ int (*init_req)(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb);
-+#ifdef CONFIG_SYN_COOKIES
-+ __u32 (*cookie_init_seq)(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mss);
-+#endif
-+ struct dst_entry *(*route_req)(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict);
-+ __u32 (*init_seq)(const struct sk_buff *skb);
-+ int (*send_synack)(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl, struct request_sock *req,
-+ u16 queue_mapping, struct tcp_fastopen_cookie *foc);
-+ void (*queue_hash_add)(struct sock *sk, struct request_sock *req,
-+ const unsigned long timeout);
- };
-
-+#ifdef CONFIG_SYN_COOKIES
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return ops->cookie_init_seq(sk, skb, mss);
-+}
-+#else
-+static inline __u32 cookie_init_sequence(const struct tcp_request_sock_ops *ops,
-+ struct sock *sk, struct sk_buff *skb,
-+ __u16 *mss)
-+{
-+ return 0;
-+}
-+#endif
-+
- int tcpv4_offload_init(void);
-
- void tcp_v4_init(void);
-diff --git a/include/uapi/linux/if.h b/include/uapi/linux/if.h
-index 9cf2394f0bcf..c2634b6ed854 100644
---- a/include/uapi/linux/if.h
-+++ b/include/uapi/linux/if.h
-@@ -109,6 +109,9 @@ enum net_device_flags {
- #define IFF_DORMANT IFF_DORMANT
- #define IFF_ECHO IFF_ECHO
-
-+#define IFF_NOMULTIPATH 0x80000 /* Disable for MPTCP */
-+#define IFF_MPBACKUP 0x100000 /* Use as backup path for MPTCP */
-+
- #define IFF_VOLATILE (IFF_LOOPBACK|IFF_POINTOPOINT|IFF_BROADCAST|IFF_ECHO|\
- IFF_MASTER|IFF_SLAVE|IFF_RUNNING|IFF_LOWER_UP|IFF_DORMANT)
-
-diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
-index 3b9718328d8b..487475681d84 100644
---- a/include/uapi/linux/tcp.h
-+++ b/include/uapi/linux/tcp.h
-@@ -112,6 +112,7 @@ enum {
- #define TCP_FASTOPEN 23 /* Enable FastOpen on listeners */
- #define TCP_TIMESTAMP 24
- #define TCP_NOTSENT_LOWAT 25 /* limit number of unsent bytes in write queue */
-+#define MPTCP_ENABLED 26
-
- struct tcp_repair_opt {
- __u32 opt_code;
-diff --git a/net/Kconfig b/net/Kconfig
-index d92afe4204d9..96b58593ad5e 100644
---- a/net/Kconfig
-+++ b/net/Kconfig
-@@ -79,6 +79,7 @@ if INET
- source "net/ipv4/Kconfig"
- source "net/ipv6/Kconfig"
- source "net/netlabel/Kconfig"
-+source "net/mptcp/Kconfig"
-
- endif # if INET
-
-diff --git a/net/Makefile b/net/Makefile
-index cbbbe6d657ca..244bac1435b1 100644
---- a/net/Makefile
-+++ b/net/Makefile
-@@ -20,6 +20,7 @@ obj-$(CONFIG_INET) += ipv4/
- obj-$(CONFIG_XFRM) += xfrm/
- obj-$(CONFIG_UNIX) += unix/
- obj-$(CONFIG_NET) += ipv6/
-+obj-$(CONFIG_MPTCP) += mptcp/
- obj-$(CONFIG_PACKET) += packet/
- obj-$(CONFIG_NET_KEY) += key/
- obj-$(CONFIG_BRIDGE) += bridge/
-diff --git a/net/core/dev.c b/net/core/dev.c
-index 367a586d0c8a..215d2757fbf6 100644
---- a/net/core/dev.c
-+++ b/net/core/dev.c
-@@ -5420,7 +5420,7 @@ int __dev_change_flags(struct net_device *dev, unsigned int flags)
-
- dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
- IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
-- IFF_AUTOMEDIA)) |
-+ IFF_AUTOMEDIA | IFF_NOMULTIPATH | IFF_MPBACKUP)) |
- (dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
- IFF_ALLMULTI));
-
-diff --git a/net/core/request_sock.c b/net/core/request_sock.c
-index 467f326126e0..909dfa13f499 100644
---- a/net/core/request_sock.c
-+++ b/net/core/request_sock.c
-@@ -38,7 +38,8 @@ int sysctl_max_syn_backlog = 256;
- EXPORT_SYMBOL(sysctl_max_syn_backlog);
-
- int reqsk_queue_alloc(struct request_sock_queue *queue,
-- unsigned int nr_table_entries)
-+ unsigned int nr_table_entries,
-+ gfp_t flags)
- {
- size_t lopt_size = sizeof(struct listen_sock);
- struct listen_sock *lopt;
-@@ -48,9 +49,11 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
- nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
- lopt_size += nr_table_entries * sizeof(struct request_sock *);
- if (lopt_size > PAGE_SIZE)
-- lopt = vzalloc(lopt_size);
-+ lopt = __vmalloc(lopt_size,
-+ flags | __GFP_HIGHMEM | __GFP_ZERO,
-+ PAGE_KERNEL);
- else
-- lopt = kzalloc(lopt_size, GFP_KERNEL);
-+ lopt = kzalloc(lopt_size, flags);
- if (lopt == NULL)
- return -ENOMEM;
-
-diff --git a/net/core/skbuff.c b/net/core/skbuff.c
-index c1a33033cbe2..8abc5d60fbe3 100644
---- a/net/core/skbuff.c
-+++ b/net/core/skbuff.c
-@@ -472,7 +472,7 @@ static inline void skb_drop_fraglist(struct sk_buff *skb)
- skb_drop_list(&skb_shinfo(skb)->frag_list);
- }
-
--static void skb_clone_fraglist(struct sk_buff *skb)
-+void skb_clone_fraglist(struct sk_buff *skb)
- {
- struct sk_buff *list;
-
-@@ -897,7 +897,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
- skb->inner_mac_header += off;
- }
-
--static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
-+void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
- {
- __copy_skb_header(new, old);
-
-diff --git a/net/core/sock.c b/net/core/sock.c
-index 026e01f70274..359295523177 100644
---- a/net/core/sock.c
-+++ b/net/core/sock.c
-@@ -136,6 +136,11 @@
-
- #include <trace/events/sock.h>
-
-+#ifdef CONFIG_MPTCP
-+#include <net/mptcp.h>
-+#include <net/inet_common.h>
-+#endif
-+
- #ifdef CONFIG_INET
- #include <net/tcp.h>
- #endif
-@@ -280,7 +285,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
- "slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
- "slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_MAX"
- };
--static const char *const af_family_clock_key_strings[AF_MAX+1] = {
-+char *const af_family_clock_key_strings[AF_MAX+1] = {
- "clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
- "clock-AF_AX25" , "clock-AF_IPX" , "clock-AF_APPLETALK",
- "clock-AF_NETROM", "clock-AF_BRIDGE" , "clock-AF_ATMPVC" ,
-@@ -301,7 +306,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
- * sk_callback_lock locking rules are per-address-family,
- * so split the lock classes by using a per-AF key:
- */
--static struct lock_class_key af_callback_keys[AF_MAX];
-+struct lock_class_key af_callback_keys[AF_MAX];
-
- /* Take into consideration the size of the struct sk_buff overhead in the
- * determination of these values, since that is non-constant across
-@@ -422,8 +427,6 @@ static void sock_warn_obsolete_bsdism(const char *name)
- }
- }
-
--#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
--
- static void sock_disable_timestamp(struct sock *sk, unsigned long flags)
- {
- if (sk->sk_flags & flags) {
-@@ -1253,8 +1256,25 @@ lenout:
- *
- * (We also register the sk_lock with the lock validator.)
- */
--static inline void sock_lock_init(struct sock *sk)
--{
-+void sock_lock_init(struct sock *sk)
-+{
-+#ifdef CONFIG_MPTCP
-+ /* Reclassify the lock-class for subflows */
-+ if (sk->sk_type == SOCK_STREAM && sk->sk_protocol == IPPROTO_TCP)
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->is_master_sk) {
-+ sock_lock_init_class_and_name(sk, "slock-AF_INET-MPTCP",
-+ &meta_slock_key,
-+ "sk_lock-AF_INET-MPTCP",
-+ &meta_key);
-+
-+ /* We don't yet have the mptcp-point.
-+ * Thus we still need inet_sock_destruct
-+ */
-+ sk->sk_destruct = inet_sock_destruct;
-+ return;
-+ }
-+#endif
-+
- sock_lock_init_class_and_name(sk,
- af_family_slock_key_strings[sk->sk_family],
- af_family_slock_keys + sk->sk_family,
-@@ -1301,7 +1321,7 @@ void sk_prot_clear_portaddr_nulls(struct sock *sk, int size)
- }
- EXPORT_SYMBOL(sk_prot_clear_portaddr_nulls);
-
--static struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
-+struct sock *sk_prot_alloc(struct proto *prot, gfp_t priority,
- int family)
- {
- struct sock *sk;
-diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
-index 4db3c2a1679c..04cb17d4b0ce 100644
---- a/net/dccp/ipv6.c
-+++ b/net/dccp/ipv6.c
-@@ -386,7 +386,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
- goto drop;
-
-- req = inet6_reqsk_alloc(&dccp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&dccp6_request_sock_ops);
- if (req == NULL)
- goto drop;
-
-diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..630434db0085 100644
---- a/net/ipv4/Kconfig
-+++ b/net/ipv4/Kconfig
-@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
- For further details see:
- http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
-
-+config TCP_CONG_COUPLED
-+ tristate "MPTCP COUPLED CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Coupled Congestion Control
-+ To enable it, just put 'coupled' in tcp_congestion_control
-+
-+config TCP_CONG_OLIA
-+ tristate "MPTCP Opportunistic Linked Increase"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ MultiPath TCP Opportunistic Linked Increase Congestion Control
-+ To enable it, just put 'olia' in tcp_congestion_control
-+
-+config TCP_CONG_WVEGAS
-+ tristate "MPTCP WVEGAS CONGESTION CONTROL"
-+ depends on MPTCP
-+ default n
-+ ---help---
-+ wVegas congestion control for MPTCP
-+ To enable it, just put 'wvegas' in tcp_congestion_control
-+
- choice
- prompt "Default TCP congestion control"
- default DEFAULT_CUBIC
-@@ -584,6 +608,15 @@ choice
- config DEFAULT_WESTWOOD
- bool "Westwood" if TCP_CONG_WESTWOOD=y
-
-+ config DEFAULT_COUPLED
-+ bool "Coupled" if TCP_CONG_COUPLED=y
-+
-+ config DEFAULT_OLIA
-+ bool "Olia" if TCP_CONG_OLIA=y
-+
-+ config DEFAULT_WVEGAS
-+ bool "Wvegas" if TCP_CONG_WVEGAS=y
-+
- config DEFAULT_RENO
- bool "Reno"
-
-@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
- default "vegas" if DEFAULT_VEGAS
- default "westwood" if DEFAULT_WESTWOOD
- default "veno" if DEFAULT_VENO
-+ default "coupled" if DEFAULT_COUPLED
-+ default "wvegas" if DEFAULT_WVEGAS
- default "reno" if DEFAULT_RENO
- default "cubic"
-
-diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
-index d156b3c5f363..4afd6d8d9028 100644
---- a/net/ipv4/af_inet.c
-+++ b/net/ipv4/af_inet.c
-@@ -104,6 +104,7 @@
- #include <net/ip_fib.h>
- #include <net/inet_connection_sock.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/ping.h>
-@@ -246,8 +247,7 @@ EXPORT_SYMBOL(inet_listen);
- * Create an inet socket.
- */
-
--static int inet_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct sock *sk;
- struct inet_protosw *answer;
-@@ -676,6 +676,23 @@ int inet_accept(struct socket *sock, struct socket *newsock, int flags)
- lock_sock(sk2);
-
- sock_rps_record_flow(sk2);
-+
-+ if (sk2->sk_protocol == IPPROTO_TCP && mptcp(tcp_sk(sk2))) {
-+ struct sock *sk_it = sk2;
-+
-+ mptcp_for_each_sk(tcp_sk(sk2)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+
-+ if (tcp_sk(sk2)->mpcb->master_sk) {
-+ sk_it = tcp_sk(sk2)->mpcb->master_sk;
-+
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_it->sk_wq = newsock->wq;
-+ sk_it->sk_socket = newsock;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+ }
-+
- WARN_ON(!((1 << sk2->sk_state) &
- (TCPF_ESTABLISHED | TCPF_SYN_RECV |
- TCPF_CLOSE_WAIT | TCPF_CLOSE)));
-@@ -1763,6 +1780,9 @@ static int __init inet_init(void)
-
- ip_init();
-
-+ /* We must initialize MPTCP before TCP. */
-+ mptcp_init();
-+
- tcp_v4_init();
-
- /* Setup TCP slab cache for open requests. */
-diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
-index 14d02ea905b6..7d734d8af19b 100644
---- a/net/ipv4/inet_connection_sock.c
-+++ b/net/ipv4/inet_connection_sock.c
-@@ -23,6 +23,7 @@
- #include <net/route.h>
- #include <net/tcp_states.h>
- #include <net/xfrm.h>
-+#include <net/mptcp.h>
-
- #ifdef INET_CSK_DEBUG
- const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer value\n";
-@@ -465,8 +466,8 @@ no_route:
- }
- EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
-
--static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet_synq_hash(const __be32 raddr, const __be16 rport, const u32 rnd,
-+ const u32 synq_hsize)
- {
- return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
- }
-@@ -647,7 +648,7 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
-
- lopt->clock_hand = i;
-
-- if (lopt->qlen)
-+ if (lopt->qlen && !is_meta_sk(parent))
- inet_csk_reset_keepalive_timer(parent, interval);
- }
- EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
-@@ -664,7 +665,9 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
- const struct request_sock *req,
- const gfp_t priority)
- {
-- struct sock *newsk = sk_clone_lock(sk, priority);
-+ struct sock *newsk;
-+
-+ newsk = sk_clone_lock(sk, priority);
-
- if (newsk != NULL) {
- struct inet_connection_sock *newicsk = inet_csk(newsk);
-@@ -743,7 +746,8 @@ int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
- {
- struct inet_sock *inet = inet_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-- int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
-+ int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries,
-+ GFP_KERNEL);
-
- if (rc != 0)
- return rc;
-@@ -801,9 +805,14 @@ void inet_csk_listen_stop(struct sock *sk)
-
- while ((req = acc_req) != NULL) {
- struct sock *child = req->sk;
-+ bool mutex_taken = false;
-
- acc_req = req->dl_next;
-
-+ if (is_meta_sk(child)) {
-+ mutex_lock(&tcp_sk(child)->mpcb->mpcb_mutex);
-+ mutex_taken = true;
-+ }
- local_bh_disable();
- bh_lock_sock(child);
- WARN_ON(sock_owned_by_user(child));
-@@ -832,6 +841,8 @@ void inet_csk_listen_stop(struct sock *sk)
-
- bh_unlock_sock(child);
- local_bh_enable();
-+ if (mutex_taken)
-+ mutex_unlock(&tcp_sk(child)->mpcb->mpcb_mutex);
- sock_put(child);
-
- sk_acceptq_removed(sk);
-diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
-index c86624b36a62..0ff3fe004d62 100644
---- a/net/ipv4/syncookies.c
-+++ b/net/ipv4/syncookies.c
-@@ -170,7 +170,8 @@ u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
- }
- EXPORT_SYMBOL_GPL(__cookie_v4_init_sequence);
-
--__u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
-+__u32 cookie_v4_init_sequence(struct sock *sk, const struct sk_buff *skb,
-+ __u16 *mssp)
- {
- const struct iphdr *iph = ip_hdr(skb);
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -284,7 +285,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-@@ -355,10 +356,10 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
- /* Try to redo what tcp_v4_send_synack did. */
- req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
-
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(&rt->dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(&rt->dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
-index 9d2118e5fbc7..2cb89f886d45 100644
---- a/net/ipv4/tcp.c
-+++ b/net/ipv4/tcp.c
-@@ -271,6 +271,7 @@
-
- #include <net/icmp.h>
- #include <net/inet_common.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/xfrm.h>
- #include <net/ip.h>
-@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
- return period;
- }
-
-+const struct tcp_sock_ops tcp_specific = {
-+ .__select_window = __tcp_select_window,
-+ .select_window = tcp_select_window,
-+ .select_initial_window = tcp_select_initial_window,
-+ .init_buffer_space = tcp_init_buffer_space,
-+ .set_rto = tcp_set_rto,
-+ .should_expand_sndbuf = tcp_should_expand_sndbuf,
-+ .init_congestion_control = tcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
- /* Address-family independent initialization for a tcp_sock.
- *
- * NOTE: A lot of things set to zero explicitly by call to
-@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
- sk->sk_sndbuf = sysctl_tcp_wmem[1];
- sk->sk_rcvbuf = sysctl_tcp_rmem[1];
-
-+ tp->ops = &tcp_specific;
-+
- local_bh_disable();
- sock_update_memcg(sk);
- sk_sockets_allocated_inc(sk);
-@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
- int ret;
-
- sock_rps_record_flow(sk);
-+
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tcp_sk(sk))) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
- /*
- * We can't seek on a socket input
- */
-@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
- return NULL;
- }
-
--static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
-- int large_allowed)
-+unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now, int large_allowed)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 xmit_size_goal, old_size_goal;
-@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
- {
- int mss_now;
-
-- mss_now = tcp_current_mss(sk);
-- *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ if (mptcp(tcp_sk(sk))) {
-+ mss_now = mptcp_current_mss(sk);
-+ *size_goal = mptcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ } else {
-+ mss_now = tcp_current_mss(sk);
-+ *size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
-+ }
-
- return mss_now;
- }
-@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto out_err;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+
-+	 /* We must check this with the socket lock held because we iterate
-+ * over the subflows.
-+ */
-+ if (!mptcp_can_sendpage(sk)) {
-+ ssize_t ret;
-+
-+ release_sock(sk);
-+ ret = sock_no_sendpage(sk->sk_socket, page, offset,
-+ size, flags);
-+ lock_sock(sk);
-+ return ret;
-+ }
-+
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
-
- mss_now = tcp_send_mss(sk, &size_goal, flags);
-@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
- {
- ssize_t res;
-
-- if (!(sk->sk_route_caps & NETIF_F_SG) ||
-- !(sk->sk_route_caps & NETIF_F_ALL_CSUM))
-+ /* If MPTCP is enabled, we check it later after establishment */
-+ if (!mptcp(tcp_sk(sk)) && (!(sk->sk_route_caps & NETIF_F_SG) ||
-+ !(sk->sk_route_caps & NETIF_F_ALL_CSUM)))
- return sock_no_sendpage(sk->sk_socket, page, offset, size,
- flags);
-
-@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
- const struct tcp_sock *tp = tcp_sk(sk);
- int tmp = tp->mss_cache;
-
-+ if (mptcp(tp))
-+ return mptcp_select_size(sk, sg);
-+
- if (sg) {
- if (sk_can_gso(sk)) {
- /* Small frames wont use a full page:
-@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- * is fully established.
- */
- if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-- !tcp_passive_fastopen(sk)) {
-+ !tcp_passive_fastopen(mptcp(tp) && tp->mpcb->master_sk ?
-+ tp->mpcb->master_sk : sk)) {
- if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
- goto do_error;
- }
-
-+ if (mptcp(tp)) {
-+ struct sock *sk_it = sk;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+
- if (unlikely(tp->repair)) {
- if (tp->repair_queue == TCP_RECV_QUEUE) {
- copied = tcp_send_rcvq(sk, msg, size);
-@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
- goto out_err;
-
-- sg = !!(sk->sk_route_caps & NETIF_F_SG);
-+ if (mptcp(tp))
-+ sg = mptcp_can_sg(sk);
-+ else
-+ sg = !!(sk->sk_route_caps & NETIF_F_SG);
-
- while (--iovlen >= 0) {
- size_t seglen = iov->iov_len;
-@@ -1183,8 +1251,15 @@ new_segment:
-
- /*
- * Check whether we can use HW checksum.
-+ *
-+ * If dss-csum is enabled, we do not do hw-csum.
-+ * In case of non-mptcp we check the
-+ * device-capabilities.
-+	 * later in mptcp_write_xmit. In case of mptcp, hw-csums will be handled
-+ * later in mptcp_write_xmit.
- */
-- if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
-+ if (((mptcp(tp) && !tp->mpcb->dss_csum) || !mptcp(tp)) &&
-+ (mptcp(tp) || sk->sk_route_caps & NETIF_F_ALL_CSUM))
- skb->ip_summed = CHECKSUM_PARTIAL;
-
- skb_entail(sk, skb);
-@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
-
- /* Optimize, __tcp_select_window() is not cheap. */
- if (2*rcv_window_now <= tp->window_clamp) {
-- __u32 new_window = __tcp_select_window(sk);
-+ __u32 new_window = tp->ops->__select_window(sk);
-
- /* Send ACK now, if this read freed lots of space
- * in our buffer. Certainly, new_window is new window.
-@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
- /* Clean up data we have read: This will do ACK frames. */
- if (copied > 0) {
- tcp_recv_skb(sk, seq, &offset);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- }
- return copied;
- }
-@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
-
- lock_sock(sk);
-
-+#ifdef CONFIG_MPTCP
-+ if (mptcp(tp)) {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it)
-+ sock_rps_record_flow(sk_it);
-+ }
-+#endif
-+
- err = -ENOTCONN;
- if (sk->sk_state == TCP_LISTEN)
- goto out;
-@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- }
- }
-
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
- /* Install new reader */
-@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
- if (tp->rcv_wnd == 0 &&
- !skb_queue_empty(&sk->sk_async_wait_queue)) {
- tcp_service_net_dma(sk, true);
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
- } else
- dma_async_issue_pending(tp->ucopy.dma_chan);
- }
-@@ -1993,7 +2076,7 @@ skip_copy:
- */
-
- /* Clean up data we have read: This will do ACK frames. */
-- tcp_cleanup_rbuf(sk, copied);
-+ tp->ops->cleanup_rbuf(sk, copied);
-
- release_sock(sk);
- return copied;
-@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
- /* TCP_CLOSING */ TCP_CLOSING,
- };
-
--static int tcp_close_state(struct sock *sk)
-+int tcp_close_state(struct sock *sk)
- {
- int next = (int)new_state[sk->sk_state];
- int ns = next & TCP_STATE_MASK;
-@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
- TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
- /* Clear out any half completed packets. FIN if needed. */
- if (tcp_close_state(sk))
-- tcp_send_fin(sk);
-+ tcp_sk(sk)->ops->send_fin(sk);
- }
- }
- EXPORT_SYMBOL(tcp_shutdown);
-@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
- int data_was_unread = 0;
- int state;
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_close(sk, timeout);
-+ return;
-+ }
-+
- lock_sock(sk);
- sk->sk_shutdown = SHUTDOWN_MASK;
-
-@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
- /* Unread data was tossed, zap the connection. */
- NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, sk->sk_allocation);
-+ tcp_sk(sk)->ops->send_active_reset(sk, sk->sk_allocation);
- } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
- /* Check zero linger _after_ checking for unread data. */
- sk->sk_prot->disconnect(sk, 0);
-@@ -2247,7 +2335,7 @@ adjudge_to_death:
- struct tcp_sock *tp = tcp_sk(sk);
- if (tp->linger2 < 0) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONLINGER);
- } else {
-@@ -2257,7 +2345,8 @@ adjudge_to_death:
- inet_csk_reset_keepalive_timer(sk,
- tmo - TCP_TIMEWAIT_LEN);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_FIN_WAIT2,
-+ tmo);
- goto out;
- }
- }
-@@ -2266,7 +2355,7 @@ adjudge_to_death:
- sk_mem_reclaim(sk);
- if (tcp_check_oom(sk, 0)) {
- tcp_set_state(sk, TCP_CLOSE);
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
- NET_INC_STATS_BH(sock_net(sk),
- LINUX_MIB_TCPABORTONMEMORY);
- }
-@@ -2291,15 +2380,6 @@ out:
- }
- EXPORT_SYMBOL(tcp_close);
-
--/* These states need RST on ABORT according to RFC793 */
--
--static inline bool tcp_need_reset(int state)
--{
-- return (1 << state) &
-- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_FIN_WAIT1 |
-- TCPF_FIN_WAIT2 | TCPF_SYN_RECV);
--}
--
- int tcp_disconnect(struct sock *sk, int flags)
- {
- struct inet_sock *inet = inet_sk(sk);
-@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
- /* The last check adjusts for discrepancy of Linux wrt. RFC
- * states
- */
-- tcp_send_active_reset(sk, gfp_any());
-+ tp->ops->send_active_reset(sk, gfp_any());
- sk->sk_err = ECONNRESET;
- } else if (old_state == TCP_SYN_SENT)
- sk->sk_err = ECONNRESET;
-@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
- if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
- inet_reset_saddr(sk);
-
-+ if (is_meta_sk(sk)) {
-+ mptcp_disconnect(sk);
-+ } else {
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove_bh(tp);
-+ }
-+
- sk->sk_shutdown = 0;
- sock_reset_flag(sk, SOCK_DONE);
- tp->srtt_us = 0;
-@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- break;
-
- case TCP_DEFER_ACCEPT:
-+ /* An established MPTCP-connection (mptcp(tp) only returns true
-+ * if the socket is established) should not use DEFER on new
-+ * subflows.
-+ */
-+ if (mptcp(tp))
-+ break;
- /* Translate value in seconds to number of retransmits */
- icsk->icsk_accept_queue.rskq_defer_accept =
- secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
-@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
- inet_csk_ack_scheduled(sk)) {
- icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
-- tcp_cleanup_rbuf(sk, 1);
-+ tp->ops->cleanup_rbuf(sk, 1);
- if (!(val & 1))
- icsk->icsk_ack.pingpong = 1;
- }
-@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
- tp->notsent_lowat = val;
- sk->sk_write_space(sk);
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ if (sk->sk_state == TCP_CLOSE || sk->sk_state == TCP_LISTEN) {
-+ if (val)
-+ tp->mptcp_enabled = 1;
-+ else
-+ tp->mptcp_enabled = 0;
-+ } else {
-+ err = -EPERM;
-+ }
-+ break;
-+#endif
- default:
- err = -ENOPROTOOPT;
- break;
-@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
- case TCP_NOTSENT_LOWAT:
- val = tp->notsent_lowat;
- break;
-+#ifdef CONFIG_MPTCP
-+ case MPTCP_ENABLED:
-+ val = tp->mptcp_enabled;
-+ break;
-+#endif
- default:
- return -ENOPROTOOPT;
- }
-@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
- if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
-
-+ WARN_ON(sk->sk_state == TCP_CLOSE);
- tcp_set_state(sk, TCP_CLOSE);
-+
- tcp_clear_xmit_timers(sk);
-+
- if (req != NULL)
- reqsk_fastopen_remove(sk, req, false);
-
-diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
-index 9771563ab564..5c230d96c4c1 100644
---- a/net/ipv4/tcp_fastopen.c
-+++ b/net/ipv4/tcp_fastopen.c
-@@ -7,6 +7,7 @@
- #include <linux/rculist.h>
- #include <net/inetpeer.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-
- int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE;
-
-@@ -133,7 +134,7 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- {
- struct tcp_sock *tp;
- struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
-- struct sock *child;
-+ struct sock *child, *meta_sk;
-
- req->num_retrans = 0;
- req->num_timeout = 0;
-@@ -176,13 +177,6 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- /* Add the child socket directly into the accept queue */
- inet_csk_reqsk_queue_add(sk, req, child);
-
-- /* Now finish processing the fastopen child socket. */
-- inet_csk(child)->icsk_af_ops->rebuild_header(child);
-- tcp_init_congestion_control(child);
-- tcp_mtup_init(child);
-- tcp_init_metrics(child);
-- tcp_init_buffer_space(child);
--
- /* Queue the data carried in the SYN packet. We need to first
- * bump skb's refcnt because the caller will attempt to free it.
- *
-@@ -199,8 +193,24 @@ static bool tcp_fastopen_create_child(struct sock *sk,
- tp->syn_data_acked = 1;
- }
- tcp_rsk(req)->rcv_nxt = tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+
-+ meta_sk = child;
-+ if (!mptcp_check_req_fastopen(meta_sk, req)) {
-+ child = tcp_sk(meta_sk)->mpcb->master_sk;
-+ tp = tcp_sk(child);
-+ }
-+
-+ /* Now finish processing the fastopen child socket. */
-+ inet_csk(child)->icsk_af_ops->rebuild_header(child);
-+ tp->ops->init_congestion_control(child);
-+ tcp_mtup_init(child);
-+ tcp_init_metrics(child);
-+ tp->ops->init_buffer_space(child);
-+
- sk->sk_data_ready(sk);
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- WARN_ON(req->sk == NULL);
- return true;
-diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
-index 40639c288dc2..3273bb69f387 100644
---- a/net/ipv4/tcp_input.c
-+++ b/net/ipv4/tcp_input.c
-@@ -74,6 +74,9 @@
- #include <linux/ipsec.h>
- #include <asm/unaligned.h>
- #include <net/netdma.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-
- int sysctl_tcp_timestamps __read_mostly = 1;
- int sysctl_tcp_window_scaling __read_mostly = 1;
-@@ -99,25 +102,6 @@ int sysctl_tcp_thin_dupack __read_mostly;
- int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
- int sysctl_tcp_early_retrans __read_mostly = 3;
-
--#define FLAG_DATA 0x01 /* Incoming frame contained data. */
--#define FLAG_WIN_UPDATE 0x02 /* Incoming ACK was a window update. */
--#define FLAG_DATA_ACKED 0x04 /* This ACK acknowledged new data. */
--#define FLAG_RETRANS_DATA_ACKED 0x08 /* "" "" some of which was retransmitted. */
--#define FLAG_SYN_ACKED 0x10 /* This ACK acknowledged SYN. */
--#define FLAG_DATA_SACKED 0x20 /* New SACK. */
--#define FLAG_ECE 0x40 /* ECE in this ACK */
--#define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/
--#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */
--#define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
--#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */
--#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */
--#define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */
--
--#define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
--#define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
--#define FLAG_CA_ALERT (FLAG_DATA_SACKED|FLAG_ECE)
--#define FLAG_FORWARD_PROGRESS (FLAG_ACKED|FLAG_DATA_SACKED)
--
- #define TCP_REMNANT (TCP_FLAG_FIN|TCP_FLAG_URG|TCP_FLAG_SYN|TCP_FLAG_PSH)
- #define TCP_HP_BITS (~(TCP_RESERVED_BITS|TCP_FLAG_PSH))
-
-@@ -181,7 +165,7 @@ static void tcp_incr_quickack(struct sock *sk)
- icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
- }
-
--static void tcp_enter_quickack_mode(struct sock *sk)
-+void tcp_enter_quickack_mode(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- tcp_incr_quickack(sk);
-@@ -283,8 +267,12 @@ static void tcp_sndbuf_expand(struct sock *sk)
- per_mss = roundup_pow_of_two(per_mss) +
- SKB_DATA_ALIGN(sizeof(struct sk_buff));
-
-- nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-- nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ if (mptcp(tp)) {
-+ nr_segs = mptcp_check_snd_buf(tp);
-+ } else {
-+ nr_segs = max_t(u32, TCP_INIT_CWND, tp->snd_cwnd);
-+ nr_segs = max_t(u32, nr_segs, tp->reordering + 1);
-+ }
-
- /* Fast Recovery (RFC 5681 3.2) :
- * Cubic needs 1.7 factor, rounded to 2 to include
-@@ -292,8 +280,16 @@ static void tcp_sndbuf_expand(struct sock *sk)
- */
- sndmem = 2 * nr_segs * per_mss;
-
-- if (sk->sk_sndbuf < sndmem)
-+	/* MPTCP: after this, sndmem is the new contribution of the
-+ * current subflow to the aggregated sndbuf */
-+ if (sk->sk_sndbuf < sndmem) {
-+ int old_sndbuf = sk->sk_sndbuf;
- sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-+ /* MPTCP: ok, the subflow sndbuf has grown, reflect
-+	 * this in the aggregate buffer. */
-+ if (mptcp(tp) && old_sndbuf != sk->sk_sndbuf)
-+ mptcp_update_sndbuf(tp);
-+ }
- }
-
- /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
-@@ -342,10 +338,12 @@ static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
- static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-
- /* Check #1 */
-- if (tp->rcv_ssthresh < tp->window_clamp &&
-- (int)tp->rcv_ssthresh < tcp_space(sk) &&
-+ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
-+ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
- !sk_under_memory_pressure(sk)) {
- int incr;
-
-@@ -353,14 +351,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
- * will fit to rcvbuf in future.
- */
- if (tcp_win_from_space(skb->truesize) <= skb->len)
-- incr = 2 * tp->advmss;
-+ incr = 2 * meta_tp->advmss;
- else
-- incr = __tcp_grow_window(sk, skb);
-+ incr = __tcp_grow_window(meta_sk, skb);
-
- if (incr) {
- incr = max_t(int, incr, 2 * skb->len);
-- tp->rcv_ssthresh = min(tp->rcv_ssthresh + incr,
-- tp->window_clamp);
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh + incr,
-+ meta_tp->window_clamp);
- inet_csk(sk)->icsk_ack.quick |= 1;
- }
- }
-@@ -543,7 +541,10 @@ void tcp_rcv_space_adjust(struct sock *sk)
- int copied;
-
- time = tcp_time_stamp - tp->rcvq_space.time;
-- if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
-+ if (mptcp(tp)) {
-+ if (mptcp_check_rtt(tp, time))
-+ return;
-+ } else if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
- return;
-
- /* Number of bytes copied to user in last RTT */
-@@ -761,7 +762,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
- /* Calculate rto without backoff. This is the second half of Van Jacobson's
- * routine referred to above.
- */
--static void tcp_set_rto(struct sock *sk)
-+void tcp_set_rto(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- /* Old crap is replaced with new one. 8)
-@@ -1376,7 +1377,11 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
- int len;
- int in_sack;
-
-- if (!sk_can_gso(sk))
-+ /* For MPTCP we cannot shift skb-data and remove one skb from the
-+	 * send-queue, because this will make us lose the DSS-option (which
-+ * is stored in TCP_SKB_CB(skb)->dss) of the skb we are removing.
-+ */
-+ if (!sk_can_gso(sk) || mptcp(tp))
- goto fallback;
-
- /* Normally R but no L won't result in plain S */
-@@ -2915,7 +2920,7 @@ static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
- return false;
-
- tcp_rtt_estimator(sk, seq_rtt_us);
-- tcp_set_rto(sk);
-+ tp->ops->set_rto(sk);
-
- /* RFC6298: only reset backoff on valid RTT measurement. */
- inet_csk(sk)->icsk_backoff = 0;
-@@ -3000,7 +3005,7 @@ void tcp_resume_early_retransmit(struct sock *sk)
- }
-
- /* If we get here, the whole TSO packet has not been acked. */
--static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
-+u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 packets_acked;
-@@ -3095,6 +3100,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- */
- if (!(scb->tcp_flags & TCPHDR_SYN)) {
- flag |= FLAG_DATA_ACKED;
-+ if (mptcp(tp) && mptcp_is_data_seq(skb))
-+ flag |= MPTCP_FLAG_DATA_ACKED;
- } else {
- flag |= FLAG_SYN_ACKED;
- tp->retrans_stamp = 0;
-@@ -3189,7 +3196,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
- return flag;
- }
-
--static void tcp_ack_probe(struct sock *sk)
-+void tcp_ack_probe(struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -3236,9 +3243,8 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
- /* Check that window update is acceptable.
- * The function assumes that snd_una<=ack<=snd_next.
- */
--static inline bool tcp_may_update_window(const struct tcp_sock *tp,
-- const u32 ack, const u32 ack_seq,
-- const u32 nwin)
-+bool tcp_may_update_window(const struct tcp_sock *tp, const u32 ack,
-+ const u32 ack_seq, const u32 nwin)
- {
- return after(ack, tp->snd_una) ||
- after(ack_seq, tp->snd_wl1) ||
-@@ -3357,7 +3363,7 @@ static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
- }
-
- /* This routine deals with incoming acks, but not outgoing ones. */
--static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
-+static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -3449,6 +3455,16 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
- sack_rtt_us);
- acked -= tp->packets_out;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_fallback_infinite(sk, flag)) {
-+ pr_err("%s resetting flow\n", __func__);
-+ mptcp_send_reset(sk);
-+ goto invalid_ack;
-+ }
-+
-+ mptcp_clean_rtx_infinite(skb, sk);
-+ }
-+
- /* Advance cwnd if state allows */
- if (tcp_may_raise_cwnd(sk, flag))
- tcp_cong_avoid(sk, ack, acked);
-@@ -3512,8 +3528,9 @@ old_ack:
- * the fast version below fails.
- */
- void tcp_parse_options(const struct sk_buff *skb,
-- struct tcp_options_received *opt_rx, int estab,
-- struct tcp_fastopen_cookie *foc)
-+ struct tcp_options_received *opt_rx,
-+ struct mptcp_options_received *mopt,
-+ int estab, struct tcp_fastopen_cookie *foc)
- {
- const unsigned char *ptr;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -3596,6 +3613,9 @@ void tcp_parse_options(const struct sk_buff *skb,
- */
- break;
- #endif
-+ case TCPOPT_MPTCP:
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ break;
- case TCPOPT_EXP:
- /* Fast Open option shares code 254 using a
- * 16 bits magic number. It's valid only in
-@@ -3657,8 +3677,8 @@ static bool tcp_fast_parse_options(const struct sk_buff *skb,
- if (tcp_parse_aligned_timestamp(tp, th))
- return true;
- }
--
-- tcp_parse_options(skb, &tp->rx_opt, 1, NULL);
-+ tcp_parse_options(skb, &tp->rx_opt, mptcp(tp) ? &tp->mptcp->rx_opt : NULL,
-+ 1, NULL);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -3831,6 +3851,8 @@ static void tcp_fin(struct sock *sk)
- dst = __sk_dst_get(sk);
- if (!dst || !dst_metric(dst, RTAX_QUICKACK))
- inet_csk(sk)->icsk_ack.pingpong = 1;
-+ if (mptcp(tp))
-+ mptcp_sub_close_passive(sk);
- break;
-
- case TCP_CLOSE_WAIT:
-@@ -3852,9 +3874,16 @@ static void tcp_fin(struct sock *sk)
- tcp_set_state(sk, TCP_CLOSING);
- break;
- case TCP_FIN_WAIT2:
-+ if (mptcp(tp)) {
-+ /* The socket will get closed by mptcp_data_ready.
-+ * We first have to process all data-sequences.
-+ */
-+ tp->close_it = 1;
-+ break;
-+ }
- /* Received a FIN -- send ACK and enter TIME_WAIT. */
- tcp_send_ack(sk);
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- break;
- default:
- /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-@@ -3876,6 +3905,10 @@ static void tcp_fin(struct sock *sk)
- if (!sock_flag(sk, SOCK_DEAD)) {
- sk->sk_state_change(sk);
-
-+ /* Don't wake up MPTCP-subflows */
-+ if (mptcp(tp))
-+ return;
-+
- /* Do not send POLL_HUP for half duplex close. */
- if (sk->sk_shutdown == SHUTDOWN_MASK ||
- sk->sk_state == TCP_CLOSE)
-@@ -4073,7 +4106,11 @@ static void tcp_ofo_queue(struct sock *sk)
- tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
- }
-
-- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
-+ /* In case of MPTCP, the segment may be empty if it's a
-+ * non-data DATA_FIN. (see beginning of tcp_data_queue)
-+ */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt) &&
-+ !(mptcp(tp) && TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq)) {
- SOCK_DEBUG(sk, "ofo packet was already received\n");
- __skb_unlink(skb, &tp->out_of_order_queue);
- __kfree_skb(skb);
-@@ -4091,12 +4128,14 @@ static void tcp_ofo_queue(struct sock *sk)
- }
- }
-
--static bool tcp_prune_ofo_queue(struct sock *sk);
- static int tcp_prune_queue(struct sock *sk);
-
- static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- unsigned int size)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = mptcp_meta_sk(sk);
-+
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
- !sk_rmem_schedule(sk, skb, size)) {
-
-@@ -4104,7 +4143,7 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size)) {
-- if (!tcp_prune_ofo_queue(sk))
-+ if (!tcp_sk(sk)->ops->prune_ofo_queue(sk))
- return -1;
-
- if (!sk_rmem_schedule(sk, skb, size))
-@@ -4127,15 +4166,16 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
- * Better try to coalesce them right now to avoid future collapses.
- * Returns true if caller should free @from instead of queueing it
- */
--static bool tcp_try_coalesce(struct sock *sk,
-- struct sk_buff *to,
-- struct sk_buff *from,
-- bool *fragstolen)
-+bool tcp_try_coalesce(struct sock *sk, struct sk_buff *to, struct sk_buff *from,
-+ bool *fragstolen)
- {
- int delta;
-
- *fragstolen = false;
-
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ return false;
-+
- if (tcp_hdr(from)->fin)
- return false;
-
-@@ -4225,7 +4265,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
-
- /* Do skb overlap to previous one? */
- if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq) &&
-+ !(mptcp(tp) && end_seq == seq)) {
- /* All the bits are present. Drop. */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
- __kfree_skb(skb);
-@@ -4263,6 +4305,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
- end_seq);
- break;
- }
-+ /* MPTCP allows non-data data-fin to be in the ofo-queue */
-+ if (mptcp(tp) && TCP_SKB_CB(skb1)->seq == TCP_SKB_CB(skb1)->end_seq)
-+ continue;
- __skb_unlink(skb1, &tp->out_of_order_queue);
- tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
- TCP_SKB_CB(skb1)->end_seq);
-@@ -4280,8 +4325,8 @@ end:
- }
- }
-
--static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-- bool *fragstolen)
-+int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
-+ bool *fragstolen)
- {
- int eaten;
- struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
-@@ -4343,7 +4388,10 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
- int eaten = -1;
- bool fragstolen = false;
-
-- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-+ /* If no data is present, but a data_fin is in the options, we still
-+ * have to call mptcp_queue_skb later on. */
-+ if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq &&
-+ !(mptcp(tp) && mptcp_is_data_fin(skb)))
- goto drop;
-
- skb_dst_drop(skb);
-@@ -4389,7 +4437,7 @@ queue_and_out:
- eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
- }
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-- if (skb->len)
-+ if (skb->len || mptcp_is_data_fin(skb))
- tcp_event_data_recv(sk, skb);
- if (th->fin)
- tcp_fin(sk);
-@@ -4411,7 +4459,11 @@ queue_and_out:
-
- if (eaten > 0)
- kfree_skb_partial(skb, fragstolen);
-- if (!sock_flag(sk, SOCK_DEAD))
-+ if (!sock_flag(sk, SOCK_DEAD) || mptcp(tp))
-+ /* MPTCP: we always have to call data_ready, because
-+ * we may be about to receive a data-fin, which still
-+ * must get queued.
-+ */
- sk->sk_data_ready(sk);
- return;
- }
-@@ -4463,6 +4515,8 @@ static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
- next = skb_queue_next(list, skb);
-
- __skb_unlink(skb, list);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_remove_shortcuts(tcp_sk(sk)->mpcb, skb);
- __kfree_skb(skb);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
-
-@@ -4630,7 +4684,7 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
- * Purge the out-of-order queue.
- * Return true if queue was pruned.
- */
--static bool tcp_prune_ofo_queue(struct sock *sk)
-+bool tcp_prune_ofo_queue(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- bool res = false;
-@@ -4686,7 +4740,7 @@ static int tcp_prune_queue(struct sock *sk)
- /* Collapsing did not help, destructive actions follow.
- * This must not ever occur. */
-
-- tcp_prune_ofo_queue(sk);
-+ tp->ops->prune_ofo_queue(sk);
-
- if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
- return 0;
-@@ -4702,7 +4756,29 @@ static int tcp_prune_queue(struct sock *sk)
- return -1;
- }
-
--static bool tcp_should_expand_sndbuf(const struct sock *sk)
-+/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
-+ * As additional protections, we do not touch cwnd in retransmission phases,
-+ * and if application hit its sndbuf limit recently.
-+ */
-+void tcp_cwnd_application_limited(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
-+ sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
-+ /* Limited by application or receiver window. */
-+ u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
-+ u32 win_used = max(tp->snd_cwnd_used, init_win);
-+ if (win_used < tp->snd_cwnd) {
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
-+ }
-+ tp->snd_cwnd_used = 0;
-+ }
-+ tp->snd_cwnd_stamp = tcp_time_stamp;
-+}
-+
-+bool tcp_should_expand_sndbuf(const struct sock *sk)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -4737,7 +4813,7 @@ static void tcp_new_space(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-- if (tcp_should_expand_sndbuf(sk)) {
-+ if (tp->ops->should_expand_sndbuf(sk)) {
- tcp_sndbuf_expand(sk);
- tp->snd_cwnd_stamp = tcp_time_stamp;
- }
-@@ -4749,8 +4825,9 @@ static void tcp_check_space(struct sock *sk)
- {
- if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
- sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
-- if (sk->sk_socket &&
-- test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
-+ if (mptcp(tcp_sk(sk)) ||
-+ (sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)))
- tcp_new_space(sk);
- }
- }
-@@ -4773,7 +4850,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
- /* ... and right edge of window advances far enough.
- * (tcp_recvmsg() will send ACK otherwise). Or...
- */
-- __tcp_select_window(sk) >= tp->rcv_wnd) ||
-+ tp->ops->__select_window(sk) >= tp->rcv_wnd) ||
- /* We ACK each frame or... */
- tcp_in_quickack_mode(sk) ||
- /* We have out of order data. */
-@@ -4875,6 +4952,10 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, const struct tcphdr *t
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-+ /* MPTCP urgent data is not yet supported */
-+ if (mptcp(tp))
-+ return;
-+
- /* Check if we get a new urgent pointer - normally not. */
- if (th->urg)
- tcp_check_urg(sk, th);
-@@ -4942,8 +5023,7 @@ static inline bool tcp_checksum_complete_user(struct sock *sk,
- }
-
- #ifdef CONFIG_NET_DMA
--static bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
-- int hlen)
-+bool tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, int hlen)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- int chunk = skb->len - hlen;
-@@ -5052,9 +5132,15 @@ syn_challenge:
- goto discard;
- }
-
-+ /* If valid: post process the received MPTCP options. */
-+ if (mptcp(tp) && mptcp_handle_options(sk, th, skb))
-+ goto discard;
-+
- return true;
-
- discard:
-+ if (mptcp(tp))
-+ mptcp_reset_mopt(tp);
- __kfree_skb(skb);
- return false;
- }
-@@ -5106,6 +5192,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-
- tp->rx_opt.saw_tstamp = 0;
-
-+ /* MPTCP: force slowpath. */
-+ if (mptcp(tp))
-+ goto slow_path;
-+
- /* pred_flags is 0xS?10 << 16 + snd_wnd
- * if header_prediction is to be made
- * 'S' will always be tp->tcp_header_len >> 2
-@@ -5205,7 +5295,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
- }
- if (copied_early)
-- tcp_cleanup_rbuf(sk, skb->len);
-+ tp->ops->cleanup_rbuf(sk, skb->len);
- }
- if (!eaten) {
- if (tcp_checksum_complete_user(sk, skb))
-@@ -5313,14 +5403,14 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
-
- tcp_init_metrics(sk);
-
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- /* Prevent spurious tcp_cwnd_restart() on first data
- * packet.
- */
- tp->lsndtime = tcp_time_stamp;
-
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
-
- if (sock_flag(sk, SOCK_KEEPOPEN))
- inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
-@@ -5350,7 +5440,7 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
- /* Get original SYNACK MSS value if user MSS sets mss_clamp */
- tcp_clear_options(&opt);
- opt.user_mss = opt.mss_clamp = 0;
-- tcp_parse_options(synack, &opt, 0, NULL);
-+ tcp_parse_options(synack, &opt, NULL, 0, NULL);
- mss = opt.mss_clamp;
- }
-
-@@ -5365,7 +5455,11 @@ static bool tcp_rcv_fastopen_synack(struct sock *sk, struct sk_buff *synack,
-
- tcp_fastopen_cache_set(sk, mss, cookie, syn_drop);
-
-- if (data) { /* Retransmit unacked data in SYN */
-+ /* In mptcp case, we do not rely on "retransmit", but instead on
-+ * "transmit", because if fastopen data is not acked, the retransmission
-+ * becomes the first MPTCP data (see mptcp_rcv_synsent_fastopen).
-+ */
-+ if (data && !mptcp(tp)) { /* Retransmit unacked data in SYN */
- tcp_for_write_queue_from(data, sk) {
- if (data == tcp_send_head(sk) ||
- __tcp_retransmit_skb(sk, data))
-@@ -5388,8 +5482,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcp_sock *tp = tcp_sk(sk);
- struct tcp_fastopen_cookie foc = { .len = -1 };
- int saved_clamp = tp->rx_opt.mss_clamp;
-+ struct mptcp_options_received mopt;
-+ mptcp_init_mp_opt(&mopt);
-
-- tcp_parse_options(skb, &tp->rx_opt, 0, &foc);
-+ tcp_parse_options(skb, &tp->rx_opt,
-+ mptcp(tp) ? &tp->mptcp->rx_opt : &mopt, 0, &foc);
- if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
- tp->rx_opt.rcv_tsecr -= tp->tsoffset;
-
-@@ -5448,6 +5545,30 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
- tcp_ack(sk, skb, FLAG_SLOWPATH);
-
-+ if (tp->request_mptcp || mptcp(tp)) {
-+ int ret;
-+ ret = mptcp_rcv_synsent_state_process(sk, &sk,
-+ skb, &mopt);
-+
-+ /* May have changed if we support MPTCP */
-+ tp = tcp_sk(sk);
-+ icsk = inet_csk(sk);
-+
-+ if (ret == 1)
-+ goto reset_and_undo;
-+ if (ret == 2)
-+ goto discard;
-+ }
-+
-+ if (mptcp(tp) && !is_master_tp(tp)) {
-+ /* Timer for repeating the ACK until an answer
-+ * arrives. Used only when establishing an additional
-+ * subflow inside of an MPTCP connection.
-+ */
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ }
-+
- /* Ok.. it's good. Set up sequence numbers and
- * move to established.
- */
-@@ -5474,6 +5595,11 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- if (tcp_is_sack(tp) && sysctl_tcp_fack)
- tcp_enable_fack(tp);
-
-@@ -5494,9 +5620,12 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_rcv_fastopen_synack(sk, skb, &foc))
- return -1;
-
-- if (sk->sk_write_pending ||
-+ /* With MPTCP we cannot send data on the third ack due to the
-+ * lack of option-space to combine with an MP_CAPABLE.
-+ */
-+ if (!mptcp(tp) && (sk->sk_write_pending ||
- icsk->icsk_accept_queue.rskq_defer_accept ||
-- icsk->icsk_ack.pingpong) {
-+ icsk->icsk_ack.pingpong)) {
- /* Save one ACK. Data will be ready after
- * several ticks, if write_pending is set.
- *
-@@ -5536,6 +5665,7 @@ discard:
- tcp_paws_reject(&tp->rx_opt, 0))
- goto discard_and_undo;
-
-+ /* TODO - check this here for MPTCP */
- if (th->syn) {
- /* We see SYN without ACK. It is attempt of
- * simultaneous connect with crossed SYNs.
-@@ -5552,6 +5682,11 @@ discard:
- tp->tcp_header_len = sizeof(struct tcphdr);
- }
-
-+ if (mptcp(tp)) {
-+ tp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-+ }
-+
- tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
- tp->rcv_wup = TCP_SKB_CB(skb)->seq + 1;
-
-@@ -5610,6 +5745,7 @@ reset_and_undo:
-
- int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- const struct tcphdr *th, unsigned int len)
-+ __releases(&sk->sk_lock.slock)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
-@@ -5661,6 +5797,16 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_SYN_SENT:
- queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
-+ if (is_meta_sk(sk)) {
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ tp = tcp_sk(sk);
-+
-+ /* Need to call it here, because it will announce new
-+ * addresses, which can only be done after the third ack
-+ * of the 3-way handshake.
-+ */
-+ mptcp_update_metasocket(sk, tp->meta_sk);
-+ }
- if (queued >= 0)
- return queued;
-
-@@ -5668,6 +5814,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tcp_urg(sk, skb, th);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
-+ if (mptcp(tp) && is_master_tp(tp))
-+ bh_unlock_sock(sk);
- return 0;
- }
-
-@@ -5706,11 +5854,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- synack_stamp = tp->lsndtime;
- /* Make sure socket is routed, for correct metrics. */
- icsk->icsk_af_ops->rebuild_header(sk);
-- tcp_init_congestion_control(sk);
-+ tp->ops->init_congestion_control(sk);
-
- tcp_mtup_init(sk);
- tp->copied_seq = tp->rcv_nxt;
-- tcp_init_buffer_space(sk);
-+ tp->ops->init_buffer_space(sk);
- }
- smp_mb();
- tcp_set_state(sk, TCP_ESTABLISHED);
-@@ -5730,6 +5878,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- if (tp->rx_opt.tstamp_ok)
- tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
-+ if (mptcp(tp))
-+ tp->advmss -= MPTCP_SUB_LEN_DSM_ALIGN;
-
- if (req) {
- /* Re-arm the timer because data may have been sent out.
-@@ -5751,6 +5901,12 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- tcp_initialize_rcv_mss(sk);
- tcp_fast_path_on(tp);
-+ /* Send an ACK when establishing a new
-+ * MPTCP subflow, i.e. using an MP_JOIN
-+ * subtype.
-+ */
-+ if (mptcp(tp) && !is_master_tp(tp))
-+ tcp_send_ack(sk);
- break;
-
- case TCP_FIN_WAIT1: {
-@@ -5802,7 +5958,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- tmo = tcp_fin_time(sk);
- if (tmo > TCP_TIMEWAIT_LEN) {
- inet_csk_reset_keepalive_timer(sk, tmo - TCP_TIMEWAIT_LEN);
-- } else if (th->fin || sock_owned_by_user(sk)) {
-+ } else if (th->fin || mptcp_is_data_fin(skb) ||
-+ sock_owned_by_user(sk)) {
- /* Bad case. We could lose such FIN otherwise.
- * It is not a big problem, but it looks confusing
- * and not so rare event. We still can lose it now,
-@@ -5811,7 +5968,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- inet_csk_reset_keepalive_timer(sk, tmo);
- } else {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto discard;
- }
- break;
-@@ -5819,7 +5976,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
-
- case TCP_CLOSING:
- if (tp->snd_una == tp->write_seq) {
-- tcp_time_wait(sk, TCP_TIME_WAIT, 0);
-+ tp->ops->time_wait(sk, TCP_TIME_WAIT, 0);
- goto discard;
- }
- break;
-@@ -5831,6 +5988,9 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- goto discard;
- }
- break;
-+ case TCP_CLOSE:
-+ if (tp->mp_killed)
-+ goto discard;
- }
-
- /* step 6: check the URG bit */
-@@ -5851,7 +6011,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- */
- if (sk->sk_shutdown & RCV_SHUTDOWN) {
- if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-- after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp(tp)) {
-+ /* In case of mptcp, the reset is handled by
-+ * mptcp_rcv_state_process
-+ */
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
- tcp_reset(sk);
- return 1;
-@@ -5877,3 +6041,154 @@ discard:
- return 0;
- }
- EXPORT_SYMBOL(tcp_rcv_state_process);
-+
-+static inline void pr_drop_req(struct request_sock *req, __u16 port, int family)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ if (family == AF_INET)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-+ &ireq->ir_rmt_addr, port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (family == AF_INET6)
-+ LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI6/%u\n"),
-+ &ireq->ir_v6_rmt_addr, port);
-+#endif
-+}
-+
-+int tcp_conn_request(struct request_sock_ops *rsk_ops,
-+ const struct tcp_request_sock_ops *af_ops,
-+ struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_options_received tmp_opt;
-+ struct request_sock *req;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct dst_entry *dst = NULL;
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false, fastopen;
-+ struct flowi fl;
-+ struct tcp_fastopen_cookie foc = { .len = -1 };
-+ int err;
-+
-+
-+ /* TW buckets are converted to open requests without
-+ * limitations, they conserve resources and peer is
-+ * evidently real one.
-+ */
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+
-+ /* Accept backlog is full. If we have already queued enough
-+ * of warm entries in syn queue, drop request. It is better than
-+ * clogging syn queue with openreqs with exponentially increasing
-+ * timeout.
-+ */
-+ if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-+ goto drop;
-+ }
-+
-+ req = inet_reqsk_alloc(rsk_ops);
-+ if (!req)
-+ goto drop;
-+
-+ tcp_rsk(req)->af_specific = af_ops;
-+
-+ tcp_clear_options(&tmp_opt);
-+ tmp_opt.mss_clamp = af_ops->mss_clamp;
-+ tmp_opt.user_mss = tp->rx_opt.user_mss;
-+ tcp_parse_options(skb, &tmp_opt, NULL, 0, want_cookie ? NULL : &foc);
-+
-+ if (want_cookie && !tmp_opt.saw_tstamp)
-+ tcp_clear_options(&tmp_opt);
-+
-+ tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-+ tcp_openreq_init(req, &tmp_opt, skb);
-+
-+ if (af_ops->init_req(req, sk, skb))
-+ goto drop_and_free;
-+
-+ if (security_inet_conn_request(sk, skb, req))
-+ goto drop_and_free;
-+
-+ if (!want_cookie || tmp_opt.tstamp_ok)
-+ TCP_ECN_create_request(req, skb, sock_net(sk));
-+
-+ if (want_cookie) {
-+ isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-+ req->cookie_ts = tmp_opt.tstamp_ok;
-+ } else if (!isn) {
-+ /* VJ's idea. We save last timestamp seen
-+ * from the destination in peer table, when entering
-+ * state TIME-WAIT, and check against it before
-+ * accepting new connection request.
-+ *
-+ * If "isn" is not zero, this request hit alive
-+ * timewait bucket, so that all the necessary checks
-+ * are made in the function processing timewait state.
-+ */
-+ if (tmp_opt.saw_tstamp && tcp_death_row.sysctl_tw_recycle) {
-+ bool strict;
-+
-+ dst = af_ops->route_req(sk, &fl, req, &strict);
-+ if (dst && strict &&
-+ !tcp_peer_is_proven(req, dst, true)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-+ goto drop_and_release;
-+ }
-+ }
-+ /* Kill the following clause, if you dislike this way. */
-+ else if (!sysctl_tcp_syncookies &&
-+ (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-+ (sysctl_max_syn_backlog >> 2)) &&
-+ !tcp_peer_is_proven(req, dst, false)) {
-+ /* Without syncookies last quarter of
-+ * backlog is filled with destinations,
-+ * proven to be alive.
-+ * It means that we continue to communicate
-+ * to destinations, already remembered
-+ * to the moment of synflood.
-+ */
-+ pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
-+ rsk_ops->family);
-+ goto drop_and_release;
-+ }
-+
-+ isn = af_ops->init_seq(skb);
-+ }
-+ if (!dst) {
-+ dst = af_ops->route_req(sk, &fl, req, NULL);
-+ if (!dst)
-+ goto drop_and_free;
-+ }
-+
-+ tcp_rsk(req)->snt_isn = isn;
-+ tcp_openreq_init_rwin(req, sk, dst);
-+ fastopen = !want_cookie &&
-+ tcp_try_fastopen(sk, skb, req, &foc, dst);
-+ err = af_ops->send_synack(sk, dst, &fl, req,
-+ skb_get_queue_mapping(skb), &foc);
-+ if (!fastopen) {
-+ if (err || want_cookie)
-+ goto drop_and_free;
-+
-+ tcp_rsk(req)->listener = NULL;
-+ af_ops->queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-+ }
-+
-+ return 0;
-+
-+drop_and_release:
-+ dst_release(dst);
-+drop_and_free:
-+ reqsk_free(req);
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+EXPORT_SYMBOL(tcp_conn_request);
-diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
-index 77cccda1ad0c..c77017f600f1 100644
---- a/net/ipv4/tcp_ipv4.c
-+++ b/net/ipv4/tcp_ipv4.c
-@@ -67,6 +67,8 @@
- #include <net/icmp.h>
- #include <net/inet_hashtables.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/transp_v6.h>
- #include <net/ipv6.h>
- #include <net/inet_common.h>
-@@ -99,7 +101,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
- struct inet_hashinfo tcp_hashinfo;
- EXPORT_SYMBOL(tcp_hashinfo);
-
--static inline __u32 tcp_v4_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v4_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcp_sequence_number(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr,
-@@ -334,7 +336,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- struct inet_sock *inet;
- const int type = icmp_hdr(icmp_skb)->type;
- const int code = icmp_hdr(icmp_skb)->code;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- struct sk_buff *skb;
- struct request_sock *fastopen;
- __u32 seq, snd_una;
-@@ -358,13 +360,19 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- return;
- }
-
-- bh_lock_sock(sk);
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
- /* If too many ICMPs get dropped on busy
- * servers this needs to be solved differently.
- * We do take care of PMTU discovery (RFC1191) special case :
- * we can receive locally generated ICMP messages while socket is held.
- */
-- if (sock_owned_by_user(sk)) {
-+ if (sock_owned_by_user(meta_sk)) {
- if (!(type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED))
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
- }
-@@ -377,7 +385,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- icsk = inet_csk(sk);
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -411,11 +418,13 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- goto out;
-
- tp->mtu_info = info;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_v4_mtu_reduced(sk);
- } else {
- if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
- goto out;
- }
-@@ -429,7 +438,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- !icsk->icsk_backoff || fastopen)
- break;
-
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- break;
-
- icsk->icsk_backoff--;
-@@ -463,7 +472,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet_csk_search_req(sk, &prev, th->dest,
-@@ -499,7 +508,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
-
- sk->sk_error_report(sk);
-@@ -528,7 +537,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- */
-
- inet = inet_sk(sk);
-- if (!sock_owned_by_user(sk) && inet->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && inet->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else { /* Only an error on timeout */
-@@ -536,7 +545,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
- }
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -578,7 +587,7 @@ EXPORT_SYMBOL(tcp_v4_send_check);
- * Exception: precedence violation. We do not implement it in any case.
- */
-
--static void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v4_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -702,10 +711,10 @@ release_sk1:
- outside socket context is ugly, certainly. What can I do?
- */
-
--static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key,
-- int reply_flags, u8 tos)
-+ int reply_flags, u8 tos, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct {
-@@ -714,6 +723,10 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- #ifdef CONFIG_TCP_MD5SIG
- + (TCPOLEN_MD5SIG_ALIGNED >> 2)
- #endif
-+#ifdef CONFIG_MPTCP
-+ + ((MPTCP_SUB_LEN_DSS >> 2) +
-+ (MPTCP_SUB_LEN_ACK >> 2))
-+#endif
- ];
- } rep;
- struct ip_reply_arg arg;
-@@ -758,6 +771,21 @@ static void tcp_v4_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
- ip_hdr(skb)->daddr, &rep.th);
- }
- #endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ int offset = (tsecr) ? 3 : 0;
-+ /* Construction of 32-bit data_ack */
-+ rep.opt[offset++] = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ rep.opt[offset] = htonl(data_ack);
-+
-+ arg.iov[0].iov_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+ rep.th.doff = arg.iov[0].iov_len / 4;
-+ }
-+#endif /* CONFIG_MPTCP */
-+
- arg.flags = reply_flags;
- arg.csum = csum_tcpudp_nofold(ip_hdr(skb)->daddr,
- ip_hdr(skb)->saddr, /* XXX */
-@@ -776,36 +804,44 @@ static void tcp_v4_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-+
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
-
- tcp_v4_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent,
- tw->tw_bound_dev_if,
- tcp_twsk_md5_key(tcptw),
- tw->tw_transparent ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- tw->tw_tos
-+ tw->tw_tos, mptcp
- );
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v4_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt, req->rcv_wnd,
-+ tcp_rsk(req)->rcv_nxt, 0, req->rcv_wnd,
- tcp_time_stamp,
- req->ts_recent,
- 0,
- tcp_md5_do_lookup(sk, (union tcp_md5_addr *)&ip_hdr(skb)->daddr,
- AF_INET),
- inet_rsk(req)->no_srccheck ? IP_REPLY_ARG_NOSRCCHECK : 0,
-- ip_hdr(skb)->tos);
-+ ip_hdr(skb)->tos, 0);
- }
-
- /*
-@@ -813,10 +849,11 @@ static void tcp_v4_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
- * This still operates on a request_sock only, not on a big
- * socket.
- */
--static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- const struct inet_request_sock *ireq = inet_rsk(req);
- struct flowi4 fl4;
-@@ -844,21 +881,10 @@ static int tcp_v4_send_synack(struct sock *sk, struct dst_entry *dst,
- return err;
- }
-
--static int tcp_v4_rtx_synack(struct sock *sk, struct request_sock *req)
--{
-- int res = tcp_v4_send_synack(sk, NULL, req, 0, NULL);
--
-- if (!res) {
-- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-- }
-- return res;
--}
--
- /*
- * IPv4 request_sock destructor.
- */
--static void tcp_v4_reqsk_destructor(struct request_sock *req)
-+void tcp_v4_reqsk_destructor(struct request_sock *req)
- {
- kfree(inet_rsk(req)->opt);
- }
-@@ -896,7 +922,7 @@ EXPORT_SYMBOL(tcp_syn_flood_action);
- /*
- * Save and compile IPv4 options into the request_sock if needed.
- */
--static struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
-+struct ip_options_rcu *tcp_v4_save_options(struct sk_buff *skb)
- {
- const struct ip_options *opt = &(IPCB(skb)->opt);
- struct ip_options_rcu *dopt = NULL;
-@@ -1237,161 +1263,71 @@ static bool tcp_v4_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
-
- #endif
-
-+static int tcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+
-+ ireq->ir_loc_addr = ip_hdr(skb)->daddr;
-+ ireq->ir_rmt_addr = ip_hdr(skb)->saddr;
-+ ireq->no_srccheck = inet_sk(sk)->transparent;
-+ ireq->opt = tcp_v4_save_options(skb);
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v4_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ struct dst_entry *dst = inet_csk_route_req(sk, &fl->u.ip4, req);
-+
-+ if (strict) {
-+ if (fl->u.ip4.daddr == inet_rsk(req)->ir_rmt_addr)
-+ *strict = true;
-+ else
-+ *strict = false;
-+ }
-+
-+ return dst;
-+}
-+
- struct request_sock_ops tcp_request_sock_ops __read_mostly = {
- .family = PF_INET,
- .obj_size = sizeof(struct tcp_request_sock),
-- .rtx_syn_ack = tcp_v4_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v4_reqsk_send_ack,
- .destructor = tcp_v4_reqsk_destructor,
- .send_reset = tcp_v4_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
-+ .mss_clamp = TCP_MSS_DEFAULT,
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
- .md5_lookup = tcp_v4_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v4_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v4_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v4_init_sequence,
-+#endif
-+ .route_req = tcp_v4_route_req,
-+ .init_seq = tcp_v4_init_sequence,
-+ .send_synack = tcp_v4_send_synack,
-+ .queue_hash_add = inet_csk_reqsk_queue_hash_add,
-+};
-
- int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct tcp_sock *tp = tcp_sk(sk);
-- struct dst_entry *dst = NULL;
-- __be32 saddr = ip_hdr(skb)->saddr;
-- __be32 daddr = ip_hdr(skb)->daddr;
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- bool want_cookie = false, fastopen;
-- struct flowi4 fl4;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- int err;
--
- /* Never answer to SYNs send to broadcast or multicast */
- if (skb_rtable(skb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
- goto drop;
-
-- /* TW buckets are converted to open requests without
-- * limitations, they conserve resources and peer is
-- * evidently real one.
-- */
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- /* Accept backlog is full. If we have already queued enough
-- * of warm entries in syn queue, drop request. It is better than
-- * clogging syn queue with openreqs with exponentially increasing
-- * timeout.
-- */
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet_reqsk_alloc(&tcp_request_sock_ops);
-- if (!req)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
--
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
-+ return tcp_conn_request(&tcp_request_sock_ops,
-+ &tcp_request_sock_ipv4_ops, sk, skb);
-
-- ireq = inet_rsk(req);
-- ireq->ir_loc_addr = daddr;
-- ireq->ir_rmt_addr = saddr;
-- ireq->no_srccheck = inet_sk(sk)->transparent;
-- ireq->opt = tcp_v4_save_options(skb);
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_free;
--
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- if (want_cookie) {
-- isn = cookie_v4_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- } else if (!isn) {
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet_csk_route_req(sk, &fl4, req)) != NULL &&
-- fl4.daddr == saddr) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG pr_fmt("drop open request from %pI4/%u\n"),
-- &saddr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v4_init_sequence(skb);
-- }
-- if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v4_send_synack(sk, dst, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_rsk(req)->listener = NULL;
-- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
--
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0;
-@@ -1497,7 +1433,7 @@ put_and_exit:
- }
- EXPORT_SYMBOL(tcp_v4_syn_recv_sock);
-
--static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct tcphdr *th = tcp_hdr(skb);
- const struct iphdr *iph = ip_hdr(skb);
-@@ -1514,8 +1450,15 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock the meta-sk again. It has been locked
-+ * before mptcp_v4_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
-+
- }
- inet_twsk_put(inet_twsk(nsk));
- return NULL;
-@@ -1550,6 +1493,9 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v4_do_rcv(sk, skb);
-+
- if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
- struct dst_entry *dst = sk->sk_rx_dst;
-
-@@ -1681,7 +1627,7 @@ bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
- } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
- wake_up_interruptible_sync_poll(sk_sleep(sk),
- POLLIN | POLLRDNORM | POLLRDBAND);
-- if (!inet_csk_ack_scheduled(sk))
-+ if (!inet_csk_ack_scheduled(sk) && !mptcp(tp))
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- (3 * tcp_rto_min(sk)) / 4,
- TCP_RTO_MAX);
-@@ -1698,7 +1644,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
- {
- const struct iphdr *iph;
- const struct tcphdr *th;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1732,18 +1678,42 @@ int tcp_v4_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff * 4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv4_get_dsfield(iph);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1759,11 +1729,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1771,16 +1751,16 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v4_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
-
-@@ -1835,6 +1815,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v4_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1900,7 +1892,12 @@ static int tcp_v4_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv4_specific;
-@@ -1917,6 +1914,11 @@ void tcp_v4_destroy_sock(struct sock *sk)
-
- tcp_cleanup_congestion_control(sk);
-
-+ if (mptcp(tp))
-+ mptcp_destroy_sock(sk);
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+
- /* Cleanup up the write buffer. */
- tcp_write_queue_purge(sk);
-
-@@ -2481,6 +2483,19 @@ void tcp4_proc_exit(void)
- }
- #endif /* CONFIG_PROC_FS */
-
-+#ifdef CONFIG_MPTCP
-+static void tcp_v4_clear_sk(struct sock *sk, int size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* We do not want to clear the tk_table field, because of RCU lookups */
-+ sk_prot_clear_nulls(sk, offsetof(struct tcp_sock, tk_table));
-+
-+ size -= offsetof(struct tcp_sock, tk_table) + sizeof(tp->tk_table);
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size);
-+}
-+#endif
-+
- struct proto tcp_prot = {
- .name = "TCP",
- .owner = THIS_MODULE,
-@@ -2528,6 +2543,9 @@ struct proto tcp_prot = {
- .destroy_cgroup = tcp_destroy_cgroup,
- .proto_cgroup = tcp_proto_cgroup,
- #endif
-+#ifdef CONFIG_MPTCP
-+ .clear_sk = tcp_v4_clear_sk,
-+#endif
- };
- EXPORT_SYMBOL(tcp_prot);
-
-diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
-index e68e0d4af6c9..ae6946857dff 100644
---- a/net/ipv4/tcp_minisocks.c
-+++ b/net/ipv4/tcp_minisocks.c
-@@ -18,11 +18,13 @@
- * Jorge Cwik, <jorge@laser.satlink.net>
- */
-
-+#include <linux/kconfig.h>
- #include <linux/mm.h>
- #include <linux/module.h>
- #include <linux/slab.h>
- #include <linux/sysctl.h>
- #include <linux/workqueue.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
- #include <net/inet_common.h>
- #include <net/xfrm.h>
-@@ -95,10 +97,13 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- struct tcp_options_received tmp_opt;
- struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
- bool paws_reject = false;
-+ struct mptcp_options_received mopt;
-
- tmp_opt.saw_tstamp = 0;
- if (th->doff > (sizeof(*th) >> 2) && tcptw->tw_ts_recent_stamp) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ mptcp_init_mp_opt(&mopt);
-+
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
-@@ -106,6 +111,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- tmp_opt.ts_recent_stamp = tcptw->tw_ts_recent_stamp;
- paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
- }
-+
-+ if (unlikely(mopt.mp_fclose) && tcptw->mptcp_tw) {
-+ if (mopt.mptcp_key == tcptw->mptcp_tw->loc_key)
-+ goto kill_with_rst;
-+ }
- }
-
- if (tw->tw_substate == TCP_FIN_WAIT2) {
-@@ -128,6 +138,16 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
- if (!th->ack ||
- !after(TCP_SKB_CB(skb)->end_seq, tcptw->tw_rcv_nxt) ||
- TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
-+ /* If mptcp_is_data_fin() returns true, we are sure that
-+ * mopt has been initialized - otherwise it would not
-+ * be a DATA_FIN.
-+ */
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw &&
-+ mptcp_is_data_fin(skb) &&
-+ TCP_SKB_CB(skb)->seq == tcptw->tw_rcv_nxt &&
-+ mopt.data_seq + 1 == (u32)tcptw->mptcp_tw->rcv_nxt)
-+ return TCP_TW_ACK;
-+
- inet_twsk_put(tw);
- return TCP_TW_SUCCESS;
- }
-@@ -290,6 +310,15 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
- tcptw->tw_ts_offset = tp->tsoffset;
-
-+ if (mptcp(tp)) {
-+ if (mptcp_init_tw_sock(sk, tcptw)) {
-+ inet_twsk_free(tw);
-+ goto exit;
-+ }
-+ } else {
-+ tcptw->mptcp_tw = NULL;
-+ }
-+
- #if IS_ENABLED(CONFIG_IPV6)
- if (tw->tw_family == PF_INET6) {
- struct ipv6_pinfo *np = inet6_sk(sk);
-@@ -347,15 +376,18 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEWAITOVERFLOW);
- }
-
-+exit:
- tcp_update_metrics(sk);
- tcp_done(sk);
- }
-
- void tcp_twsk_destructor(struct sock *sk)
- {
--#ifdef CONFIG_TCP_MD5SIG
- struct tcp_timewait_sock *twsk = tcp_twsk(sk);
-
-+ if (twsk->mptcp_tw)
-+ mptcp_twsk_destructor(twsk);
-+#ifdef CONFIG_TCP_MD5SIG
- if (twsk->tw_md5_key)
- kfree_rcu(twsk->tw_md5_key, rcu);
- #endif
-@@ -382,13 +414,14 @@ void tcp_openreq_init_rwin(struct request_sock *req,
- req->window_clamp = tcp_full_space(sk);
-
- /* tcp_full_space because it is guaranteed to be the first packet */
-- tcp_select_initial_window(tcp_full_space(sk),
-- mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) -
-+ (ireq->saw_mpc ? MPTCP_SUB_LEN_DSM_ALIGN : 0),
- &req->rcv_wnd,
- &req->window_clamp,
- ireq->wscale_ok,
- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ dst_metric(dst, RTAX_INITRWND), sk);
- ireq->rcv_wscale = rcv_wscale;
- }
- EXPORT_SYMBOL(tcp_openreq_init_rwin);
-@@ -499,6 +532,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
- newtp->rx_opt.ts_recent_stamp = 0;
- newtp->tcp_header_len = sizeof(struct tcphdr);
- }
-+ if (ireq->saw_mpc)
-+ newtp->tcp_header_len += MPTCP_SUB_LEN_DSM_ALIGN;
- newtp->tsoffset = 0;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->md5sig_info = NULL; /*XXX*/
-@@ -535,16 +570,20 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- bool fastopen)
- {
- struct tcp_options_received tmp_opt;
-+ struct mptcp_options_received mopt;
- struct sock *child;
- const struct tcphdr *th = tcp_hdr(skb);
- __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
- bool paws_reject = false;
-
-- BUG_ON(fastopen == (sk->sk_state == TCP_LISTEN));
-+ BUG_ON(!mptcp(tcp_sk(sk)) && fastopen == (sk->sk_state == TCP_LISTEN));
-
- tmp_opt.saw_tstamp = 0;
-+
-+ mptcp_init_mp_opt(&mopt);
-+
- if (th->doff > (sizeof(struct tcphdr)>>2)) {
-- tcp_parse_options(skb, &tmp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tmp_opt, &mopt, 0, NULL);
-
- if (tmp_opt.saw_tstamp) {
- tmp_opt.ts_recent = req->ts_recent;
-@@ -583,7 +622,14 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- *
- * Reset timer after retransmitting SYNACK, similar to
- * the idea of fast retransmit in recovery.
-+ *
-+ * Fall back to TCP if MP_CAPABLE is not set.
- */
-+
-+ if (inet_rsk(req)->saw_mpc && !mopt.saw_mpc)
-+ inet_rsk(req)->saw_mpc = false;
-+
-+
- if (!inet_rtx_syn_ack(sk, req))
- req->expires = min(TCP_TIMEOUT_INIT << req->num_timeout,
- TCP_RTO_MAX) + jiffies;
-@@ -718,9 +764,21 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
- * socket is created, wait for troubles.
- */
- child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
-+
- if (child == NULL)
- goto listen_overflow;
-
-+ if (!is_meta_sk(sk)) {
-+ int ret = mptcp_check_req_master(sk, child, req, prev);
-+ if (ret < 0)
-+ goto listen_overflow;
-+
-+ /* MPTCP-supported */
-+ if (!ret)
-+ return tcp_sk(child)->mpcb->master_sk;
-+ } else {
-+ return mptcp_check_req_child(sk, child, req, prev, &mopt);
-+ }
- inet_csk_reqsk_queue_unlink(sk, req, prev);
- inet_csk_reqsk_queue_removed(sk, req);
-
-@@ -746,7 +804,17 @@ embryonic_reset:
- tcp_reset(sk);
- }
- if (!fastopen) {
-- inet_csk_reqsk_queue_drop(sk, req, prev);
-+ if (is_meta_sk(sk)) {
-+ /* We want to avoid stopping the keepalive-timer and so
-+ * avoid ending up in inet_csk_reqsk_queue_removed ...
-+ */
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
-+ mptcp_delete_synack_timer(sk);
-+ reqsk_free(req);
-+ } else {
-+ inet_csk_reqsk_queue_drop(sk, req, prev);
-+ }
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
- }
- return NULL;
-@@ -770,8 +838,9 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- {
- int ret = 0;
- int state = child->sk_state;
-+ struct sock *meta_sk = mptcp(tcp_sk(child)) ? mptcp_meta_sk(child) : child;
-
-- if (!sock_owned_by_user(child)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
- skb->len);
- /* Wakeup parent, send SIGIO */
-@@ -782,10 +851,14 @@ int tcp_child_process(struct sock *parent, struct sock *child,
- * in main socket hash table and lock on listening
- * socket does not protect us more.
- */
-- __sk_add_backlog(child, skb);
-+ if (mptcp(tcp_sk(child)))
-+ skb->sk = child;
-+ __sk_add_backlog(meta_sk, skb);
- }
-
-- bh_unlock_sock(child);
-+ if (mptcp(tcp_sk(child)))
-+ bh_unlock_sock(child);
-+ bh_unlock_sock(meta_sk);
- sock_put(child);
- return ret;
- }
-diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
-index 179b51e6bda3..efd31b6c5784 100644
---- a/net/ipv4/tcp_output.c
-+++ b/net/ipv4/tcp_output.c
-@@ -36,6 +36,12 @@
-
- #define pr_fmt(fmt) "TCP: " fmt
-
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/ipv6.h>
- #include <net/tcp.h>
-
- #include <linux/compiler.h>
-@@ -68,11 +74,8 @@ int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
- unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
- EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
-
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-- int push_one, gfp_t gfp);
--
- /* Account for new data that has been sent to the network. */
--static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
-+void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -214,7 +217,7 @@ u32 tcp_default_init_rwnd(u32 mss)
- void tcp_select_initial_window(int __space, __u32 mss,
- __u32 *rcv_wnd, __u32 *window_clamp,
- int wscale_ok, __u8 *rcv_wscale,
-- __u32 init_rcv_wnd)
-+ __u32 init_rcv_wnd, const struct sock *sk)
- {
- unsigned int space = (__space < 0 ? 0 : __space);
-
-@@ -269,12 +272,16 @@ EXPORT_SYMBOL(tcp_select_initial_window);
- * value can be stuffed directly into th->window for an outgoing
- * frame.
- */
--static u16 tcp_select_window(struct sock *sk)
-+u16 tcp_select_window(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- u32 old_win = tp->rcv_wnd;
-- u32 cur_win = tcp_receive_window(tp);
-- u32 new_win = __tcp_select_window(sk);
-+ /* The window must never shrink at the meta-level. At the subflow we
-+ * have to allow this. Otherwise we may announce a window too large
-+ * for the current meta-level sk_rcvbuf.
-+ */
-+ u32 cur_win = tcp_receive_window(mptcp(tp) ? tcp_sk(mptcp_meta_sk(sk)) : tp);
-+ u32 new_win = tp->ops->__select_window(sk);
-
- /* Never shrink the offered window */
- if (new_win < cur_win) {
-@@ -290,6 +297,7 @@ static u16 tcp_select_window(struct sock *sk)
- LINUX_MIB_TCPWANTZEROWINDOWADV);
- new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
- }
-+
- tp->rcv_wnd = new_win;
- tp->rcv_wup = tp->rcv_nxt;
-
-@@ -374,7 +382,7 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb,
- /* Constructs common control bits of non-data skb. If SYN/FIN is present,
- * auto increment end seqno.
- */
--static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
-+void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
-@@ -394,7 +402,7 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
- TCP_SKB_CB(skb)->end_seq = seq;
- }
-
--static inline bool tcp_urg_mode(const struct tcp_sock *tp)
-+bool tcp_urg_mode(const struct tcp_sock *tp)
- {
- return tp->snd_una != tp->snd_up;
- }
-@@ -404,17 +412,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp)
- #define OPTION_MD5 (1 << 2)
- #define OPTION_WSCALE (1 << 3)
- #define OPTION_FAST_OPEN_COOKIE (1 << 8)
--
--struct tcp_out_options {
-- u16 options; /* bit field of OPTION_* */
-- u16 mss; /* 0 to disable */
-- u8 ws; /* window scale, 0 to disable */
-- u8 num_sack_blocks; /* number of SACK blocks to include */
-- u8 hash_size; /* bytes in hash_location */
-- __u8 *hash_location; /* temporary pointer, overloaded */
-- __u32 tsval, tsecr; /* need to include OPTION_TS */
-- struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */
--};
-+/* Before adding here - take a look at OPTION_MPTCP in include/net/mptcp.h */
-
- /* Write previously computed TCP options to the packet.
- *
-@@ -430,7 +428,7 @@ struct tcp_out_options {
- * (but it may well be that other scenarios fail similarly).
- */
- static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-- struct tcp_out_options *opts)
-+ struct tcp_out_options *opts, struct sk_buff *skb)
- {
- u16 options = opts->options; /* mungable copy */
-
-@@ -513,6 +511,9 @@ static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
- }
- ptr += (foc->len + 3) >> 2;
- }
-+
-+ if (unlikely(OPTION_MPTCP & opts->options))
-+ mptcp_options_write(ptr, tp, opts, skb);
- }
-
- /* Compute TCP options for SYN packets. This is not the final
-@@ -564,6 +565,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
- if (unlikely(!(OPTION_TS & opts->options)))
- remaining -= TCPOLEN_SACKPERM_ALIGNED;
- }
-+ if (tp->request_mptcp || mptcp(tp))
-+ mptcp_syn_options(sk, opts, &remaining);
-
- if (fastopen && fastopen->cookie.len >= 0) {
- u32 need = TCPOLEN_EXP_FASTOPEN_BASE + fastopen->cookie.len;
-@@ -637,6 +640,9 @@ static unsigned int tcp_synack_options(struct sock *sk,
- }
- }
-
-+ if (ireq->saw_mpc)
-+ mptcp_synack_options(req, opts, &remaining);
-+
- return MAX_TCP_OPTION_SPACE - remaining;
- }
-
-@@ -670,16 +676,22 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
- opts->tsecr = tp->rx_opt.ts_recent;
- size += TCPOLEN_TSTAMP_ALIGNED;
- }
-+ if (mptcp(tp))
-+ mptcp_established_options(sk, skb, opts, &size);
-
- eff_sacks = tp->rx_opt.num_sacks + tp->rx_opt.dsack;
- if (unlikely(eff_sacks)) {
-- const unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
-- opts->num_sack_blocks =
-- min_t(unsigned int, eff_sacks,
-- (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-- TCPOLEN_SACK_PERBLOCK);
-- size += TCPOLEN_SACK_BASE_ALIGNED +
-- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
-+ const unsigned remaining = MAX_TCP_OPTION_SPACE - size;
-+ if (remaining < TCPOLEN_SACK_BASE_ALIGNED)
-+ opts->num_sack_blocks = 0;
-+ else
-+ opts->num_sack_blocks =
-+ min_t(unsigned int, eff_sacks,
-+ (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
-+ TCPOLEN_SACK_PERBLOCK);
-+ if (opts->num_sack_blocks)
-+ size += TCPOLEN_SACK_BASE_ALIGNED +
-+ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
- }
-
- return size;
-@@ -711,8 +723,8 @@ static void tcp_tsq_handler(struct sock *sk)
- if ((1 << sk->sk_state) &
- (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_CLOSING |
- TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-- tcp_write_xmit(sk, tcp_current_mss(sk), tcp_sk(sk)->nonagle,
-- 0, GFP_ATOMIC);
-+ tcp_sk(sk)->ops->write_xmit(sk, tcp_current_mss(sk),
-+ tcp_sk(sk)->nonagle, 0, GFP_ATOMIC);
- }
- /*
- * One tasklet per cpu tries to send more skbs.
-@@ -727,7 +739,7 @@ static void tcp_tasklet_func(unsigned long data)
- unsigned long flags;
- struct list_head *q, *n;
- struct tcp_sock *tp;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
-
- local_irq_save(flags);
- list_splice_init(&tsq->head, &list);
-@@ -738,15 +750,25 @@ static void tcp_tasklet_func(unsigned long data)
- list_del(&tp->tsq_node);
-
- sk = (struct sock *)tp;
-- bh_lock_sock(sk);
-+ meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-+ bh_lock_sock(meta_sk);
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_tsq_handler(sk);
-+ if (mptcp(tp))
-+ tcp_tsq_handler(meta_sk);
- } else {
-+ if (mptcp(tp) && sk->sk_state == TCP_CLOSE)
-+ goto exit;
-+
- /* defer the work to tcp_release_cb() */
- set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
-+
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+exit:
-+ bh_unlock_sock(meta_sk);
-
- clear_bit(TSQ_QUEUED, &tp->tsq_flags);
- sk_free(sk);
-@@ -756,7 +778,10 @@ static void tcp_tasklet_func(unsigned long data)
- #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) | \
- (1UL << TCP_WRITE_TIMER_DEFERRED) | \
- (1UL << TCP_DELACK_TIMER_DEFERRED) | \
-- (1UL << TCP_MTU_REDUCED_DEFERRED))
-+ (1UL << TCP_MTU_REDUCED_DEFERRED) | \
-+ (1UL << MPTCP_PATH_MANAGER) | \
-+ (1UL << MPTCP_SUB_DEFERRED))
-+
- /**
- * tcp_release_cb - tcp release_sock() callback
- * @sk: socket
-@@ -803,6 +828,13 @@ void tcp_release_cb(struct sock *sk)
- sk->sk_prot->mtu_reduced(sk);
- __sock_put(sk);
- }
-+ if (flags & (1UL << MPTCP_PATH_MANAGER)) {
-+ if (tcp_sk(sk)->mpcb->pm_ops->release_sock)
-+ tcp_sk(sk)->mpcb->pm_ops->release_sock(sk);
-+ __sock_put(sk);
-+ }
-+ if (flags & (1UL << MPTCP_SUB_DEFERRED))
-+ mptcp_tsq_sub_deferred(sk);
- }
- EXPORT_SYMBOL(tcp_release_cb);
-
-@@ -862,8 +894,8 @@ void tcp_wfree(struct sk_buff *skb)
- * We are working here with either a clone of the original
- * SKB, or a fresh unique copy made by the retransmit engine.
- */
--static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-- gfp_t gfp_mask)
-+int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
-+ gfp_t gfp_mask)
- {
- const struct inet_connection_sock *icsk = inet_csk(sk);
- struct inet_sock *inet;
-@@ -933,7 +965,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- */
- th->window = htons(min(tp->rcv_wnd, 65535U));
- } else {
-- th->window = htons(tcp_select_window(sk));
-+ th->window = htons(tp->ops->select_window(sk));
- }
- th->check = 0;
- th->urg_ptr = 0;
-@@ -949,7 +981,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- }
- }
-
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- if (likely((tcb->tcp_flags & TCPHDR_SYN) == 0))
- TCP_ECN_send(sk, skb, tcp_header_size);
-
-@@ -988,7 +1020,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
- * NOTE: probe0 timer is not checked, do not forget tcp_push_pending_frames,
- * otherwise socket can stall.
- */
--static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
-+void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1001,15 +1033,16 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
- }
-
- /* Initialize TSO segments for a packet. */
--static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- struct skb_shared_info *shinfo = skb_shinfo(skb);
-
- /* Make sure we own this skb before messing gso_size/gso_segs */
- WARN_ON_ONCE(skb_cloned(skb));
-
-- if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
-+ if (skb->len <= mss_now || (is_meta_sk(sk) && !mptcp_sk_can_gso(sk)) ||
-+ (!is_meta_sk(sk) && !sk_can_gso(sk)) || skb->ip_summed == CHECKSUM_NONE) {
- /* Avoid the costly divide in the normal
- * non-TSO case.
- */
-@@ -1041,7 +1074,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
- /* Pcount in the middle of the write queue got changed, we need to do various
- * tweaks to fix counters
- */
--static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
-+void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int decr)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-@@ -1164,7 +1197,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len,
- * eventually). The difference is that pulled data not copied, but
- * immediately discarded.
- */
--static void __pskb_trim_head(struct sk_buff *skb, int len)
-+void __pskb_trim_head(struct sk_buff *skb, int len)
- {
- struct skb_shared_info *shinfo;
- int i, k, eat;
-@@ -1205,6 +1238,9 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
- /* Remove acked data from a packet in the transmit queue. */
- int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- {
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk) && mptcp_is_data_seq(skb))
-+ return mptcp_trim_head(sk, skb, len);
-+
- if (skb_unclone(skb, GFP_ATOMIC))
- return -ENOMEM;
-
-@@ -1222,6 +1258,15 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
- if (tcp_skb_pcount(skb) > 1)
- tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-
-+#ifdef CONFIG_MPTCP
-+ /* Some data got acked - we assume that the seq-number reached the dest.
-+ * Anyway, our MPTCP-option has been trimmed above - we lost it here.
-+ * Only remove the SEQ if the call does not come from a meta retransmit.
-+ */
-+ if (mptcp(tcp_sk(sk)) && !is_meta_sk(sk))
-+ TCP_SKB_CB(skb)->mptcp_flags &= ~MPTCPHDR_SEQ;
-+#endif
-+
- return 0;
- }
-
-@@ -1379,6 +1424,7 @@ unsigned int tcp_current_mss(struct sock *sk)
-
- return mss_now;
- }
-+EXPORT_SYMBOL(tcp_current_mss);
-
- /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
- * As additional protections, we do not touch cwnd in retransmission phases,
-@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
- * But we can avoid doing the divide again given we already have
- * skb_pcount = skb->len / mss_now
- */
--static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-- const struct sk_buff *skb)
-+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
-+ const struct sk_buff *skb)
- {
- if (skb->len < tcp_skb_pcount(skb) * mss_now)
- tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
-@@ -1468,11 +1514,11 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
- (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
- }
- /* Returns the portion of skb which can be sent right away */
--static unsigned int tcp_mss_split_point(const struct sock *sk,
-- const struct sk_buff *skb,
-- unsigned int mss_now,
-- unsigned int max_segs,
-- int nonagle)
-+unsigned int tcp_mss_split_point(const struct sock *sk,
-+ const struct sk_buff *skb,
-+ unsigned int mss_now,
-+ unsigned int max_segs,
-+ int nonagle)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- u32 partial, needed, window, max_len;
-@@ -1502,13 +1548,14 @@ static unsigned int tcp_mss_split_point(const struct sock *sk,
- /* Can at least one segment of SKB be sent right now, according to the
- * congestion window rules? If so, return how many segments are allowed.
- */
--static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb)
-+unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
-+ const struct sk_buff *skb)
- {
- u32 in_flight, cwnd;
-
- /* Don't be strict about the congestion window for the final FIN. */
-- if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
-+ if (skb &&
-+ (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) &&
- tcp_skb_pcount(skb) == 1)
- return 1;
-
-@@ -1524,8 +1571,8 @@ static inline unsigned int tcp_cwnd_test(const struct tcp_sock *tp,
- * This must be invoked the first time we consider transmitting
- * SKB onto the wire.
- */
--static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-- unsigned int mss_now)
-+int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
-+ unsigned int mss_now)
- {
- int tso_segs = tcp_skb_pcount(skb);
-
-@@ -1540,8 +1587,8 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
- /* Return true if the Nagle test allows this packet to be
- * sent now.
- */
--static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-- unsigned int cur_mss, int nonagle)
-+bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss, int nonagle)
- {
- /* Nagle rule does not apply to frames, which sit in the middle of the
- * write_queue (they have no chances to get new data).
-@@ -1553,7 +1600,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- return true;
-
- /* Don't use the nagle rule for urgent data (or for the final FIN). */
-- if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
-+ if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) ||
-+ mptcp_is_data_fin(skb))
- return true;
-
- if (!tcp_nagle_check(skb->len < cur_mss, tp, nonagle))
-@@ -1563,9 +1611,8 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
- }
-
- /* Does at least the first segment of SKB fit into the send window? */
--static bool tcp_snd_wnd_test(const struct tcp_sock *tp,
-- const struct sk_buff *skb,
-- unsigned int cur_mss)
-+bool tcp_snd_wnd_test(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ unsigned int cur_mss)
- {
- u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-
-@@ -1676,7 +1723,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
- u32 send_win, cong_win, limit, in_flight;
- int win_divisor;
-
-- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
-+ if ((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) || mptcp_is_data_fin(skb))
- goto send_now;
-
- if (icsk->icsk_ca_state != TCP_CA_Open)
-@@ -1888,7 +1935,7 @@ static int tcp_mtu_probe(struct sock *sk)
- * Returns true, if no segments are in flight and we have queued segments,
- * but cannot send anything now because of SWS or another problem.
- */
--static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-+bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
- int push_one, gfp_t gfp)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -1900,7 +1947,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
-
- sent_pkts = 0;
-
-- if (!push_one) {
-+ /* pmtu not yet supported with MPTCP. Should be possible, by early
-+ * exiting the loop inside tcp_mtu_probe, making sure that only one
-+ * single DSS-mapping gets probed.
-+ */
-+ if (!push_one && !mptcp(tp)) {
- /* Do MTU probing. */
- result = tcp_mtu_probe(sk);
- if (!result) {
-@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
- int err = -1;
-
- if (tcp_send_head(sk) != NULL) {
-- err = tcp_write_xmit(sk, mss, TCP_NAGLE_OFF, 2, GFP_ATOMIC);
-+ err = tp->ops->write_xmit(sk, mss, TCP_NAGLE_OFF, 2,
-+ GFP_ATOMIC);
- goto rearm_timer;
- }
-
-@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
- if (unlikely(sk->sk_state == TCP_CLOSE))
- return;
-
-- if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
-- sk_gfp_atomic(sk, GFP_ATOMIC)))
-+ if (tcp_sk(sk)->ops->write_xmit(sk, cur_mss, nonagle, 0,
-+ sk_gfp_atomic(sk, GFP_ATOMIC)))
- tcp_check_probe_timer(sk);
- }
-
-@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
-
- BUG_ON(!skb || skb->len < mss_now);
-
-- tcp_write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1, sk->sk_allocation);
-+ tcp_sk(sk)->ops->write_xmit(sk, mss_now, TCP_NAGLE_PUSH, 1,
-+ sk->sk_allocation);
- }
-
- /* This function returns the amount that we can raise the
-@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
- if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
- return;
-
-+ /* Currently not supported for MPTCP - but it should be possible */
-+ if (mptcp(tp))
-+ return;
-+
- tcp_for_write_queue_from_safe(skb, tmp, sk) {
- if (!tcp_can_collapse(sk, skb))
- break;
-@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
-
- /* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
- th->window = htons(min(req->rcv_wnd, 65535U));
-- tcp_options_write((__be32 *)(th + 1), tp, &opts);
-+ tcp_options_write((__be32 *)(th + 1), tp, &opts, skb);
- th->doff = (tcp_header_size >> 2);
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
-
-@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
- (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
- tp->window_clamp = tcp_full_space(sk);
-
-- tcp_select_initial_window(tcp_full_space(sk),
-- tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-- &tp->rcv_wnd,
-- &tp->window_clamp,
-- sysctl_tcp_window_scaling,
-- &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk),
-+ tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
-+ &tp->rcv_wnd,
-+ &tp->window_clamp,
-+ sysctl_tcp_window_scaling,
-+ &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- tp->rx_opt.rcv_wscale = rcv_wscale;
- tp->rcv_ssthresh = tp->rcv_wnd;
-@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
- inet_csk(sk)->icsk_retransmits = 0;
- tcp_clear_retrans(tp);
-+
-+#ifdef CONFIG_MPTCP
-+ if (sysctl_mptcp_enabled && mptcp_doit(sk)) {
-+ if (is_master_tp(tp)) {
-+ tp->request_mptcp = 1;
-+ mptcp_connect_init(sk);
-+ } else if (tp->mptcp) {
-+ struct inet_sock *inet = inet_sk(sk);
-+
-+ tp->mptcp->snt_isn = tp->write_seq;
-+ tp->mptcp->init_rcv_wnd = tp->rcv_wnd;
-+
-+ /* Set nonce for new subflows */
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v4_get_nonce(
-+ inet->inet_saddr,
-+ inet->inet_daddr,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp->mptcp_loc_nonce = mptcp_v6_get_nonce(
-+ inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ inet->inet_sport,
-+ inet->inet_dport);
-+#endif
-+ }
-+ }
-+#endif
- }
-
- static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
-@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
- TCP_SKB_CB(buff)->when = tcp_time_stamp;
- tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
- }
-+EXPORT_SYMBOL(tcp_send_ack);
-
- /* This routine sends a packet with an out of date sequence
- * number. It assumes the other end will try to ack it.
-@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
- * one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
- * out-of-date with SND.UNA-1 to probe window.
- */
--static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
-+int tcp_xmit_probe_skb(struct sock *sk, int urgent)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *skb;
-@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
- struct tcp_sock *tp = tcp_sk(sk);
- int err;
-
-- err = tcp_write_wakeup(sk);
-+ err = tp->ops->write_wakeup(sk);
-
- if (tp->packets_out || !tcp_send_head(sk)) {
- /* Cancel probe timer, if it is not required. */
-@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
- TCP_RTO_MAX);
- }
- }
-+
-+int tcp_rtx_synack(struct sock *sk, struct request_sock *req)
-+{
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
-+ int res;
-+
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
-+ if (!res) {
-+ TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-+ }
-+ return res;
-+}
-+EXPORT_SYMBOL(tcp_rtx_synack);
-diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
-index 286227abed10..966b873cbf3e 100644
---- a/net/ipv4/tcp_timer.c
-+++ b/net/ipv4/tcp_timer.c
-@@ -20,6 +20,7 @@
-
- #include <linux/module.h>
- #include <linux/gfp.h>
-+#include <net/mptcp.h>
- #include <net/tcp.h>
-
- int sysctl_tcp_syn_retries __read_mostly = TCP_SYN_RETRIES;
-@@ -32,7 +33,7 @@ int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
- int sysctl_tcp_orphan_retries __read_mostly;
- int sysctl_tcp_thin_linear_timeouts __read_mostly;
-
--static void tcp_write_err(struct sock *sk)
-+void tcp_write_err(struct sock *sk)
- {
- sk->sk_err = sk->sk_err_soft ? : ETIMEDOUT;
- sk->sk_error_report(sk);
-@@ -74,7 +75,7 @@ static int tcp_out_of_resources(struct sock *sk, int do_reset)
- (!tp->snd_wnd && !tp->packets_out))
- do_reset = 1;
- if (do_reset)
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_done(sk);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY);
- return 1;
-@@ -124,10 +125,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
- */
--static bool retransmits_timed_out(struct sock *sk,
-- unsigned int boundary,
-- unsigned int timeout,
-- bool syn_set)
-+bool retransmits_timed_out(struct sock *sk, unsigned int boundary,
-+ unsigned int timeout, bool syn_set)
- {
- unsigned int linear_backoff_thresh, start_ts;
- unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
-@@ -153,7 +152,7 @@ static bool retransmits_timed_out(struct sock *sk,
- }
-
- /* A write timeout has occurred. Process the after effects. */
--static int tcp_write_timeout(struct sock *sk)
-+int tcp_write_timeout(struct sock *sk)
- {
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-@@ -171,6 +170,10 @@ static int tcp_write_timeout(struct sock *sk)
- }
- retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
- syn_set = true;
-+ /* Stop retransmitting MP_CAPABLE options in SYN if timed out. */
-+ if (tcp_sk(sk)->request_mptcp &&
-+ icsk->icsk_retransmits >= mptcp_sysctl_syn_retries())
-+ tcp_sk(sk)->request_mptcp = 0;
- } else {
- if (retransmits_timed_out(sk, sysctl_tcp_retries1, 0, 0)) {
- /* Black hole detection */
-@@ -251,18 +254,22 @@ out:
- static void tcp_delack_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_delack_timer_handler(sk);
- } else {
- inet_csk(sk)->icsk_ack.blocked = 1;
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOCKED);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_DELAYEDACKLOCKED);
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -479,6 +486,10 @@ out_reset_timer:
- __sk_dst_reset(sk);
-
- out:;
-+ if (mptcp(tp)) {
-+ mptcp_reinject_data(sk, 1);
-+ mptcp_set_rto(sk);
-+ }
- }
-
- void tcp_write_timer_handler(struct sock *sk)
-@@ -505,7 +516,7 @@ void tcp_write_timer_handler(struct sock *sk)
- break;
- case ICSK_TIME_RETRANS:
- icsk->icsk_pending = 0;
-- tcp_retransmit_timer(sk);
-+ tcp_sk(sk)->ops->retransmit_timer(sk);
- break;
- case ICSK_TIME_PROBE0:
- icsk->icsk_pending = 0;
-@@ -520,16 +531,19 @@ out:
- static void tcp_write_timer(unsigned long data)
- {
- struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp(tcp_sk(sk)) ? mptcp_meta_sk(sk) : sk;
-
-- bh_lock_sock(sk);
-- if (!sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (!sock_owned_by_user(meta_sk)) {
- tcp_write_timer_handler(sk);
- } else {
- /* deleguate our work to tcp_release_cb() */
- if (!test_and_set_bit(TCP_WRITE_TIMER_DEFERRED, &tcp_sk(sk)->tsq_flags))
- sock_hold(sk);
-+ if (mptcp(tcp_sk(sk)))
-+ mptcp_tsq_flags(sk);
- }
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-@@ -566,11 +580,12 @@ static void tcp_keepalive_timer (unsigned long data)
- struct sock *sk = (struct sock *) data;
- struct inet_connection_sock *icsk = inet_csk(sk);
- struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp(tp) ? mptcp_meta_sk(sk) : sk;
- u32 elapsed;
-
- /* Only process if socket is not in use. */
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk)) {
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
- /* Try again later. */
- inet_csk_reset_keepalive_timer (sk, HZ/20);
- goto out;
-@@ -581,16 +596,38 @@ static void tcp_keepalive_timer (unsigned long data)
- goto out;
- }
-
-+ if (tp->send_mp_fclose) {
-+ /* MUST do this before tcp_write_timeout, because retrans_stamp
-+ * may have been set to 0 in another part while we are
-+ * retransmitting MP_FASTCLOSE. Then, we would crash, because
-+ * retransmits_timed_out accesses the meta-write-queue.
-+ *
-+ * We make sure that the timestamp is != 0.
-+ */
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk))
-+ goto out;
-+
-+ tcp_send_ack(sk);
-+ icsk->icsk_retransmits++;
-+
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ elapsed = icsk->icsk_rto;
-+ goto resched;
-+ }
-+
- if (sk->sk_state == TCP_FIN_WAIT2 && sock_flag(sk, SOCK_DEAD)) {
- if (tp->linger2 >= 0) {
- const int tmo = tcp_fin_time(sk) - TCP_TIMEWAIT_LEN;
-
- if (tmo > 0) {
-- tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
-+ tp->ops->time_wait(sk, TCP_FIN_WAIT2, tmo);
- goto out;
- }
- }
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
-@@ -614,11 +651,11 @@ static void tcp_keepalive_timer (unsigned long data)
- icsk->icsk_probes_out > 0) ||
- (icsk->icsk_user_timeout == 0 &&
- icsk->icsk_probes_out >= keepalive_probes(tp))) {
-- tcp_send_active_reset(sk, GFP_ATOMIC);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
- tcp_write_err(sk);
- goto out;
- }
-- if (tcp_write_wakeup(sk) <= 0) {
-+ if (tp->ops->write_wakeup(sk) <= 0) {
- icsk->icsk_probes_out++;
- elapsed = keepalive_intvl_when(tp);
- } else {
-@@ -642,7 +679,7 @@ death:
- tcp_done(sk);
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
-index 5667b3003af9..7139c2973fd2 100644
---- a/net/ipv6/addrconf.c
-+++ b/net/ipv6/addrconf.c
-@@ -760,6 +760,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
-
- kfree_rcu(ifp, rcu);
- }
-+EXPORT_SYMBOL(inet6_ifa_finish_destroy);
-
- static void
- ipv6_link_dev_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp)
-diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
-index 7cb4392690dd..7057afbca4df 100644
---- a/net/ipv6/af_inet6.c
-+++ b/net/ipv6/af_inet6.c
-@@ -97,8 +97,7 @@ static __inline__ struct ipv6_pinfo *inet6_sk_generic(struct sock *sk)
- return (struct ipv6_pinfo *)(((u8 *)sk) + offset);
- }
-
--static int inet6_create(struct net *net, struct socket *sock, int protocol,
-- int kern)
-+int inet6_create(struct net *net, struct socket *sock, int protocol, int kern)
- {
- struct inet_sock *inet;
- struct ipv6_pinfo *np;
-diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
-index a245e5ddffbd..99c892b8992d 100644
---- a/net/ipv6/inet6_connection_sock.c
-+++ b/net/ipv6/inet6_connection_sock.c
-@@ -96,8 +96,8 @@ struct dst_entry *inet6_csk_route_req(struct sock *sk,
- /*
- * request_sock (formerly open request) hash tables.
- */
--static u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-- const u32 rnd, const u32 synq_hsize)
-+u32 inet6_synq_hash(const struct in6_addr *raddr, const __be16 rport,
-+ const u32 rnd, const u32 synq_hsize)
- {
- u32 c;
-
-diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
-index edb58aff4ae7..ea4d9fda0927 100644
---- a/net/ipv6/ipv6_sockglue.c
-+++ b/net/ipv6/ipv6_sockglue.c
-@@ -48,6 +48,8 @@
- #include <net/addrconf.h>
- #include <net/inet_common.h>
- #include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
- #include <net/udp.h>
- #include <net/udplite.h>
- #include <net/xfrm.h>
-@@ -196,7 +198,12 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
- sock_prot_inuse_add(net, &tcp_prot, 1);
- local_bh_enable();
- sk->sk_prot = &tcp_prot;
-- icsk->icsk_af_ops = &ipv4_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v4_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv4_specific;
- sk->sk_socket->ops = &inet_stream_ops;
- sk->sk_family = PF_INET;
- tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
-diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
-index a822b880689b..b2b38869d795 100644
---- a/net/ipv6/syncookies.c
-+++ b/net/ipv6/syncookies.c
-@@ -181,13 +181,13 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
-
- /* check for timestamp cookie support */
- memset(&tcp_opt, 0, sizeof(tcp_opt));
-- tcp_parse_options(skb, &tcp_opt, 0, NULL);
-+ tcp_parse_options(skb, &tcp_opt, NULL, 0, NULL);
-
- if (!cookie_check_timestamp(&tcp_opt, sock_net(sk), &ecn_ok))
- goto out;
-
- ret = NULL;
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-+ req = inet_reqsk_alloc(&tcp6_request_sock_ops);
- if (!req)
- goto out;
-
-@@ -255,10 +255,10 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
- }
-
- req->window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
-- tcp_select_initial_window(tcp_full_space(sk), req->mss,
-- &req->rcv_wnd, &req->window_clamp,
-- ireq->wscale_ok, &rcv_wscale,
-- dst_metric(dst, RTAX_INITRWND));
-+ tp->ops->select_initial_window(tcp_full_space(sk), req->mss,
-+ &req->rcv_wnd, &req->window_clamp,
-+ ireq->wscale_ok, &rcv_wscale,
-+ dst_metric(dst, RTAX_INITRWND), sk);
-
- ireq->rcv_wscale = rcv_wscale;
-
-diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
-index 229239ad96b1..fda94d71666e 100644
---- a/net/ipv6/tcp_ipv6.c
-+++ b/net/ipv6/tcp_ipv6.c
-@@ -63,6 +63,8 @@
- #include <net/inet_common.h>
- #include <net/secure_seq.h>
- #include <net/tcp_memcontrol.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
- #include <net/busy_poll.h>
-
- #include <linux/proc_fs.h>
-@@ -71,12 +73,6 @@
- #include <linux/crypto.h>
- #include <linux/scatterlist.h>
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb);
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req);
--
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb);
--
- static const struct inet_connection_sock_af_ops ipv6_mapped;
- static const struct inet_connection_sock_af_ops ipv6_specific;
- #ifdef CONFIG_TCP_MD5SIG
-@@ -90,7 +86,7 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(struct sock *sk,
- }
- #endif
-
--static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
-+void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- {
- struct dst_entry *dst = skb_dst(skb);
- const struct rt6_info *rt = (const struct rt6_info *)dst;
-@@ -102,10 +98,11 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
- inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
- }
-
--static void tcp_v6_hash(struct sock *sk)
-+void tcp_v6_hash(struct sock *sk)
- {
- if (sk->sk_state != TCP_CLOSE) {
-- if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped) {
-+ if (inet_csk(sk)->icsk_af_ops == &ipv6_mapped ||
-+ inet_csk(sk)->icsk_af_ops == &mptcp_v6_mapped) {
- tcp_prot.hash(sk);
- return;
- }
-@@ -115,7 +112,7 @@ static void tcp_v6_hash(struct sock *sk)
- }
- }
-
--static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
-+__u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- {
- return secure_tcpv6_sequence_number(ipv6_hdr(skb)->daddr.s6_addr32,
- ipv6_hdr(skb)->saddr.s6_addr32,
-@@ -123,7 +120,7 @@ static __u32 tcp_v6_init_sequence(const struct sk_buff *skb)
- tcp_hdr(skb)->source);
- }
-
--static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-+int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- int addr_len)
- {
- struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
-@@ -215,7 +212,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
- sin.sin_port = usin->sin6_port;
- sin.sin_addr.s_addr = usin->sin6_addr.s6_addr32[3];
-
-- icsk->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_mapped;
- sk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -225,7 +227,12 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-
- if (err) {
- icsk->icsk_ext_hdr_len = exthdrlen;
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
- sk->sk_backlog_rcv = tcp_v6_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- tp->af_specific = &tcp_sock_ipv6_specific;
-@@ -337,7 +344,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- const struct ipv6hdr *hdr = (const struct ipv6hdr *)skb->data;
- const struct tcphdr *th = (struct tcphdr *)(skb->data+offset);
- struct ipv6_pinfo *np;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk;
- int err;
- struct tcp_sock *tp;
- struct request_sock *fastopen;
-@@ -358,8 +365,14 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- return;
- }
-
-- bh_lock_sock(sk);
-- if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
-+ tp = tcp_sk(sk);
-+ if (mptcp(tp))
-+ meta_sk = mptcp_meta_sk(sk);
-+ else
-+ meta_sk = sk;
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk) && type != ICMPV6_PKT_TOOBIG)
- NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
-
- if (sk->sk_state == TCP_CLOSE)
-@@ -370,7 +383,6 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- tp = tcp_sk(sk);
- seq = ntohl(th->seq);
- /* XXX (TFO) - tp->snd_una should be ISN (tcp_create_openreq_child() */
- fastopen = tp->fastopen_rsk;
-@@ -403,11 +415,15 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
-
- tp->mtu_info = ntohl(info);
-- if (!sock_owned_by_user(sk))
-+ if (!sock_owned_by_user(meta_sk))
- tcp_v6_mtu_reduced(sk);
-- else if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
-+ else {
-+ if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED,
- &tp->tsq_flags))
-- sock_hold(sk);
-+ sock_hold(sk);
-+ if (mptcp(tp))
-+ mptcp_tsq_flags(sk);
-+ }
- goto out;
- }
-
-@@ -417,7 +433,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- switch (sk->sk_state) {
- struct request_sock *req, **prev;
- case TCP_LISTEN:
-- if (sock_owned_by_user(sk))
-+ if (sock_owned_by_user(meta_sk))
- goto out;
-
- req = inet6_csk_search_req(sk, &prev, th->dest, &hdr->daddr,
-@@ -447,7 +463,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- if (fastopen && fastopen->sk == NULL)
- break;
-
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- sk->sk_err = err;
- sk->sk_error_report(sk); /* Wake people up to see the error (see connect in sock.c) */
-
-@@ -457,26 +473,27 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
- goto out;
- }
-
-- if (!sock_owned_by_user(sk) && np->recverr) {
-+ if (!sock_owned_by_user(meta_sk) && np->recverr) {
- sk->sk_err = err;
- sk->sk_error_report(sk);
- } else
- sk->sk_err_soft = err;
-
- out:
-- bh_unlock_sock(sk);
-+ bh_unlock_sock(meta_sk);
- sock_put(sk);
- }
-
-
--static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-- struct flowi6 *fl6,
-- struct request_sock *req,
-- u16 queue_mapping,
-- struct tcp_fastopen_cookie *foc)
-+int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
-+ struct flowi *fl,
-+ struct request_sock *req,
-+ u16 queue_mapping,
-+ struct tcp_fastopen_cookie *foc)
- {
- struct inet_request_sock *ireq = inet_rsk(req);
- struct ipv6_pinfo *np = inet6_sk(sk);
-+ struct flowi6 *fl6 = &fl->u.ip6;
- struct sk_buff *skb;
- int err = -ENOMEM;
-
-@@ -497,18 +514,21 @@ static int tcp_v6_send_synack(struct sock *sk, struct dst_entry *dst,
- skb_set_queue_mapping(skb, queue_mapping);
- err = ip6_xmit(sk, skb, fl6, np->opt, np->tclass);
- err = net_xmit_eval(err);
-+ if (!tcp_rsk(req)->snt_synack && !err)
-+ tcp_rsk(req)->snt_synack = tcp_time_stamp;
- }
-
- done:
- return err;
- }
-
--static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
-+int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- {
-- struct flowi6 fl6;
-+ const struct tcp_request_sock_ops *af_ops = tcp_rsk(req)->af_specific;
-+ struct flowi fl;
- int res;
-
-- res = tcp_v6_send_synack(sk, NULL, &fl6, req, 0, NULL);
-+ res = af_ops->send_synack(sk, NULL, &fl, req, 0, NULL);
- if (!res) {
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_RETRANSSEGS);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);
-@@ -516,7 +536,7 @@ static int tcp_v6_rtx_synack(struct sock *sk, struct request_sock *req)
- return res;
- }
-
--static void tcp_v6_reqsk_destructor(struct request_sock *req)
-+void tcp_v6_reqsk_destructor(struct request_sock *req)
- {
- kfree_skb(inet_rsk(req)->pktopts);
- }
-@@ -718,27 +738,74 @@ static int tcp_v6_inbound_md5_hash(struct sock *sk, const struct sk_buff *skb)
- }
- #endif
-
-+static int tcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct inet_request_sock *ireq = inet_rsk(req);
-+ struct ipv6_pinfo *np = inet6_sk(sk);
-+
-+ ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-+ ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-+
-+ ireq->ir_iif = sk->sk_bound_dev_if;
-+ ireq->ir_mark = inet_request_mark(sk, skb);
-+
-+ /* So that link locals have meaning */
-+ if (!sk->sk_bound_dev_if &&
-+ ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-+ ireq->ir_iif = inet6_iif(skb);
-+
-+ if (!TCP_SKB_CB(skb)->when &&
-+ (ipv6_opt_accepted(sk, skb) || np->rxopt.bits.rxinfo ||
-+ np->rxopt.bits.rxoinfo || np->rxopt.bits.rxhlim ||
-+ np->rxopt.bits.rxohlim || np->repflow)) {
-+ atomic_inc(&skb->users);
-+ ireq->pktopts = skb;
-+ }
-+
-+ return 0;
-+}
-+
-+static struct dst_entry *tcp_v6_route_req(struct sock *sk, struct flowi *fl,
-+ const struct request_sock *req,
-+ bool *strict)
-+{
-+ if (strict)
-+ *strict = true;
-+ return inet6_csk_route_req(sk, &fl->u.ip6, req);
-+}
-+
- struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
- .family = AF_INET6,
- .obj_size = sizeof(struct tcp6_request_sock),
-- .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .rtx_syn_ack = tcp_rtx_synack,
- .send_ack = tcp_v6_reqsk_send_ack,
- .destructor = tcp_v6_reqsk_destructor,
- .send_reset = tcp_v6_send_reset,
- .syn_ack_timeout = tcp_syn_ack_timeout,
- };
-
-+const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
-+ .mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) -
-+ sizeof(struct ipv6hdr),
- #ifdef CONFIG_TCP_MD5SIG
--static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
- .md5_lookup = tcp_v6_reqsk_md5_lookup,
- .calc_md5_hash = tcp_v6_md5_hash_skb,
--};
- #endif
-+ .init_req = tcp_v6_init_req,
-+#ifdef CONFIG_SYN_COOKIES
-+ .cookie_init_seq = cookie_v6_init_sequence,
-+#endif
-+ .route_req = tcp_v6_route_req,
-+ .init_seq = tcp_v6_init_sequence,
-+ .send_synack = tcp_v6_send_synack,
-+ .queue_hash_add = inet6_csk_reqsk_queue_hash_add,
-+};
-
--static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
-- u32 tsval, u32 tsecr, int oif,
-- struct tcp_md5sig_key *key, int rst, u8 tclass,
-- u32 label)
-+static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack,
-+ u32 data_ack, u32 win, u32 tsval, u32 tsecr,
-+ int oif, struct tcp_md5sig_key *key, int rst,
-+ u8 tclass, u32 label, int mptcp)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- struct tcphdr *t1;
-@@ -756,7 +823,10 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- if (key)
- tot_len += TCPOLEN_MD5SIG_ALIGNED;
- #endif
--
-+#ifdef CONFIG_MPTCP
-+ if (mptcp)
-+ tot_len += MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK;
-+#endif
- buff = alloc_skb(MAX_HEADER + sizeof(struct ipv6hdr) + tot_len,
- GFP_ATOMIC);
- if (buff == NULL)
-@@ -794,6 +864,17 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- tcp_v6_md5_hash_hdr((__u8 *)topt, key,
- &ipv6_hdr(skb)->saddr,
- &ipv6_hdr(skb)->daddr, t1);
-+ topt += 4;
-+ }
-+#endif
-+#ifdef CONFIG_MPTCP
-+ if (mptcp) {
-+ /* Construction of 32-bit data_ack */
-+ *topt++ = htonl((TCPOPT_MPTCP << 24) |
-+ ((MPTCP_SUB_LEN_DSS + MPTCP_SUB_LEN_ACK) << 16) |
-+ (0x20 << 8) |
-+ (0x01));
-+ *topt++ = htonl(data_ack);
- }
- #endif
-
-@@ -834,7 +915,7 @@ static void tcp_v6_send_response(struct sk_buff *skb, u32 seq, u32 ack, u32 win,
- kfree_skb(buff);
- }
-
--static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
-+void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- {
- const struct tcphdr *th = tcp_hdr(skb);
- u32 seq = 0, ack_seq = 0;
-@@ -891,7 +972,7 @@ static void tcp_v6_send_reset(struct sock *sk, struct sk_buff *skb)
- (th->doff << 2);
-
- oif = sk ? sk->sk_bound_dev_if : 0;
-- tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, oif, key, 1, 0, 0);
-+ tcp_v6_send_response(skb, seq, ack_seq, 0, 0, 0, 0, oif, key, 1, 0, 0, 0);
-
- #ifdef CONFIG_TCP_MD5SIG
- release_sk1:
-@@ -902,45 +983,52 @@ release_sk1:
- #endif
- }
-
--static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack,
-+static void tcp_v6_send_ack(struct sk_buff *skb, u32 seq, u32 ack, u32 data_ack,
- u32 win, u32 tsval, u32 tsecr, int oif,
- struct tcp_md5sig_key *key, u8 tclass,
-- u32 label)
-+ u32 label, int mptcp)
- {
-- tcp_v6_send_response(skb, seq, ack, win, tsval, tsecr, oif, key, 0, tclass,
-- label);
-+ tcp_v6_send_response(skb, seq, ack, data_ack, win, tsval, tsecr, oif,
-+ key, 0, tclass, label, mptcp);
- }
-
- static void tcp_v6_timewait_ack(struct sock *sk, struct sk_buff *skb)
- {
- struct inet_timewait_sock *tw = inet_twsk(sk);
- struct tcp_timewait_sock *tcptw = tcp_twsk(sk);
-+ u32 data_ack = 0;
-+ int mptcp = 0;
-
-+ if (tcptw->mptcp_tw && tcptw->mptcp_tw->meta_tw) {
-+ data_ack = (u32)tcptw->mptcp_tw->rcv_nxt;
-+ mptcp = 1;
-+ }
- tcp_v6_send_ack(skb, tcptw->tw_snd_nxt, tcptw->tw_rcv_nxt,
-+ data_ack,
- tcptw->tw_rcv_wnd >> tw->tw_rcv_wscale,
- tcp_time_stamp + tcptw->tw_ts_offset,
- tcptw->tw_ts_recent, tw->tw_bound_dev_if, tcp_twsk_md5_key(tcptw),
-- tw->tw_tclass, (tw->tw_flowlabel << 12));
-+ tw->tw_tclass, (tw->tw_flowlabel << 12), mptcp);
-
- inet_twsk_put(tw);
- }
-
--static void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req)
-+void tcp_v6_reqsk_send_ack(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req)
- {
- /* sk->sk_state == TCP_LISTEN -> for regular TCP_SYN_RECV
- * sk->sk_state == TCP_SYN_RECV -> for Fast Open.
- */
- tcp_v6_send_ack(skb, (sk->sk_state == TCP_LISTEN) ?
- tcp_rsk(req)->snt_isn + 1 : tcp_sk(sk)->snd_nxt,
-- tcp_rsk(req)->rcv_nxt,
-+ tcp_rsk(req)->rcv_nxt, 0,
- req->rcv_wnd, tcp_time_stamp, req->ts_recent, sk->sk_bound_dev_if,
- tcp_v6_md5_do_lookup(sk, &ipv6_hdr(skb)->daddr),
-- 0, 0);
-+ 0, 0, 0);
- }
-
-
--static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-+struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- {
- struct request_sock *req, **prev;
- const struct tcphdr *th = tcp_hdr(skb);
-@@ -959,7 +1047,13 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
-
- if (nsk) {
- if (nsk->sk_state != TCP_TIME_WAIT) {
-+ /* Don't lock again the meta-sk. It has been locked
-+ * before mptcp_v6_do_rcv.
-+ */
-+ if (mptcp(tcp_sk(nsk)) && !is_meta_sk(sk))
-+ bh_lock_sock(mptcp_meta_sk(nsk));
- bh_lock_sock(nsk);
-+
- return nsk;
- }
- inet_twsk_put(inet_twsk(nsk));
-@@ -973,161 +1067,25 @@ static struct sock *tcp_v6_hnd_req(struct sock *sk, struct sk_buff *skb)
- return sk;
- }
-
--/* FIXME: this is substantially similar to the ipv4 code.
-- * Can some kind of merge be done? -- erics
-- */
--static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
- {
-- struct tcp_options_received tmp_opt;
-- struct request_sock *req;
-- struct inet_request_sock *ireq;
-- struct ipv6_pinfo *np = inet6_sk(sk);
-- struct tcp_sock *tp = tcp_sk(sk);
-- __u32 isn = TCP_SKB_CB(skb)->when;
-- struct dst_entry *dst = NULL;
-- struct tcp_fastopen_cookie foc = { .len = -1 };
-- bool want_cookie = false, fastopen;
-- struct flowi6 fl6;
-- int err;
--
- if (skb->protocol == htons(ETH_P_IP))
- return tcp_v4_conn_request(sk, skb);
-
- if (!ipv6_unicast_destination(skb))
- goto drop;
-
-- if ((sysctl_tcp_syncookies == 2 ||
-- inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-- want_cookie = tcp_syn_flood_action(sk, skb, "TCPv6");
-- if (!want_cookie)
-- goto drop;
-- }
--
-- if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
-- goto drop;
-- }
--
-- req = inet6_reqsk_alloc(&tcp6_request_sock_ops);
-- if (req == NULL)
-- goto drop;
--
--#ifdef CONFIG_TCP_MD5SIG
-- tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
--#endif
--
-- tcp_clear_options(&tmp_opt);
-- tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
-- tmp_opt.user_mss = tp->rx_opt.user_mss;
-- tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
--
-- if (want_cookie && !tmp_opt.saw_tstamp)
-- tcp_clear_options(&tmp_opt);
-+ return tcp_conn_request(&tcp6_request_sock_ops,
-+ &tcp_request_sock_ipv6_ops, sk, skb);
-
-- tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
-- tcp_openreq_init(req, &tmp_opt, skb);
--
-- ireq = inet_rsk(req);
-- ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
-- ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
-- if (!want_cookie || tmp_opt.tstamp_ok)
-- TCP_ECN_create_request(req, skb, sock_net(sk));
--
-- ireq->ir_iif = sk->sk_bound_dev_if;
-- ireq->ir_mark = inet_request_mark(sk, skb);
--
-- /* So that link locals have meaning */
-- if (!sk->sk_bound_dev_if &&
-- ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
-- ireq->ir_iif = inet6_iif(skb);
--
-- if (!isn) {
-- if (ipv6_opt_accepted(sk, skb) ||
-- np->rxopt.bits.rxinfo || np->rxopt.bits.rxoinfo ||
-- np->rxopt.bits.rxhlim || np->rxopt.bits.rxohlim ||
-- np->repflow) {
-- atomic_inc(&skb->users);
-- ireq->pktopts = skb;
-- }
--
-- if (want_cookie) {
-- isn = cookie_v6_init_sequence(sk, skb, &req->mss);
-- req->cookie_ts = tmp_opt.tstamp_ok;
-- goto have_isn;
-- }
--
-- /* VJ's idea. We save last timestamp seen
-- * from the destination in peer table, when entering
-- * state TIME-WAIT, and check against it before
-- * accepting new connection request.
-- *
-- * If "isn" is not zero, this request hit alive
-- * timewait bucket, so that all the necessary checks
-- * are made in the function processing timewait state.
-- */
-- if (tmp_opt.saw_tstamp &&
-- tcp_death_row.sysctl_tw_recycle &&
-- (dst = inet6_csk_route_req(sk, &fl6, req)) != NULL) {
-- if (!tcp_peer_is_proven(req, dst, true)) {
-- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
-- goto drop_and_release;
-- }
-- }
-- /* Kill the following clause, if you dislike this way. */
-- else if (!sysctl_tcp_syncookies &&
-- (sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
-- (sysctl_max_syn_backlog >> 2)) &&
-- !tcp_peer_is_proven(req, dst, false)) {
-- /* Without syncookies last quarter of
-- * backlog is filled with destinations,
-- * proven to be alive.
-- * It means that we continue to communicate
-- * to destinations, already remembered
-- * to the moment of synflood.
-- */
-- LIMIT_NETDEBUG(KERN_DEBUG "TCP: drop open request from %pI6/%u\n",
-- &ireq->ir_v6_rmt_addr, ntohs(tcp_hdr(skb)->source));
-- goto drop_and_release;
-- }
--
-- isn = tcp_v6_init_sequence(skb);
-- }
--have_isn:
--
-- if (security_inet_conn_request(sk, skb, req))
-- goto drop_and_release;
--
-- if (!dst && (dst = inet6_csk_route_req(sk, &fl6, req)) == NULL)
-- goto drop_and_free;
--
-- tcp_rsk(req)->snt_isn = isn;
-- tcp_rsk(req)->snt_synack = tcp_time_stamp;
-- tcp_openreq_init_rwin(req, sk, dst);
-- fastopen = !want_cookie &&
-- tcp_try_fastopen(sk, skb, req, &foc, dst);
-- err = tcp_v6_send_synack(sk, dst, &fl6, req,
-- skb_get_queue_mapping(skb), &foc);
-- if (!fastopen) {
-- if (err || want_cookie)
-- goto drop_and_free;
--
-- tcp_rsk(req)->listener = NULL;
-- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
-- }
-- return 0;
--
--drop_and_release:
-- dst_release(dst);
--drop_and_free:
-- reqsk_free(req);
- drop:
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
- return 0; /* don't send reset */
- }
-
--static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-- struct request_sock *req,
-- struct dst_entry *dst)
-+struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-+ struct request_sock *req,
-+ struct dst_entry *dst)
- {
- struct inet_request_sock *ireq;
- struct ipv6_pinfo *newnp, *np = inet6_sk(sk);
-@@ -1165,7 +1123,12 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
-
- newsk->sk_v6_rcv_saddr = newnp->saddr;
-
-- inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(newsk))
-+ inet_csk(newsk)->icsk_af_ops = &mptcp_v6_mapped;
-+ else
-+#endif
-+ inet_csk(newsk)->icsk_af_ops = &ipv6_mapped;
- newsk->sk_backlog_rcv = tcp_v4_do_rcv;
- #ifdef CONFIG_TCP_MD5SIG
- newtp->af_specific = &tcp_sock_ipv6_mapped_specific;
-@@ -1329,7 +1292,7 @@ out:
- * This is because we cannot sleep with the original spinlock
- * held.
- */
--static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
-+int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- {
- struct ipv6_pinfo *np = inet6_sk(sk);
- struct tcp_sock *tp;
-@@ -1351,6 +1314,9 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
- goto discard;
- #endif
-
-+ if (is_meta_sk(sk))
-+ return mptcp_v6_do_rcv(sk, skb);
-+
- if (sk_filter(sk, skb))
- goto discard;
-
-@@ -1472,7 +1438,7 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- {
- const struct tcphdr *th;
- const struct ipv6hdr *hdr;
-- struct sock *sk;
-+ struct sock *sk, *meta_sk = NULL;
- int ret;
- struct net *net = dev_net(skb->dev);
-
-@@ -1503,18 +1469,43 @@ static int tcp_v6_rcv(struct sk_buff *skb)
- TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
- skb->len - th->doff*4);
- TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
-+#ifdef CONFIG_MPTCP
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+ TCP_SKB_CB(skb)->dss_off = 0;
-+#endif
- TCP_SKB_CB(skb)->when = 0;
- TCP_SKB_CB(skb)->ip_dsfield = ipv6_get_dsfield(hdr);
- TCP_SKB_CB(skb)->sacked = 0;
-
- sk = __inet6_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
-- if (!sk)
-- goto no_tcp_socket;
-
- process:
-- if (sk->sk_state == TCP_TIME_WAIT)
-+ if (sk && sk->sk_state == TCP_TIME_WAIT)
- goto do_time_wait;
-
-+#ifdef CONFIG_MPTCP
-+ if (!sk && th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, NULL);
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+
-+ /* Is there a pending request sock for this segment ? */
-+ if ((!sk || sk->sk_state == TCP_LISTEN) && mptcp_check_req(skb, net)) {
-+ if (sk)
-+ sock_put(sk);
-+ return 0;
-+ }
-+#endif
-+
-+ if (!sk)
-+ goto no_tcp_socket;
-+
- if (hdr->hop_limit < inet6_sk(sk)->min_hopcount) {
- NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
- goto discard_and_relse;
-@@ -1529,11 +1520,21 @@ process:
- sk_mark_napi_id(sk, skb);
- skb->dev = NULL;
-
-- bh_lock_sock_nested(sk);
-+ if (mptcp(tcp_sk(sk))) {
-+ meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk))
-+ skb->sk = sk;
-+ } else {
-+ meta_sk = sk;
-+ bh_lock_sock_nested(sk);
-+ }
-+
- ret = 0;
-- if (!sock_owned_by_user(sk)) {
-+ if (!sock_owned_by_user(meta_sk)) {
- #ifdef CONFIG_NET_DMA
-- struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp = tcp_sk(meta_sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = net_dma_find_channel();
- if (tp->ucopy.dma_chan)
-@@ -1541,16 +1542,17 @@ process:
- else
- #endif
- {
-- if (!tcp_prequeue(sk, skb))
-+ if (!tcp_prequeue(meta_sk, skb))
- ret = tcp_v6_do_rcv(sk, skb);
- }
-- } else if (unlikely(sk_add_backlog(sk, skb,
-- sk->sk_rcvbuf + sk->sk_sndbuf))) {
-- bh_unlock_sock(sk);
-+ } else if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
- NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
- goto discard_and_relse;
- }
-- bh_unlock_sock(sk);
-+
-+ bh_unlock_sock(meta_sk);
-
- sock_put(sk);
- return ret ? -1 : 0;
-@@ -1607,6 +1609,18 @@ do_time_wait:
- sk = sk2;
- goto process;
- }
-+#ifdef CONFIG_MPTCP
-+ if (th->syn && !th->ack) {
-+ int ret = mptcp_lookup_join(skb, inet_twsk(sk));
-+
-+ if (ret < 0) {
-+ tcp_v6_send_reset(NULL, skb);
-+ goto discard_it;
-+ } else if (ret > 0) {
-+ return 0;
-+ }
-+ }
-+#endif
- /* Fall through to ACK */
- }
- case TCP_TW_ACK:
-@@ -1657,7 +1671,7 @@ static void tcp_v6_early_demux(struct sk_buff *skb)
- }
- }
-
--static struct timewait_sock_ops tcp6_timewait_sock_ops = {
-+struct timewait_sock_ops tcp6_timewait_sock_ops = {
- .twsk_obj_size = sizeof(struct tcp6_timewait_sock),
- .twsk_unique = tcp_twsk_unique,
- .twsk_destructor = tcp_twsk_destructor,
-@@ -1730,7 +1744,12 @@ static int tcp_v6_init_sock(struct sock *sk)
-
- tcp_init_sock(sk);
-
-- icsk->icsk_af_ops = &ipv6_specific;
-+#ifdef CONFIG_MPTCP
-+ if (is_mptcp_enabled(sk))
-+ icsk->icsk_af_ops = &mptcp_v6_specific;
-+ else
-+#endif
-+ icsk->icsk_af_ops = &ipv6_specific;
-
- #ifdef CONFIG_TCP_MD5SIG
- tcp_sk(sk)->af_specific = &tcp_sock_ipv6_specific;
-@@ -1739,7 +1758,7 @@ static int tcp_v6_init_sock(struct sock *sk)
- return 0;
- }
-
--static void tcp_v6_destroy_sock(struct sock *sk)
-+void tcp_v6_destroy_sock(struct sock *sk)
- {
- tcp_v4_destroy_sock(sk);
- inet6_destroy_sock(sk);
-@@ -1924,12 +1943,28 @@ void tcp6_proc_exit(struct net *net)
- static void tcp_v6_clear_sk(struct sock *sk, int size)
- {
- struct inet_sock *inet = inet_sk(sk);
-+#ifdef CONFIG_MPTCP
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ /* size_tk_table goes from the end of tk_table to the end of sk */
-+ int size_tk_table = size - offsetof(struct tcp_sock, tk_table) -
-+ sizeof(tp->tk_table);
-+#endif
-
- /* we do not want to clear pinet6 field, because of RCU lookups */
- sk_prot_clear_nulls(sk, offsetof(struct inet_sock, pinet6));
-
- size -= offsetof(struct inet_sock, pinet6) + sizeof(inet->pinet6);
-+
-+#ifdef CONFIG_MPTCP
-+ /* We zero out only from pinet6 to tk_table */
-+ size -= size_tk_table + sizeof(tp->tk_table);
-+#endif
- memset(&inet->pinet6 + 1, 0, size);
-+
-+#ifdef CONFIG_MPTCP
-+ memset((char *)&tp->tk_table + sizeof(tp->tk_table), 0, size_tk_table);
-+#endif
-+
- }
-
- struct proto tcpv6_prot = {
-diff --git a/net/mptcp/Kconfig b/net/mptcp/Kconfig
-new file mode 100644
-index 000000000000..cdfc03adabf8
---- /dev/null
-+++ b/net/mptcp/Kconfig
-@@ -0,0 +1,115 @@
-+#
-+# MPTCP configuration
-+#
-+config MPTCP
-+ bool "MPTCP protocol"
-+ depends on (IPV6=y || IPV6=n)
-+ ---help---
-+ This replaces the normal TCP stack with a Multipath TCP stack,
-+ able to use several paths at once.
-+
-+menuconfig MPTCP_PM_ADVANCED
-+ bool "MPTCP: advanced path-manager control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different path-managers. You should choose 'Y' here,
-+ because otherwise you will not actively create new MPTCP-subflows.
-+
-+if MPTCP_PM_ADVANCED
-+
-+config MPTCP_FULLMESH
-+ tristate "MPTCP Full-Mesh Path-Manager"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create a full-mesh among all IP-addresses.
-+
-+config MPTCP_NDIFFPORTS
-+ tristate "MPTCP ndiff-ports"
-+ depends on MPTCP=y
-+ ---help---
-+ This path-management module will create multiple subflows between the same
-+ pair of IP-addresses, modifying the source-port. You can set the number
-+ of subflows via the mptcp_ndiffports-sysctl.
-+
-+config MPTCP_BINDER
-+ tristate "MPTCP Binder"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This path-management module works like ndiffports, and adds the sysctl
-+ option to set the gateway (and/or path to) per each additional subflow
-+ via Loose Source Routing (IPv4 only).
-+
-+choice
-+ prompt "Default MPTCP Path-Manager"
-+ default DEFAULT
-+ help
-+ Select the Path-Manager of your choice
-+
-+ config DEFAULT_FULLMESH
-+ bool "Full mesh" if MPTCP_FULLMESH=y
-+
-+ config DEFAULT_NDIFFPORTS
-+ bool "ndiff-ports" if MPTCP_NDIFFPORTS=y
-+
-+ config DEFAULT_BINDER
-+ bool "binder" if MPTCP_BINDER=y
-+
-+ config DEFAULT_DUMMY
-+ bool "Default"
-+
-+endchoice
-+
-+endif
-+
-+config DEFAULT_MPTCP_PM
-+ string
-+ default "default" if DEFAULT_DUMMY
-+ default "fullmesh" if DEFAULT_FULLMESH
-+ default "ndiffports" if DEFAULT_NDIFFPORTS
-+ default "binder" if DEFAULT_BINDER
-+ default "default"
-+
-+menuconfig MPTCP_SCHED_ADVANCED
-+ bool "MPTCP: advanced scheduler control"
-+ depends on MPTCP=y
-+ ---help---
-+ Support for selection of different schedulers. You should choose 'Y' here,
-+ if you want to choose a different scheduler than the default one.
-+
-+if MPTCP_SCHED_ADVANCED
-+
-+config MPTCP_ROUNDROBIN
-+ tristate "MPTCP Round-Robin"
-+ depends on (MPTCP=y)
-+ ---help---
-+ This is a very simple round-robin scheduler. Probably has bad performance
-+ but might be interesting for researchers.
-+
-+choice
-+ prompt "Default MPTCP Scheduler"
-+ default DEFAULT
-+ help
-+ Select the Scheduler of your choice
-+
-+ config DEFAULT_SCHEDULER
-+ bool "Default"
-+ ---help---
-+ This is the default scheduler, sending first on the subflow
-+ with the lowest RTT.
-+
-+ config DEFAULT_ROUNDROBIN
-+ bool "Round-Robin" if MPTCP_ROUNDROBIN=y
-+ ---help---
-+ This is the round-robin scheduler, sending in a round-robin
-+ fashion.
-+
-+endchoice
-+endif
-+
-+config DEFAULT_MPTCP_SCHED
-+ string
-+ depends on (MPTCP=y)
-+ default "default" if DEFAULT_SCHEDULER
-+ default "roundrobin" if DEFAULT_ROUNDROBIN
-+ default "default"
-+
-diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
-new file mode 100644
-index 000000000000..35561a7012e3
---- /dev/null
-+++ b/net/mptcp/Makefile
-@@ -0,0 +1,20 @@
-+#
-+## Makefile for MultiPath TCP support code.
-+#
-+#
-+
-+obj-$(CONFIG_MPTCP) += mptcp.o
-+
-+mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
-+ mptcp_output.o mptcp_input.o mptcp_sched.o
-+
-+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
-+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
-+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
-+obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
-+obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
-+obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
-+obj-$(CONFIG_MPTCP_ROUNDROBIN) += mptcp_rr.o
-+
-+mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
-+
-diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
-new file mode 100644
-index 000000000000..95d8da560715
---- /dev/null
-+++ b/net/mptcp/mptcp_binder.c
-@@ -0,0 +1,487 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#include <linux/route.h>
-+#include <linux/inet.h>
-+#include <linux/mroute.h>
-+#include <linux/spinlock_types.h>
-+#include <net/inet_ecn.h>
-+#include <net/route.h>
-+#include <net/xfrm.h>
-+#include <net/compat.h>
-+#include <linux/slab.h>
-+
-+#define MPTCP_GW_MAX_LISTS 10
-+#define MPTCP_GW_LIST_MAX_LEN 6
-+#define MPTCP_GW_SYSCTL_MAX_LEN (15 * MPTCP_GW_LIST_MAX_LEN * \
-+ MPTCP_GW_MAX_LISTS)
-+
-+struct mptcp_gw_list {
-+ struct in_addr list[MPTCP_GW_MAX_LISTS][MPTCP_GW_LIST_MAX_LEN];
-+ u8 len[MPTCP_GW_MAX_LISTS];
-+};
-+
-+struct binder_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+
-+ /* Prevent multiple sub-sockets concurrently iterating over sockets */
-+ spinlock_t *flow_lock;
-+};
-+
-+static struct mptcp_gw_list *mptcp_gws;
-+static rwlock_t mptcp_gws_lock;
-+
-+static int mptcp_binder_ndiffports __read_mostly = 1;
-+
-+static char sysctl_mptcp_binder_gateways[MPTCP_GW_SYSCTL_MAX_LEN] __read_mostly;
-+
-+static int mptcp_get_avail_list_ipv4(struct sock *sk)
-+{
-+ int i, j, list_taken, opt_ret, opt_len;
-+ unsigned char *opt_ptr, *opt_end_ptr, opt[MAX_IPOPTLEN];
-+
-+ for (i = 0; i < MPTCP_GW_MAX_LISTS; ++i) {
-+ if (mptcp_gws->len[i] == 0)
-+ goto error;
-+
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List %i\n", i);
-+ list_taken = 0;
-+
-+ /* Loop through all sub-sockets in this connection */
-+ mptcp_for_each_sk(tcp_sk(sk)->mpcb, sk) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: Next sock\n");
-+
-+ /* Reset length and options buffer, then retrieve
-+ * from socket
-+ */
-+ opt_len = MAX_IPOPTLEN;
-+ memset(opt, 0, MAX_IPOPTLEN);
-+ opt_ret = ip_getsockopt(sk, IPPROTO_IP,
-+ IP_OPTIONS, opt, &opt_len);
-+ if (opt_ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsocket getsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, opt_ret);
-+ goto error;
-+ }
-+
-+ /* If socket has no options, it has no stake in this list */
-+ if (opt_len <= 0)
-+ continue;
-+
-+ /* Iterate options buffer */
-+ for (opt_ptr = &opt[0]; opt_ptr < &opt[opt_len]; opt_ptr++) {
-+ if (*opt_ptr == IPOPT_LSRR) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: LSRR options found\n");
-+ goto sock_lsrr;
-+ }
-+ }
-+ continue;
-+
-+sock_lsrr:
-+ /* Pointer to the 2nd to last address */
-+ opt_end_ptr = opt_ptr+(*(opt_ptr+1))-4;
-+
-+ /* Addresses start 3 bytes after type offset */
-+ opt_ptr += 3;
-+ j = 0;
-+
-+ /* Different length lists cannot be the same */
-+ if ((opt_end_ptr-opt_ptr)/4 != mptcp_gws->len[i])
-+ continue;
-+
-+ /* Iterate if we are still inside options list
-+ * and sysctl list
-+ */
-+ while (opt_ptr < opt_end_ptr && j < mptcp_gws->len[i]) {
-+ /* If there is a different address, this list must
-+ * not be set on this socket
-+ */
-+ if (memcmp(&mptcp_gws->list[i][j], opt_ptr, 4))
-+ break;
-+
-+ /* Jump 4 bytes to next address */
-+ opt_ptr += 4;
-+ j++;
-+ }
-+
-+ /* Reached the end without a differing address, lists
-+ * are therefore identical.
-+ */
-+ if (j == mptcp_gws->len[i]) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List already used\n");
-+ list_taken = 1;
-+ break;
-+ }
-+ }
-+
-+ /* Free list found if not taken by a socket */
-+ if (!list_taken) {
-+ mptcp_debug("mptcp_get_avail_list_ipv4: List free\n");
-+ break;
-+ }
-+ }
-+
-+ if (i >= MPTCP_GW_MAX_LISTS)
-+ goto error;
-+
-+ return i;
-+error:
-+ return -1;
-+}
-+
-+/* The list of addresses is parsed each time a new connection is opened,
-+ * to make sure it's up to date. In case of error, all the lists are
-+ * marked as unavailable and the subflow's fingerprint is set to 0.
-+ */
-+static void mptcp_v4_add_lsrr(struct sock *sk, struct in_addr addr)
-+{
-+ int i, j, ret;
-+ unsigned char opt[MAX_IPOPTLEN] = {0};
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct binder_priv *fmp = (struct binder_priv *)&tp->mpcb->mptcp_pm[0];
-+
-+ /* Read lock: multiple sockets can read LSRR addresses at the same
-+ * time, but writes are done in mutual exclusion.
-+ * Spin lock: must search for free list for one socket at a time, or
-+ * multiple sockets could take the same list.
-+ */
-+ read_lock(&mptcp_gws_lock);
-+ spin_lock(fmp->flow_lock);
-+
-+ i = mptcp_get_avail_list_ipv4(sk);
-+
-+ /* Execution enters here only if a free path is found.
-+ */
-+ if (i >= 0) {
-+ opt[0] = IPOPT_NOP;
-+ opt[1] = IPOPT_LSRR;
-+ opt[2] = sizeof(mptcp_gws->list[i][0].s_addr) *
-+ (mptcp_gws->len[i] + 1) + 3;
-+ opt[3] = IPOPT_MINOFF;
-+ for (j = 0; j < mptcp_gws->len[i]; ++j)
-+ memcpy(opt + 4 +
-+ (j * sizeof(mptcp_gws->list[i][0].s_addr)),
-+ &mptcp_gws->list[i][j].s_addr,
-+ sizeof(mptcp_gws->list[i][0].s_addr));
-+ /* Final destination must be part of IP_OPTIONS parameter. */
-+ memcpy(opt + 4 + (j * sizeof(addr.s_addr)), &addr.s_addr,
-+ sizeof(addr.s_addr));
-+
-+ /* setsockopt must be inside the lock, otherwise another
-+ * subflow could fail to see that we have taken a list.
-+ */
-+ ret = ip_setsockopt(sk, IPPROTO_IP, IP_OPTIONS, opt,
-+ 4 + sizeof(mptcp_gws->list[i][0].s_addr)
-+ * (mptcp_gws->len[i] + 1));
-+
-+ if (ret < 0) {
-+ mptcp_debug(KERN_ERR "%s: MPTCP subsock setsockopt() IP_OPTIONS failed, error %d\n",
-+ __func__, ret);
-+ }
-+ }
-+
-+ spin_unlock(fmp->flow_lock);
-+ read_unlock(&mptcp_gws_lock);
-+
-+ return;
-+}
-+
-+/* Parses gateways string for a list of paths to different
-+ * gateways, and stores them for use with the Loose Source Routing (LSRR)
-+ * socket option. Each list must have "," separated addresses, and the lists
-+ * themselves must be separated by "-". Returns -1 in case one or more of the
-+ * addresses is not a valid ipv4/6 address.
-+ */
-+static int mptcp_parse_gateway_ipv4(char *gateways)
-+{
-+ int i, j, k, ret;
-+ char *tmp_string = NULL;
-+ struct in_addr tmp_addr;
-+
-+ tmp_string = kzalloc(16, GFP_KERNEL);
-+ if (tmp_string == NULL)
-+ return -ENOMEM;
-+
-+ write_lock(&mptcp_gws_lock);
-+
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+
-+ /* A TMP string is used since inet_pton needs a null terminated string
-+ * but we do not want to modify the sysctl for obvious reasons.
-+ * i will iterate over the SYSCTL string, j will iterate over the
-+ * temporary string where each IP is copied into, k will iterate over
-+ * the IPs in each list.
-+ */
-+ for (i = j = k = 0;
-+ i < MPTCP_GW_SYSCTL_MAX_LEN && k < MPTCP_GW_MAX_LISTS;
-+ ++i) {
-+ if (gateways[i] == '-' || gateways[i] == ',' || gateways[i] == '\0') {
-+ /* If the temp IP is empty and the current list is
-+ * empty, we are done.
-+ */
-+ if (j == 0 && mptcp_gws->len[k] == 0)
-+ break;
-+
-+ /* Terminate the temp IP string, then if it is
-+ * non-empty parse the IP and copy it.
-+ */
-+ tmp_string[j] = '\0';
-+ if (j > 0) {
-+ mptcp_debug("mptcp_parse_gateway_list tmp: %s i: %d\n", tmp_string, i);
-+
-+ ret = in4_pton(tmp_string, strlen(tmp_string),
-+ (u8 *)&tmp_addr.s_addr, '\0',
-+ NULL);
-+
-+ if (ret) {
-+ mptcp_debug("mptcp_parse_gateway_list ret: %d s_addr: %pI4\n",
-+ ret,
-+ &tmp_addr.s_addr);
-+ memcpy(&mptcp_gws->list[k][mptcp_gws->len[k]].s_addr,
-+ &tmp_addr.s_addr,
-+ sizeof(tmp_addr.s_addr));
-+ mptcp_gws->len[k]++;
-+ j = 0;
-+ tmp_string[j] = '\0';
-+ /* Since we can't impose a limit to
-+ * what the user can input, make sure
-+ * there are not too many IPs in the
-+ * SYSCTL string.
-+ */
-+ if (mptcp_gws->len[k] > MPTCP_GW_LIST_MAX_LEN) {
-+ mptcp_debug("mptcp_parse_gateway_list too many members in list %i: max %i\n",
-+ k,
-+ MPTCP_GW_LIST_MAX_LEN);
-+ goto error;
-+ }
-+ } else {
-+ goto error;
-+ }
-+ }
-+
-+ if (gateways[i] == '-' || gateways[i] == '\0')
-+ ++k;
-+ } else {
-+ tmp_string[j] = gateways[i];
-+ ++j;
-+ }
-+ }
-+
-+ /* Number of flows is number of gateway lists plus master flow */
-+ mptcp_binder_ndiffports = k+1;
-+
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+
-+ return 0;
-+
-+error:
-+ memset(mptcp_gws, 0, sizeof(struct mptcp_gw_list));
-+ memset(gateways, 0, sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN);
-+ write_unlock(&mptcp_gws_lock);
-+ kfree(tmp_string);
-+ return -1;
-+}
-+
-+/**
-+ * Create all new subflows, by doing calls to mptcp_initX_subsockets
-+ *
-+ * This function uses a goto next_subflow, to allow releasing the lock between
-+ * new subflows and giving other processes a chance to do some work on the
-+ * socket and potentially finishing the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct binder_priv *pm_priv = container_of(work,
-+ struct binder_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (mptcp_binder_ndiffports > iter &&
-+ mptcp_binder_ndiffports > mpcb->cnt_subflows) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void binder_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *fmp = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+ static DEFINE_SPINLOCK(flow_lock);
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(meta_sk)) {
-+ mptcp_fallback_default(mpcb);
-+ return;
-+ }
-+#endif
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ fmp->flow_lock = &flow_lock;
-+}
-+
-+static void binder_create_subflows(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct binder_priv *pm_priv = (struct binder_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int binder_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+/* Callback functions, executed when sysctl mptcp.mptcp_gateways is updated.
-+ * Inspired from proc_tcp_congestion_control().
-+ */
-+static int proc_mptcp_gateways(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ int ret;
-+ ctl_table tbl = {
-+ .maxlen = MPTCP_GW_SYSCTL_MAX_LEN,
-+ };
-+
-+ if (write) {
-+ tbl.data = kzalloc(MPTCP_GW_SYSCTL_MAX_LEN, GFP_KERNEL);
-+ if (tbl.data == NULL)
-+ return -1;
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (ret == 0) {
-+ ret = mptcp_parse_gateway_ipv4(tbl.data);
-+ memcpy(ctl->data, tbl.data, MPTCP_GW_SYSCTL_MAX_LEN);
-+ }
-+ kfree(tbl.data);
-+ } else {
-+ ret = proc_dostring(ctl, write, buffer, lenp, ppos);
-+ }
-+
-+
-+ return ret;
-+}
-+
-+static struct mptcp_pm_ops binder __read_mostly = {
-+ .new_session = binder_new_session,
-+ .fully_established = binder_create_subflows,
-+ .get_local_id = binder_get_local_id,
-+ .init_subsocket_v4 = mptcp_v4_add_lsrr,
-+ .name = "binder",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct ctl_table binder_table[] = {
-+ {
-+ .procname = "mptcp_binder_gateways",
-+ .data = &sysctl_mptcp_binder_gateways,
-+ .maxlen = sizeof(char) * MPTCP_GW_SYSCTL_MAX_LEN,
-+ .mode = 0644,
-+ .proc_handler = &proc_mptcp_gateways
-+ },
-+ { }
-+};
-+
-+struct ctl_table_header *mptcp_sysctl_binder;
-+
-+/* General initialization of MPTCP_PM */
-+static int __init binder_register(void)
-+{
-+ mptcp_gws = kzalloc(sizeof(*mptcp_gws), GFP_KERNEL);
-+ if (!mptcp_gws)
-+ return -ENOMEM;
-+
-+ rwlock_init(&mptcp_gws_lock);
-+
-+ BUILD_BUG_ON(sizeof(struct binder_priv) > MPTCP_PM_SIZE);
-+
-+ mptcp_sysctl_binder = register_net_sysctl(&init_net, "net/mptcp",
-+ binder_table);
-+ if (!mptcp_sysctl_binder)
-+ goto sysctl_fail;
-+
-+ if (mptcp_register_path_manager(&binder))
-+ goto pm_failed;
-+
-+ return 0;
-+
-+pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+sysctl_fail:
-+ kfree(mptcp_gws);
-+
-+ return -1;
-+}
-+
-+static void binder_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&binder);
-+ unregister_net_sysctl_table(mptcp_sysctl_binder);
-+ kfree(mptcp_gws);
-+}
-+
-+module_init(binder_register);
-+module_exit(binder_unregister);
-+
-+MODULE_AUTHOR("Luca Boccassi, Duncan Eastoe, Christoph Paasch (ndiffports)");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("BINDER MPTCP");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_coupled.c b/net/mptcp/mptcp_coupled.c
-new file mode 100644
-index 000000000000..5d761164eb85
---- /dev/null
-+++ b/net/mptcp/mptcp_coupled.c
-@@ -0,0 +1,270 @@
-+/*
-+ * MPTCP implementation - Linked Increase congestion control Algorithm (LIA)
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+/* Scaling is done in the numerator with alpha_scale_num and in the denominator
-+ * with alpha_scale_den.
-+ *
-+ * To downscale, we just need to use alpha_scale.
-+ *
-+ * We have: alpha_scale = alpha_scale_num / (alpha_scale_den ^ 2)
-+ */
-+static int alpha_scale_den = 10;
-+static int alpha_scale_num = 32;
-+static int alpha_scale = 12;
-+
-+struct mptcp_ccc {
-+ u64 alpha;
-+ bool forced_update;
-+};
-+
-+static inline int mptcp_ccc_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_get_alpha(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha;
-+}
-+
-+static inline void mptcp_set_alpha(const struct sock *meta_sk, u64 alpha)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->alpha = alpha;
-+}
-+
-+static inline u64 mptcp_ccc_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static inline bool mptcp_get_forced(const struct sock *meta_sk)
-+{
-+ return ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update;
-+}
-+
-+static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
-+{
-+ ((struct mptcp_ccc *)inet_csk_ca(meta_sk))->forced_update = force;
-+}
-+
-+static void mptcp_ccc_recalc_alpha(const struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ const struct sock *sub_sk;
-+ int best_cwnd = 0, best_rtt = 0, can_send = 0;
-+ u64 max_numerator = 0, sum_denominator = 0, alpha = 1;
-+
-+ if (!mpcb)
-+ return;
-+
-+ /* Only one subflow left - fall back to normal reno-behavior
-+ * (set alpha to 1)
-+ */
-+ if (mpcb->cnt_established <= 1)
-+ goto exit;
-+
-+ /* Do regular alpha-calculation for multiple subflows */
-+
-+ /* Find the max numerator of the alpha-calculation */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ u64 tmp;
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ can_send++;
-+
-+ /* We need to look for the path that provides the max value.
-+ * Integer overflow is not possible here, because
-+ * tmp is a u64.
-+ */
-+ tmp = div64_u64(mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_num), (u64)sub_tp->srtt_us * sub_tp->srtt_us);
-+
-+ if (tmp >= max_numerator) {
-+ max_numerator = tmp;
-+ best_cwnd = sub_tp->snd_cwnd;
-+ best_rtt = sub_tp->srtt_us;
-+ }
-+ }
-+
-+ /* No subflow is able to send - we don't care anymore */
-+ if (unlikely(!can_send))
-+ goto exit;
-+
-+ /* Calculate the denominator */
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+
-+ if (!mptcp_ccc_sk_can_send(sub_sk))
-+ continue;
-+
-+ sum_denominator += div_u64(
-+ mptcp_ccc_scale(sub_tp->snd_cwnd,
-+ alpha_scale_den) * best_rtt,
-+ sub_tp->srtt_us);
-+ }
-+ sum_denominator *= sum_denominator;
-+ if (unlikely(!sum_denominator)) {
-+ pr_err("%s: sum_denominator == 0, cnt_established:%d\n",
-+ __func__, mpcb->cnt_established);
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
-+ pr_err("%s: pi:%d, state:%d, rtt:%u, cwnd: %u\n",
-+ __func__, sub_tp->mptcp->path_index,
-+ sub_sk->sk_state, sub_tp->srtt_us,
-+ sub_tp->snd_cwnd);
-+ }
-+ }
-+
-+ alpha = div64_u64(mptcp_ccc_scale(best_cwnd, alpha_scale_num), sum_denominator);
-+
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+exit:
-+ mptcp_set_alpha(mptcp_meta_sk(sk), alpha);
-+}
-+
-+static void mptcp_ccc_init(struct sock *sk)
-+{
-+ if (mptcp(tcp_sk(sk))) {
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ mptcp_set_alpha(mptcp_meta_sk(sk), 1);
-+ }
-+ /* If this is not an MPTCP socket, behave like reno: return */
-+}
-+
-+static void mptcp_ccc_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_LOSS)
-+ mptcp_ccc_recalc_alpha(sk);
-+}
-+
-+static void mptcp_ccc_set_state(struct sock *sk, u8 ca_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ mptcp_set_forced(mptcp_meta_sk(sk), 1);
-+}
-+
-+static void mptcp_ccc_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ int snd_cwnd;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ /* In "safe" area, increase. */
-+ tcp_slow_start(tp, acked);
-+ mptcp_ccc_recalc_alpha(sk);
-+ return;
-+ }
-+
-+ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
-+ mptcp_ccc_recalc_alpha(sk);
-+ mptcp_set_forced(mptcp_meta_sk(sk), 0);
-+ }
-+
-+ if (mpcb->cnt_established > 1) {
-+ u64 alpha = mptcp_get_alpha(mptcp_meta_sk(sk));
-+
-+ /* This may happen if, at initialization time, the mpcb
-+ * was not yet attached to the sock, and thus
-+ * initializing alpha failed.
-+ */
-+ if (unlikely(!alpha))
-+ alpha = 1;
-+
-+ snd_cwnd = (int)div_u64((u64)mptcp_ccc_scale(1, alpha_scale),
-+ alpha);
-+
-+ /* snd_cwnd_cnt >= max (scale * tot_cwnd / alpha, cwnd)
-+ * Thus, we select here the max value.
-+ */
-+ if (snd_cwnd < tp->snd_cwnd)
-+ snd_cwnd = tp->snd_cwnd;
-+ } else {
-+ snd_cwnd = tp->snd_cwnd;
-+ }
-+
-+ if (tp->snd_cwnd_cnt >= snd_cwnd) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
-+ tp->snd_cwnd++;
-+ mptcp_ccc_recalc_alpha(sk);
-+ }
-+
-+ tp->snd_cwnd_cnt = 0;
-+ } else {
-+ tp->snd_cwnd_cnt++;
-+ }
-+}
-+
-+static struct tcp_congestion_ops mptcp_ccc = {
-+ .init = mptcp_ccc_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_ccc_cong_avoid,
-+ .cwnd_event = mptcp_ccc_cwnd_event,
-+ .set_state = mptcp_ccc_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "lia",
-+};
-+
-+static int __init mptcp_ccc_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_ccc) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_ccc);
-+}
-+
-+static void __exit mptcp_ccc_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_ccc);
-+}
-+
-+module_init(mptcp_ccc_register);
-+module_exit(mptcp_ccc_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch, Sébastien Barré");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP LINKED INCREASE CONGESTION CONTROL ALGORITHM");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_ctrl.c b/net/mptcp/mptcp_ctrl.c
-new file mode 100644
-index 000000000000..28dfa0479f5e
---- /dev/null
-+++ b/net/mptcp/mptcp_ctrl.c
-@@ -0,0 +1,2401 @@
-+/*
-+ * MPTCP implementation - MPTCP-control
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <net/inet_common.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/ip6_route.h>
-+#include <net/mptcp_v6.h>
-+#endif
-+#include <net/sock.h>
-+#include <net/tcp.h>
-+#include <net/tcp_states.h>
-+#include <net/transp_v6.h>
-+#include <net/xfrm.h>
-+
-+#include <linux/cryptohash.h>
-+#include <linux/kconfig.h>
-+#include <linux/module.h>
-+#include <linux/netpoll.h>
-+#include <linux/list.h>
-+#include <linux/jhash.h>
-+#include <linux/tcp.h>
-+#include <linux/net.h>
-+#include <linux/in.h>
-+#include <linux/random.h>
-+#include <linux/inetdevice.h>
-+#include <linux/workqueue.h>
-+#include <linux/atomic.h>
-+#include <linux/sysctl.h>
-+
-+static struct kmem_cache *mptcp_sock_cache __read_mostly;
-+static struct kmem_cache *mptcp_cb_cache __read_mostly;
-+static struct kmem_cache *mptcp_tw_cache __read_mostly;
-+
-+int sysctl_mptcp_enabled __read_mostly = 1;
-+int sysctl_mptcp_checksum __read_mostly = 1;
-+int sysctl_mptcp_debug __read_mostly;
-+EXPORT_SYMBOL(sysctl_mptcp_debug);
-+int sysctl_mptcp_syn_retries __read_mostly = 3;
-+
-+bool mptcp_init_failed __read_mostly;
-+
-+struct static_key mptcp_static_key = STATIC_KEY_INIT_FALSE;
-+EXPORT_SYMBOL(mptcp_static_key);
-+
-+static int proc_mptcp_path_manager(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_PM_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_path_manager(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_path_manager(val);
-+ return ret;
-+}
-+
-+static int proc_mptcp_scheduler(ctl_table *ctl, int write,
-+ void __user *buffer, size_t *lenp,
-+ loff_t *ppos)
-+{
-+ char val[MPTCP_SCHED_NAME_MAX];
-+ ctl_table tbl = {
-+ .data = val,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ };
-+ int ret;
-+
-+ mptcp_get_default_scheduler(val);
-+
-+ ret = proc_dostring(&tbl, write, buffer, lenp, ppos);
-+ if (write && ret == 0)
-+ ret = mptcp_set_default_scheduler(val);
-+ return ret;
-+}
-+
-+static struct ctl_table mptcp_table[] = {
-+ {
-+ .procname = "mptcp_enabled",
-+ .data = &sysctl_mptcp_enabled,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_checksum",
-+ .data = &sysctl_mptcp_checksum,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_debug",
-+ .data = &sysctl_mptcp_debug,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_syn_retries",
-+ .data = &sysctl_mptcp_syn_retries,
-+ .maxlen = sizeof(int),
-+ .mode = 0644,
-+ .proc_handler = &proc_dointvec
-+ },
-+ {
-+ .procname = "mptcp_path_manager",
-+ .mode = 0644,
-+ .maxlen = MPTCP_PM_NAME_MAX,
-+ .proc_handler = proc_mptcp_path_manager,
-+ },
-+ {
-+ .procname = "mptcp_scheduler",
-+ .mode = 0644,
-+ .maxlen = MPTCP_SCHED_NAME_MAX,
-+ .proc_handler = proc_mptcp_scheduler,
-+ },
-+ { }
-+};
-+
-+static inline u32 mptcp_hash_tk(u32 token)
-+{
-+ return token % MPTCP_HASH_SIZE;
-+}
-+
-+struct hlist_nulls_head tk_hashtable[MPTCP_HASH_SIZE];
-+EXPORT_SYMBOL(tk_hashtable);
-+
-+/* This second hashtable is needed to retrieve request socks
-+ * created as a result of a join request. While the SYN contains
-+ * the token, the final ack does not, so we need a separate hashtable
-+ * to retrieve the mpcb.
-+ */
-+struct hlist_nulls_head mptcp_reqsk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_reqsk_hlock; /* hashtable protection */
-+
-+/* The following hash table is used to avoid collision of token */
-+static struct hlist_nulls_head mptcp_reqsk_tk_htb[MPTCP_HASH_SIZE];
-+spinlock_t mptcp_tk_hashlock; /* hashtable protection */
-+
-+static bool mptcp_reqsk_find_tk(const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct mptcp_request_sock *mtreqsk;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreqsk, node,
-+ &mptcp_reqsk_tk_htb[hash], hash_entry) {
-+ if (token == mtreqsk->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_reqsk_insert_tk(struct request_sock *reqsk, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(reqsk)->hash_entry,
-+ &mptcp_reqsk_tk_htb[hash]);
-+}
-+
-+static void mptcp_reqsk_remove_tk(const struct request_sock *reqsk)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(reqsk)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+void mptcp_reqsk_destructor(struct request_sock *req)
-+{
-+ if (!mptcp_rsk(req)->is_sub) {
-+ if (in_softirq()) {
-+ mptcp_reqsk_remove_tk(req);
-+ } else {
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&mptcp_rsk(req)->hash_entry);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+ }
-+ } else {
-+ mptcp_hash_request_remove(req);
-+ }
-+}
-+
-+static void __mptcp_hash_insert(struct tcp_sock *meta_tp, const u32 token)
-+{
-+ u32 hash = mptcp_hash_tk(token);
-+ hlist_nulls_add_head_rcu(&meta_tp->tk_table, &tk_hashtable[hash]);
-+ meta_tp->inside_tk_table = 1;
-+}
-+
-+static bool mptcp_find_token(u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ const struct hlist_nulls_node *node;
-+
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash], tk_table) {
-+ if (token == meta_tp->mptcp_loc_token)
-+ return true;
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+ return false;
-+}
-+
-+static void mptcp_set_key_reqsk(struct request_sock *req,
-+ const struct sk_buff *skb)
-+{
-+ const struct inet_request_sock *ireq = inet_rsk(req);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ mtreq->mptcp_loc_key = mptcp_v4_get_key(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ mtreq->mptcp_loc_key = mptcp_v6_get_key(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ htons(ireq->ir_num),
-+ ireq->ir_rmt_port);
-+#endif
-+ }
-+
-+ mptcp_key_sha1(mtreq->mptcp_loc_key, &mtreq->mptcp_loc_token, NULL);
-+}
-+
-+/* New MPTCP-connection request, prepare a new token for the meta-socket that
-+ * will be created in mptcp_check_req_master(), and store the received token.
-+ */
-+void mptcp_reqsk_new_mptcp(struct request_sock *req,
-+ const struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+
-+ inet_rsk(req)->saw_mpc = 1;
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_reqsk(req, skb);
-+ } while (mptcp_reqsk_find_tk(mtreq->mptcp_loc_token) ||
-+ mptcp_find_token(mtreq->mptcp_loc_token));
-+
-+ mptcp_reqsk_insert_tk(req, mtreq->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ mtreq->mptcp_rem_key = mopt->mptcp_key;
-+}
-+
-+static void mptcp_set_key_sk(const struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_sock *isk = inet_sk(sk);
-+
-+ if (sk->sk_family == AF_INET)
-+ tp->mptcp_loc_key = mptcp_v4_get_key(isk->inet_saddr,
-+ isk->inet_daddr,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ tp->mptcp_loc_key = mptcp_v6_get_key(inet6_sk(sk)->saddr.s6_addr32,
-+ sk->sk_v6_daddr.s6_addr32,
-+ isk->inet_sport,
-+ isk->inet_dport);
-+#endif
-+
-+ mptcp_key_sha1(tp->mptcp_loc_key,
-+ &tp->mptcp_loc_token, NULL);
-+}
-+
-+void mptcp_connect_init(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ do {
-+ mptcp_set_key_sk(sk);
-+ } while (mptcp_reqsk_find_tk(tp->mptcp_loc_token) ||
-+ mptcp_find_token(tp->mptcp_loc_token));
-+
-+ __mptcp_hash_insert(tp, tp->mptcp_loc_token);
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+/**
-+ * This function takes a reference on the meta-socket it returns.
-+ * It is the responsibility of the caller to release that reference
-+ * when done with the socket.
-+ */
-+struct sock *mptcp_hash_find(const struct net *net, const u32 token)
-+{
-+ const u32 hash = mptcp_hash_tk(token);
-+ const struct tcp_sock *meta_tp;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[hash],
-+ tk_table) {
-+ meta_sk = (struct sock *)meta_tp;
-+ if (token == meta_tp->mptcp_loc_token &&
-+ net_eq(net, sock_net(meta_sk))) {
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ goto out;
-+ if (unlikely(token != meta_tp->mptcp_loc_token ||
-+ !net_eq(net, sock_net(meta_sk)))) {
-+ sock_gen_put(meta_sk);
-+ goto begin;
-+ }
-+ goto found;
-+ }
-+ }
-+ /* A TCP-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash)
-+ goto begin;
-+out:
-+ meta_sk = NULL;
-+found:
-+ rcu_read_unlock();
-+ return meta_sk;
-+}
-+
-+void mptcp_hash_remove_bh(struct tcp_sock *meta_tp)
-+{
-+ /* remove from the token hashtable */
-+ rcu_read_lock_bh();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock_bh();
-+}
-+
-+void mptcp_hash_remove(struct tcp_sock *meta_tp)
-+{
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+ hlist_nulls_del_init_rcu(&meta_tp->tk_table);
-+ meta_tp->inside_tk_table = 0;
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+}
-+
-+struct sock *mptcp_select_ack_sock(const struct sock *meta_sk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk, *rttsk = NULL, *lastsk = NULL;
-+ u32 min_time = 0, last_active = 0;
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u32 elapsed;
-+
-+ if (!mptcp_sk_can_send_ack(sk) || tp->pf)
-+ continue;
-+
-+ elapsed = keepalive_time_elapsed(tp);
-+
-+ /* We take the one with the lowest RTT within a reasonable
-+ * (meta-RTO) timeframe.
-+ */
-+ if (elapsed < inet_csk(meta_sk)->icsk_rto) {
-+ if (!min_time || tp->srtt_us < min_time) {
-+ min_time = tp->srtt_us;
-+ rttsk = sk;
-+ }
-+ continue;
-+ }
-+
-+ /* Otherwise, we just take the most recent active */
-+ if (!rttsk && (!last_active || elapsed < last_active)) {
-+ last_active = elapsed;
-+ lastsk = sk;
-+ }
-+ }
-+
-+ if (rttsk)
-+ return rttsk;
-+
-+ return lastsk;
-+}
-+EXPORT_SYMBOL(mptcp_select_ack_sock);
-+
-+static void mptcp_sock_def_error_report(struct sock *sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (!sock_flag(sk, SOCK_DEAD))
-+ mptcp_sub_close(sk, 0);
-+
-+ if (mpcb->infinite_mapping_rcv || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping) {
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ meta_sk->sk_err = sk->sk_err;
-+ meta_sk->sk_err_soft = sk->sk_err_soft;
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_error_report(meta_sk);
-+
-+ tcp_done(meta_sk);
-+ }
-+
-+ sk->sk_err = 0;
-+}
-+
-+static void mptcp_mpcb_put(struct mptcp_cb *mpcb)
-+{
-+ if (atomic_dec_and_test(&mpcb->mpcb_refcnt)) {
-+ mptcp_cleanup_path_manager(mpcb);
-+ mptcp_cleanup_scheduler(mpcb);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ }
-+}
-+
-+static void mptcp_sock_destruct(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ inet_sock_destruct(sk);
-+
-+ if (!is_meta_sk(sk) && !tp->was_meta_sk) {
-+ BUG_ON(!hlist_unhashed(&tp->mptcp->cb_list));
-+
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ tp->mptcp = NULL;
-+
-+ /* Taken when mpcb pointer was set */
-+ sock_put(mptcp_meta_sk(sk));
-+ mptcp_mpcb_put(tp->mpcb);
-+ } else {
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct mptcp_tw *mptw;
-+
-+ /* The mpcb is disappearing - we can make the final
-+ * update to the rcv_nxt of the time-wait-sock and remove
-+ * its reference to the mpcb.
-+ */
-+ spin_lock_bh(&mpcb->tw_lock);
-+ list_for_each_entry_rcu(mptw, &mpcb->tw_list, list) {
-+ list_del_rcu(&mptw->list);
-+ mptw->in_list = 0;
-+ mptcp_mpcb_put(mpcb);
-+ rcu_assign_pointer(mptw->mpcb, NULL);
-+ }
-+ spin_unlock_bh(&mpcb->tw_lock);
-+
-+ mptcp_mpcb_put(mpcb);
-+
-+ mptcp_debug("%s destroying meta-sk\n", __func__);
-+ }
-+
-+ WARN_ON(!static_key_false(&mptcp_static_key));
-+ /* Must be the last call, because is_meta_sk() above still needs the
-+ * static key
-+ */
-+ static_key_slow_dec(&mptcp_static_key);
-+}
-+
-+void mptcp_destroy_sock(struct sock *sk)
-+{
-+ if (is_meta_sk(sk)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ __skb_queue_purge(&tcp_sk(sk)->mpcb->reinject_queue);
-+ mptcp_purge_ofo_queue(tcp_sk(sk));
-+
-+ /* We have to close all remaining subflows. Normally, they
-+ * should all be about to get closed. But, if the kernel is
-+ * forcing a closure (e.g., tcp_write_err), the subflows might
-+ * not have been closed properly (as we are waiting for the
-+ * DATA_ACK of the DATA_FIN).
-+ */
-+ mptcp_for_each_sk_safe(tcp_sk(sk)->mpcb, sk_it, tmpsk) {
-+ /* tcp_close() was already called - we are waiting for the
-+ * graceful closure, or we are retransmitting fast-close on
-+ * the subflow. The reset (or timeout) will kill the
-+ * subflow.
-+ */
-+ if (tcp_sk(sk_it)->closing ||
-+ tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ /* Let the delayed work run first, to prevent the time-wait state */
-+ if (delayed_work_pending(&tcp_sk(sk_it)->mptcp->work))
-+ continue;
-+
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+
-+ mptcp_delete_synack_timer(sk);
-+ } else {
-+ mptcp_del_sock(sk);
-+ }
-+}
-+
-+static void mptcp_set_state(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* Meta is not yet established - wake up the application */
-+ if ((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV) &&
-+ sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_set_state(meta_sk, TCP_ESTABLISHED);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+ sk_wake_async(meta_sk, SOCK_WAKE_IO, POLL_OUT);
-+ }
-+ }
-+
-+ if (sk->sk_state == TCP_ESTABLISHED) {
-+ tcp_sk(sk)->mptcp->establish_increased = 1;
-+ tcp_sk(sk)->mpcb->cnt_established++;
-+ }
-+}
-+
-+void mptcp_init_congestion_control(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct inet_connection_sock *meta_icsk = inet_csk(mptcp_meta_sk(sk));
-+ const struct tcp_congestion_ops *ca = meta_icsk->icsk_ca_ops;
-+
-+ /* The application didn't set the congestion control to use -
-+ * fall back to the default one.
-+ */
-+ if (ca == &tcp_init_congestion_ops)
-+ goto use_default;
-+
-+ /* Use the same congestion control as set by the user. If the
-+ * module is not available, fall back to the default one.
-+ */
-+ if (!try_module_get(ca->owner)) {
-+ pr_warn("%s: fallback to the system default CC\n", __func__);
-+ goto use_default;
-+ }
-+
-+ icsk->icsk_ca_ops = ca;
-+ if (icsk->icsk_ca_ops->init)
-+ icsk->icsk_ca_ops->init(sk);
-+
-+ return;
-+
-+use_default:
-+ icsk->icsk_ca_ops = &tcp_init_congestion_ops;
-+ tcp_init_congestion_control(sk);
-+}
-+
-+u32 mptcp_secret[MD5_MESSAGE_BYTES / 4] ____cacheline_aligned;
-+u32 mptcp_seed = 0;
-+
-+void mptcp_key_sha1(u64 key, u32 *token, u64 *idsn)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u32 mptcp_hashed_key[SHA_DIGEST_WORDS];
-+ u8 input[64];
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Initialize input with appropriate padding */
-+ memset(&input[9], 0, sizeof(input) - 10); /* -10, because the last byte
-+ * is explicitly set too
-+ */
-+ memcpy(input, &key, sizeof(key)); /* Copy key to the msg beginning */
-+ input[8] = 0x80; /* Padding: First bit after message = 1 */
-+ input[63] = 0x40; /* Padding: Length of the message = 64 bits */
-+
-+ sha_init(mptcp_hashed_key);
-+ sha_transform(mptcp_hashed_key, input, workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ mptcp_hashed_key[i] = cpu_to_be32(mptcp_hashed_key[i]);
-+
-+ if (token)
-+ *token = mptcp_hashed_key[0];
-+ if (idsn)
-+ *idsn = *((u64 *)&mptcp_hashed_key[3]);
-+}
-+
-+void mptcp_hmac_sha1(u8 *key_1, u8 *key_2, u8 *rand_1, u8 *rand_2,
-+ u32 *hash_out)
-+{
-+ u32 workspace[SHA_WORKSPACE_WORDS];
-+ u8 input[128]; /* 2 512-bit blocks */
-+ int i;
-+
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ /* Generate key xored with ipad */
-+ memset(input, 0x36, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], rand_1, 4);
-+ memcpy(&input[68], rand_2, 4);
-+ input[72] = 0x80; /* Padding: First bit after message = 1 */
-+ memset(&input[73], 0, 53);
-+
-+ /* Padding: Length of the message = 512 + 64 bits */
-+ input[126] = 0x02;
-+ input[127] = 0x40;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+
-+ /* Prepare second part of hmac */
-+ memset(input, 0x5C, 64);
-+ for (i = 0; i < 8; i++)
-+ input[i] ^= key_1[i];
-+ for (i = 0; i < 8; i++)
-+ input[i + 8] ^= key_2[i];
-+
-+ memcpy(&input[64], hash_out, 20);
-+ input[84] = 0x80;
-+ memset(&input[85], 0, 41);
-+
-+ /* Padding: Length of the message = 512 + 160 bits */
-+ input[126] = 0x02;
-+ input[127] = 0xA0;
-+
-+ sha_init(hash_out);
-+ sha_transform(hash_out, input, workspace);
-+ memset(workspace, 0, sizeof(workspace));
-+
-+ sha_transform(hash_out, &input[64], workspace);
-+
-+ for (i = 0; i < 5; i++)
-+ hash_out[i] = cpu_to_be32(hash_out[i]);
-+}
-+
-+static void mptcp_mpcb_inherit_sockopts(struct sock *meta_sk, struct sock *master_sk)
-+{
-+ /* Socket-options handled by sk_clone_lock while creating the meta-sk.
-+ * ======
-+ * SO_SNDBUF, SO_SNDBUFFORCE, SO_RCVBUF, SO_RCVBUFFORCE, SO_RCVLOWAT,
-+ * SO_RCVTIMEO, SO_SNDTIMEO, SO_ATTACH_FILTER, SO_DETACH_FILTER,
-+ * TCP_NODELAY, TCP_CORK
-+ *
-+ * Socket-options handled in this function here
-+ * ======
-+ * TCP_DEFER_ACCEPT
-+ * SO_KEEPALIVE
-+ *
-+ * Socket-options on the todo-list
-+ * ======
-+ * SO_BINDTODEVICE - should probably prevent creation of new subsocks
-+ * across other devices. - what about the api-draft?
-+ * SO_DEBUG
-+ * SO_REUSEADDR - probably we don't care about this
-+ * SO_DONTROUTE, SO_BROADCAST
-+ * SO_OOBINLINE
-+ * SO_LINGER
-+ * SO_TIMESTAMP* - I don't think this is of concern for a SOCK_STREAM
-+ * SO_PASSSEC - I don't think this is of concern for a SOCK_STREAM
-+ * SO_RXQ_OVFL
-+ * TCP_COOKIE_TRANSACTIONS
-+ * TCP_MAXSEG
-+ * TCP_THIN_* - Handled by sk_clone_lock, but we need to support this
-+ * in mptcp_retransmit_timer. AND we need to check what is
-+ * about the subsockets.
-+ * TCP_LINGER2
-+ * TCP_WINDOW_CLAMP
-+ * TCP_USER_TIMEOUT
-+ * TCP_MD5SIG
-+ *
-+ * Socket-options of no concern for the meta-socket (but for the subsocket)
-+ * ======
-+ * SO_PRIORITY
-+ * SO_MARK
-+ * TCP_CONGESTION
-+ * TCP_SYNCNT
-+ * TCP_QUICKACK
-+ */
-+
-+ /* DEFER_ACCEPT should not be set on the meta, as we want to accept new subflows directly */
-+ inet_csk(meta_sk)->icsk_accept_queue.rskq_defer_accept = 0;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(meta_sk, SOCK_KEEPOPEN)) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ keepalive_time_when(tcp_sk(meta_sk)));
-+ sock_reset_flag(master_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(master_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(master_sk)->recverr = 0;
-+}
-+
-+static void mptcp_sub_inherit_sockopts(const struct sock *meta_sk, struct sock *sub_sk)
-+{
-+ /* IP_TOS also goes to the subflow. */
-+ if (inet_sk(sub_sk)->tos != inet_sk(meta_sk)->tos) {
-+ inet_sk(sub_sk)->tos = inet_sk(meta_sk)->tos;
-+ sub_sk->sk_priority = meta_sk->sk_priority;
-+ sk_dst_reset(sub_sk);
-+ }
-+
-+ /* Inherit SO_REUSEADDR */
-+ sub_sk->sk_reuse = meta_sk->sk_reuse;
-+
-+ /* Inherit snd/rcv-buffer locks */
-+ sub_sk->sk_userlocks = meta_sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
-+
-+ /* Nagle/Cork is forced off on the subflows. It is handled at the meta-layer */
-+ tcp_sk(sub_sk)->nonagle = TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
-+
-+ /* Keepalives are handled entirely at the MPTCP-layer */
-+ if (sock_flag(sub_sk, SOCK_KEEPOPEN)) {
-+ sock_reset_flag(sub_sk, SOCK_KEEPOPEN);
-+ inet_csk_delete_keepalive_timer(sub_sk);
-+ }
-+
-+ /* Do not propagate subflow-errors up to the MPTCP-layer */
-+ inet_sk(sub_sk)->recverr = 0;
-+}
-+
-+int mptcp_backlog_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ /* skb->sk may be NULL if we receive a packet immediately after the
-+ * SYN/ACK + MP_CAPABLE.
-+ */
-+ struct sock *sk = skb->sk ? skb->sk : meta_sk;
-+ int ret = 0;
-+
-+ skb->sk = NULL;
-+
-+ if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt))) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ if (sk->sk_family == AF_INET)
-+ ret = tcp_v4_do_rcv(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ ret = tcp_v6_do_rcv(sk, skb);
-+#endif
-+
-+ sock_put(sk);
-+ return ret;
-+}
-+
-+struct lock_class_key meta_key;
-+struct lock_class_key meta_slock_key;
-+
-+static void mptcp_synack_timer_handler(unsigned long data)
-+{
-+ struct sock *meta_sk = (struct sock *) data;
-+ struct listen_sock *lopt = inet_csk(meta_sk)->icsk_accept_queue.listen_opt;
-+
-+ /* Only process if socket is not in use. */
-+ bh_lock_sock(meta_sk);
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later. */
-+ mptcp_reset_synack_timer(meta_sk, HZ/20);
-+ goto out;
-+ }
-+
-+ /* May happen if the queue got destroyed in mptcp_close() */
-+ if (!lopt)
-+ goto out;
-+
-+ inet_csk_reqsk_queue_prune(meta_sk, TCP_SYNQ_INTERVAL,
-+ TCP_TIMEOUT_INIT, TCP_RTO_MAX);
-+
-+ if (lopt->qlen)
-+ mptcp_reset_synack_timer(meta_sk, TCP_SYNQ_INTERVAL);
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+}
-+
-+static const struct tcp_sock_ops mptcp_meta_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = mptcp_send_fin,
-+ .write_xmit = mptcp_write_xmit,
-+ .send_active_reset = mptcp_send_active_reset,
-+ .write_wakeup = mptcp_write_wakeup,
-+ .prune_ofo_queue = mptcp_prune_ofo_queue,
-+ .retransmit_timer = mptcp_retransmit_timer,
-+ .time_wait = mptcp_time_wait,
-+ .cleanup_rbuf = mptcp_cleanup_rbuf,
-+};
-+
-+static const struct tcp_sock_ops mptcp_sub_specific = {
-+ .__select_window = __mptcp_select_window,
-+ .select_window = mptcp_select_window,
-+ .select_initial_window = mptcp_select_initial_window,
-+ .init_buffer_space = mptcp_init_buffer_space,
-+ .set_rto = mptcp_tcp_set_rto,
-+ .should_expand_sndbuf = mptcp_should_expand_sndbuf,
-+ .init_congestion_control = mptcp_init_congestion_control,
-+ .send_fin = tcp_send_fin,
-+ .write_xmit = tcp_write_xmit,
-+ .send_active_reset = tcp_send_active_reset,
-+ .write_wakeup = tcp_write_wakeup,
-+ .prune_ofo_queue = tcp_prune_ofo_queue,
-+ .retransmit_timer = tcp_retransmit_timer,
-+ .time_wait = tcp_time_wait,
-+ .cleanup_rbuf = tcp_cleanup_rbuf,
-+};
-+
-+static int mptcp_alloc_mpcb(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct mptcp_cb *mpcb;
-+ struct sock *master_sk;
-+ struct inet_connection_sock *master_icsk, *meta_icsk = inet_csk(meta_sk);
-+ struct tcp_sock *master_tp, *meta_tp = tcp_sk(meta_sk);
-+ u64 idsn;
-+
-+ dst_release(meta_sk->sk_rx_dst);
-+ meta_sk->sk_rx_dst = NULL;
-+ /* This flag is set to announce sock_lock_init to
-+ * reclassify the lock-class of the master socket.
-+ */
-+ meta_tp->is_master_sk = 1;
-+ master_sk = sk_clone_lock(meta_sk, GFP_ATOMIC | __GFP_ZERO);
-+ meta_tp->is_master_sk = 0;
-+ if (!master_sk)
-+ return -ENOBUFS;
-+
-+ master_tp = tcp_sk(master_sk);
-+ master_icsk = inet_csk(master_sk);
-+
-+ mpcb = kmem_cache_zalloc(mptcp_cb_cache, GFP_ATOMIC);
-+ if (!mpcb) {
-+ /* sk_free() (and __sk_free()) requires wmem_alloc to be 1.
-+ * All the rest is set to 0 thanks to __GFP_ZERO above.
-+ */
-+ atomic_set(&master_sk->sk_wmem_alloc, 1);
-+ sk_free(master_sk);
-+ return -ENOBUFS;
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (meta_icsk->icsk_af_ops == &mptcp_v6_mapped) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->ipv6_mc_list = NULL;
-+ newnp->ipv6_ac_list = NULL;
-+ newnp->ipv6_fl_list = NULL;
-+ newnp->opt = NULL;
-+ newnp->pktoptions = NULL;
-+ (void)xchg(&newnp->rxpmtu, NULL);
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct ipv6_pinfo *newnp, *np = inet6_sk(meta_sk);
-+
-+ inet_sk(master_sk)->pinet6 = &((struct tcp6_sock *)master_sk)->inet6;
-+
-+ newnp = inet6_sk(master_sk);
-+ memcpy(newnp, np, sizeof(struct ipv6_pinfo));
-+
-+ newnp->hop_limit = -1;
-+ newnp->mcast_hops = IPV6_DEFAULT_MCASTHOPS;
-+ newnp->mc_loop = 1;
-+ newnp->pmtudisc = IPV6_PMTUDISC_WANT;
-+ newnp->ipv6only = sock_net(master_sk)->ipv6.sysctl.bindv6only;
-+ }
-+#endif
-+
-+ meta_tp->mptcp = NULL;
-+
-+ /* Store the keys and generate the peer's token */
-+ mpcb->mptcp_loc_key = meta_tp->mptcp_loc_key;
-+ mpcb->mptcp_loc_token = meta_tp->mptcp_loc_token;
-+
-+ /* Generate Initial data-sequence-numbers */
-+ mptcp_key_sha1(mpcb->mptcp_loc_key, NULL, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->snd_high_order[0] = idsn >> 32;
-+ mpcb->snd_high_order[1] = mpcb->snd_high_order[0] - 1;
-+
-+ meta_tp->write_seq = (u32)idsn;
-+ meta_tp->snd_sml = meta_tp->write_seq;
-+ meta_tp->snd_una = meta_tp->write_seq;
-+ meta_tp->snd_nxt = meta_tp->write_seq;
-+ meta_tp->pushed_seq = meta_tp->write_seq;
-+ meta_tp->snd_up = meta_tp->write_seq;
-+
-+ mpcb->mptcp_rem_key = remote_key;
-+ mptcp_key_sha1(mpcb->mptcp_rem_key, &mpcb->mptcp_rem_token, &idsn);
-+ idsn = ntohll(idsn) + 1;
-+ mpcb->rcv_high_order[0] = idsn >> 32;
-+ mpcb->rcv_high_order[1] = mpcb->rcv_high_order[0] + 1;
-+ meta_tp->copied_seq = (u32) idsn;
-+ meta_tp->rcv_nxt = (u32) idsn;
-+ meta_tp->rcv_wup = (u32) idsn;
-+
-+ meta_tp->snd_wl1 = meta_tp->rcv_nxt - 1;
-+ meta_tp->snd_wnd = window;
-+ meta_tp->retrans_stamp = 0; /* Set in tcp_connect() */
-+
-+ meta_tp->packets_out = 0;
-+ meta_icsk->icsk_probes_out = 0;
-+
-+ /* Set mptcp-pointers */
-+ master_tp->mpcb = mpcb;
-+ master_tp->meta_sk = meta_sk;
-+ meta_tp->mpcb = mpcb;
-+ meta_tp->meta_sk = meta_sk;
-+ mpcb->meta_sk = meta_sk;
-+ mpcb->master_sk = master_sk;
-+
-+ meta_tp->was_meta_sk = 0;
-+
-+ /* Initialize the queues */
-+ skb_queue_head_init(&mpcb->reinject_queue);
-+ skb_queue_head_init(&master_tp->out_of_order_queue);
-+ tcp_prequeue_init(master_tp);
-+ INIT_LIST_HEAD(&master_tp->tsq_node);
-+
-+ master_tp->tsq_flags = 0;
-+
-+ mutex_init(&mpcb->mpcb_mutex);
-+
-+ /* Init the accept_queue structure. We support a queue of 32 pending
-+ * connections; it does not need to be huge, since we only store
-+ * pending subflow creations here.
-+ */
-+ if (reqsk_queue_alloc(&meta_icsk->icsk_accept_queue, 32, GFP_ATOMIC)) {
-+ inet_put_port(master_sk);
-+ kmem_cache_free(mptcp_cb_cache, mpcb);
-+ sk_free(master_sk);
-+ return -ENOMEM;
-+ }
-+
-+ /* Redefine function-pointers as the meta-sk is now fully ready */
-+ static_key_slow_inc(&mptcp_static_key);
-+ meta_tp->mpc = 1;
-+ meta_tp->ops = &mptcp_meta_specific;
-+
-+ meta_sk->sk_backlog_rcv = mptcp_backlog_rcv;
-+ meta_sk->sk_destruct = mptcp_sock_destruct;
-+
-+ /* Meta-level retransmit timer */
-+ meta_icsk->icsk_rto *= 2; /* Double of initial - rto */
-+
-+ tcp_init_xmit_timers(master_sk);
-+ /* Has been set for sending out the SYN */
-+ inet_csk_clear_xmit_timer(meta_sk, ICSK_TIME_RETRANS);
-+
-+ if (!meta_tp->inside_tk_table) {
-+ /* Add the meta_tp to the token hashtable - coming from server-side */
-+ rcu_read_lock();
-+ spin_lock(&mptcp_tk_hashlock);
-+
-+ __mptcp_hash_insert(meta_tp, mpcb->mptcp_loc_token);
-+
-+ spin_unlock(&mptcp_tk_hashlock);
-+ rcu_read_unlock();
-+ }
-+ master_tp->inside_tk_table = 0;
-+
-+ /* Init time-wait stuff */
-+ INIT_LIST_HEAD(&mpcb->tw_list);
-+ spin_lock_init(&mpcb->tw_lock);
-+
-+ INIT_HLIST_HEAD(&mpcb->callback_list);
-+
-+ mptcp_mpcb_inherit_sockopts(meta_sk, master_sk);
-+
-+ mpcb->orig_sk_rcvbuf = meta_sk->sk_rcvbuf;
-+ mpcb->orig_sk_sndbuf = meta_sk->sk_sndbuf;
-+ mpcb->orig_window_clamp = meta_tp->window_clamp;
-+
-+ /* The meta is directly linked - set refcnt to 1 */
-+ atomic_set(&mpcb->mpcb_refcnt, 1);
-+
-+ mptcp_init_path_manager(mpcb);
-+ mptcp_init_scheduler(mpcb);
-+
-+ setup_timer(&mpcb->synack_timer, mptcp_synack_timer_handler,
-+ (unsigned long)meta_sk);
-+
-+ mptcp_debug("%s: created mpcb with token %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ return 0;
-+}
-+
-+void mptcp_fallback_meta_sk(struct sock *meta_sk)
-+{
-+ kfree(inet_csk(meta_sk)->icsk_accept_queue.listen_opt);
-+ kmem_cache_free(mptcp_cb_cache, tcp_sk(meta_sk)->mpcb);
-+}
-+
-+int mptcp_add_sock(struct sock *meta_sk, struct sock *sk, u8 loc_id, u8 rem_id,
-+ gfp_t flags)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ tp->mptcp = kmem_cache_zalloc(mptcp_sock_cache, flags);
-+ if (!tp->mptcp)
-+ return -ENOMEM;
-+
-+ tp->mptcp->path_index = mptcp_set_new_pathindex(mpcb);
-+ /* No more space for more subflows? */
-+ if (!tp->mptcp->path_index) {
-+ kmem_cache_free(mptcp_sock_cache, tp->mptcp);
-+ return -EPERM;
-+ }
-+
-+ INIT_HLIST_NODE(&tp->mptcp->cb_list);
-+
-+ tp->mptcp->tp = tp;
-+ tp->mpcb = mpcb;
-+ tp->meta_sk = meta_sk;
-+
-+ static_key_slow_inc(&mptcp_static_key);
-+ tp->mpc = 1;
-+ tp->ops = &mptcp_sub_specific;
-+
-+ tp->mptcp->loc_id = loc_id;
-+ tp->mptcp->rem_id = rem_id;
-+ if (mpcb->sched_ops->init)
-+ mpcb->sched_ops->init(sk);
-+
-+ /* The corresponding sock_put is in mptcp_sock_destruct(). It cannot be
-+ * included in mptcp_del_sock(), because the mpcb must remain alive
-+ * until the last subsocket is completely destroyed.
-+ */
-+ sock_hold(meta_sk);
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tp->mptcp->next = mpcb->connection_list;
-+ mpcb->connection_list = tp;
-+ tp->mptcp->attached = 1;
-+
-+ mpcb->cnt_subflows++;
-+ atomic_add(atomic_read(&((struct sock *)tp)->sk_rmem_alloc),
-+ &meta_sk->sk_rmem_alloc);
-+
-+ mptcp_sub_inherit_sockopts(meta_sk, sk);
-+ INIT_DELAYED_WORK(&tp->mptcp->work, mptcp_sub_close_wq);
-+
-+ /* As we successfully allocated the mptcp_tcp_sock, we have to
-+ * change the function-pointers here (for sk_destruct to work correctly)
-+ */
-+ sk->sk_error_report = mptcp_sock_def_error_report;
-+ sk->sk_data_ready = mptcp_data_ready;
-+ sk->sk_write_space = mptcp_write_space;
-+ sk->sk_state_change = mptcp_set_state;
-+ sk->sk_destruct = mptcp_sock_destruct;
-+
-+ if (sk->sk_family == AF_INET)
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI4:%d dst_addr:%pI4:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index,
-+ &((struct inet_sock *)tp)->inet_saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &((struct inet_sock *)tp)->inet_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ mptcp_debug("%s: token %#x pi %d, src_addr:%pI6:%d dst_addr:%pI6:%d, cnt_subflows now %d\n",
-+ __func__ , mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &inet6_sk(sk)->saddr,
-+ ntohs(((struct inet_sock *)tp)->inet_sport),
-+ &sk->sk_v6_daddr,
-+ ntohs(((struct inet_sock *)tp)->inet_dport),
-+ mpcb->cnt_subflows);
-+#endif
-+
-+ return 0;
-+}
-+
-+void mptcp_del_sock(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *tp_prev;
-+ struct mptcp_cb *mpcb;
-+
-+ if (!tp->mptcp || !tp->mptcp->attached)
-+ return;
-+
-+ mpcb = tp->mpcb;
-+ tp_prev = mpcb->connection_list;
-+
-+ mptcp_debug("%s: Removing subsock tok %#x pi:%d state %d is_meta? %d\n",
-+ __func__, mpcb->mptcp_loc_token, tp->mptcp->path_index,
-+ sk->sk_state, is_meta_sk(sk));
-+
-+ if (tp_prev == tp) {
-+ mpcb->connection_list = tp->mptcp->next;
-+ } else {
-+ for (; tp_prev && tp_prev->mptcp->next; tp_prev = tp_prev->mptcp->next) {
-+ if (tp_prev->mptcp->next == tp) {
-+ tp_prev->mptcp->next = tp->mptcp->next;
-+ break;
-+ }
-+ }
-+ }
-+ mpcb->cnt_subflows--;
-+ if (tp->mptcp->establish_increased)
-+ mpcb->cnt_established--;
-+
-+ tp->mptcp->next = NULL;
-+ tp->mptcp->attached = 0;
-+ mpcb->path_index_bits &= ~(1 << tp->mptcp->path_index);
-+
-+ if (!skb_queue_empty(&sk->sk_write_queue))
-+ mptcp_reinject_data(sk, 0);
-+
-+ if (is_master_tp(tp))
-+ mpcb->master_sk = NULL;
-+ else if (tp->mptcp->pre_established)
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+
-+ rcu_assign_pointer(inet_sk(sk)->inet_opt, NULL);
-+}
-+
-+/* Updates the metasocket ULID/port data, based on the given sock.
-+ * The argument sock must be the sock accessible to the application.
-+ * In this function, we update the meta socket info, based on the changes
-+ * in the application socket (bind, address allocation, ...)
-+ */
-+void mptcp_update_metasocket(struct sock *sk, const struct sock *meta_sk)
-+{
-+ if (tcp_sk(sk)->mpcb->pm_ops->new_session)
-+ tcp_sk(sk)->mpcb->pm_ops->new_session(meta_sk);
-+
-+ tcp_sk(sk)->mptcp->send_mp_prio = tcp_sk(sk)->mptcp->low_prio;
-+}
-+
-+/* Clean up the receive buffer for full frames taken by the user,
-+ * then send an ACK if necessary. COPIED is the number of bytes
-+ * tcp_recvmsg has given to the user so far, it speeds up the
-+ * calculation of whether or not we must ACK for the sake of
-+ * a window update.
-+ */
-+void mptcp_cleanup_rbuf(struct sock *meta_sk, int copied)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk;
-+ __u32 rcv_window_now = 0;
-+
-+ if (copied > 0 && !(meta_sk->sk_shutdown & RCV_SHUTDOWN)) {
-+ rcv_window_now = tcp_receive_window(meta_tp);
-+
-+ if (2 * rcv_window_now > meta_tp->window_clamp)
-+ rcv_window_now = 0;
-+ }
-+
-+ mptcp_for_each_sk(meta_tp->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (!mptcp_sk_can_send_ack(sk))
-+ continue;
-+
-+ if (!inet_csk_ack_scheduled(sk))
-+ goto second_part;
-+ /* Delayed ACKs frequently hit locked sockets during bulk
-+ * receive.
-+ */
-+ if (icsk->icsk_ack.blocked ||
-+ /* Once-per-two-segments ACK was not sent by tcp_input.c */
-+ tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
-+ /* If this read emptied read buffer, we send ACK, if
-+ * connection is not bidirectional, user drained
-+ * receive buffer and there was a small segment
-+ * in queue.
-+ */
-+ (copied > 0 &&
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
-+ ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
-+ !icsk->icsk_ack.pingpong)) &&
-+ !atomic_read(&meta_sk->sk_rmem_alloc))) {
-+ tcp_send_ack(sk);
-+ continue;
-+ }
-+
-+second_part:
-+ /* This is the second part of tcp_cleanup_rbuf */
-+ if (rcv_window_now) {
-+ __u32 new_window = tp->ops->__select_window(sk);
-+
-+ /* Send ACK now, if this read freed lots of space
-+ * in our buffer. Certainly, new_window is new window.
-+ * We can advertise it now, if it is not less than
-+ * current one.
-+ * "Lots" means "at least twice" here.
-+ */
-+ if (new_window && new_window >= 2 * rcv_window_now)
-+ tcp_send_ack(sk);
-+ }
-+ }
-+}
-+
-+static int mptcp_sub_send_fin(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(sk);
-+ int mss_now;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = tcp_current_mss(sk);
-+
-+ if (tcp_send_head(sk) != NULL) {
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ tp->write_seq++;
-+ } else {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (!skb)
-+ return 1;
-+
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
-+ tcp_init_nondata_skb(skb, tp->write_seq,
-+ TCPHDR_ACK | TCPHDR_FIN);
-+ tcp_queue_skb(sk, skb);
-+ }
-+ __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
-+
-+ return 0;
-+}
-+
-+void mptcp_sub_close_wq(struct work_struct *work)
-+{
-+ struct tcp_sock *tp = container_of(work, struct mptcp_tcp_sock, work.work)->tp;
-+ struct sock *sk = (struct sock *)tp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ mutex_lock(&tp->mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ if (sock_flag(sk, SOCK_DEAD))
-+ goto exit;
-+
-+ /* We come from tcp_disconnect. We are sure that meta_sk is set */
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ goto exit;
-+ }
-+
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK || sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&tp->mpcb->mpcb_mutex);
-+ sock_put(sk);
-+}
-+
-+void mptcp_sub_close(struct sock *sk, unsigned long delay)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct delayed_work *work = &tcp_sk(sk)->mptcp->work;
-+
-+ /* We are already closing - e.g., call from sock_def_error_report upon
-+ * tcp_disconnect in tcp_close.
-+ */
-+ if (tp->closing)
-+ return;
-+
-+ /* Work already scheduled? */
-+ if (work_pending(&work->work)) {
-+ /* Work present - who will be first? */
-+ if (jiffies + delay > work->timer.expires)
-+ return;
-+
-+ /* Try canceling - if it fails, work will be executed soon */
-+ if (!cancel_delayed_work(work))
-+ return;
-+ sock_put(sk);
-+ }
-+
-+ if (!delay) {
-+ unsigned char old_state = sk->sk_state;
-+
-+ /* If we are in user-context we can directly do the closing
-+ * procedure. No need to schedule a work-queue.
-+ */
-+ if (!in_softirq()) {
-+ if (sock_flag(sk, SOCK_DEAD))
-+ return;
-+
-+ if (!mptcp(tp)) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ return;
-+ }
-+
-+ if (mptcp_meta_sk(sk)->sk_shutdown == SHUTDOWN_MASK ||
-+ sk->sk_state == TCP_CLOSE) {
-+ tp->closing = 1;
-+ sock_rps_reset_flow(sk);
-+ tcp_close(sk, 0);
-+ } else if (tcp_close_state(sk)) {
-+ sk->sk_shutdown |= SEND_SHUTDOWN;
-+ tcp_send_fin(sk);
-+ }
-+
-+ return;
-+ }
-+
-+ /* We directly send the FIN, because it may take a long time
-+ * until the work-queue gets scheduled...
-+ *
-+ * If mptcp_sub_send_fin returns 1, it failed and thus we reset
-+ * the old state so that tcp_close will finally send the fin
-+ * in user-context.
-+ */
-+ if (!sk->sk_err && old_state != TCP_CLOSE &&
-+ tcp_close_state(sk) && mptcp_sub_send_fin(sk)) {
-+ if (old_state == TCP_ESTABLISHED)
-+ TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
-+ sk->sk_state = old_state;
-+ }
-+ }
-+
-+ sock_hold(sk);
-+ queue_delayed_work(mptcp_wq, work, delay);
-+}
-+
-+void mptcp_sub_force_close(struct sock *sk)
-+{
-+ /* The below tcp_done may have freed the socket, if it is already dead.
-+ * Thus, we are not allowed to access it afterwards. That's why
-+ * we have to store the dead-state in this local variable.
-+ */
-+ int sock_is_dead = sock_flag(sk, SOCK_DEAD);
-+
-+ tcp_sk(sk)->mp_killed = 1;
-+
-+ if (sk->sk_state != TCP_CLOSE)
-+ tcp_done(sk);
-+
-+ if (!sock_is_dead)
-+ mptcp_sub_close(sk, 0);
-+}
-+EXPORT_SYMBOL(mptcp_sub_force_close);
-+
-+/* Update the mpcb send window, based on the contributions
-+ * of each subflow
-+ */
-+void mptcp_update_sndbuf(const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk, *sk;
-+ int new_sndbuf = 0, old_sndbuf = meta_sk->sk_sndbuf;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ new_sndbuf += sk->sk_sndbuf;
-+
-+ if (new_sndbuf > sysctl_tcp_wmem[2] || new_sndbuf < 0) {
-+ new_sndbuf = sysctl_tcp_wmem[2];
-+ break;
-+ }
-+ }
-+ meta_sk->sk_sndbuf = max(min(new_sndbuf, sysctl_tcp_wmem[2]), meta_sk->sk_sndbuf);
-+
-+ /* The subflow's call to sk_write_space in tcp_new_space ends up in
-+ * mptcp_write_space.
-+ * It has nothing to do with waking up the application.
-+ * So, we do it here.
-+ */
-+ if (old_sndbuf != meta_sk->sk_sndbuf)
-+ meta_sk->sk_write_space(meta_sk);
-+}
-+
-+void mptcp_close(struct sock *meta_sk, long timeout)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *sk_it, *tmpsk;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ int data_was_unread = 0;
-+ int state;
-+
-+ mptcp_debug("%s: Close of meta_sk with tok %#x\n",
-+ __func__, mpcb->mptcp_loc_token);
-+
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock(meta_sk);
-+
-+ if (meta_tp->inside_tk_table) {
-+ /* Detach the mpcb from the token hashtable */
-+ mptcp_hash_remove_bh(meta_tp);
-+ reqsk_queue_destroy(&inet_csk(meta_sk)->icsk_accept_queue);
-+ }
-+
-+ meta_sk->sk_shutdown = SHUTDOWN_MASK;
-+ /* We need to flush the recv. buffs. We do this only on the
-+ * descriptor close, not protocol-sourced closes, because the
-+ * reader process may not have drained the data yet!
-+ */
-+ while ((skb = __skb_dequeue(&meta_sk->sk_receive_queue)) != NULL) {
-+ u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
-+ tcp_hdr(skb)->fin;
-+ data_was_unread += len;
-+ __kfree_skb(skb);
-+ }
-+
-+ sk_mem_reclaim(meta_sk);
-+
-+ /* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
-+ if (meta_sk->sk_state == TCP_CLOSE) {
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+ mptcp_sub_close(sk_it, 0);
-+ }
-+ goto adjudge_to_death;
-+ }
-+
-+ if (data_was_unread) {
-+ /* Unread data was tossed, zap the connection. */
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONCLOSE);
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ tcp_sk(meta_sk)->ops->send_active_reset(meta_sk,
-+ meta_sk->sk_allocation);
-+ } else if (sock_flag(meta_sk, SOCK_LINGER) && !meta_sk->sk_lingertime) {
-+ /* Check zero linger _after_ checking for unread data. */
-+ meta_sk->sk_prot->disconnect(meta_sk, 0);
-+ NET_INC_STATS_USER(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ } else if (tcp_close_state(meta_sk)) {
-+ mptcp_send_fin(meta_sk);
-+ } else if (meta_tp->snd_una == meta_tp->write_seq) {
-+ /* The DATA_FIN has been sent and acknowledged
-+ * (e.g., by sk_shutdown). Close all the other subflows
-+ */
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ unsigned long delay = 0;
-+ /* If we are the passive closer, don't trigger
-+ * the subflow-FIN until the peer has sent its FIN
-+ * on the subflow - thus we add a delay.
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+
-+ sk_stream_wait_close(meta_sk, timeout);
-+
-+adjudge_to_death:
-+ state = meta_sk->sk_state;
-+ sock_hold(meta_sk);
-+ sock_orphan(meta_sk);
-+
-+ /* socket will be freed after mptcp_close - we have to prevent
-+ * access from the subflows.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ /* Similar to sock_orphan, but we don't set it DEAD, because
-+ * the callbacks are still set and must be called.
-+ */
-+ write_lock_bh(&sk_it->sk_callback_lock);
-+ sk_set_socket(sk_it, NULL);
-+ sk_it->sk_wq = NULL;
-+ write_unlock_bh(&sk_it->sk_callback_lock);
-+ }
-+
-+ /* It is the last release_sock in its life. It will remove backlog. */
-+ release_sock(meta_sk);
-+
-+ /* Now socket is owned by kernel and we acquire BH lock
-+ * to finish close. No need to check for user refs.
-+ */
-+ local_bh_disable();
-+ bh_lock_sock(meta_sk);
-+ WARN_ON(sock_owned_by_user(meta_sk));
-+
-+ percpu_counter_inc(meta_sk->sk_prot->orphan_count);
-+
-+ /* Have we already been destroyed by a softirq or backlog? */
-+ if (state != TCP_CLOSE && meta_sk->sk_state == TCP_CLOSE)
-+ goto out;
-+
-+ /* This is a (useful) BSD violating of the RFC. There is a
-+ * problem with TCP as specified in that the other end could
-+ * keep a socket open forever with no application left this end.
-+ * We use a 3 minute timeout (about the same as BSD) then kill
-+ * our end. If they send after that then tough - BUT: long enough
-+ * that we won't make the old 4*rto = almost no time - whoops
-+ * reset mistake.
-+ *
-+ * Nope, it was not mistake. It is really desired behaviour
-+ * f.e. on http servers, when such sockets are useless, but
-+ * consume significant resources. Let's do it with special
-+ * linger2 option. --ANK
-+ */
-+
-+ if (meta_sk->sk_state == TCP_FIN_WAIT2) {
-+ if (meta_tp->linger2 < 0) {
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONLINGER);
-+ } else {
-+ const int tmo = tcp_fin_time(meta_sk);
-+
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk,
-+ tmo - TCP_TIMEWAIT_LEN);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2,
-+ tmo);
-+ goto out;
-+ }
-+ }
-+ }
-+ if (meta_sk->sk_state != TCP_CLOSE) {
-+ sk_mem_reclaim(meta_sk);
-+ if (tcp_too_many_orphans(meta_sk, 0)) {
-+ if (net_ratelimit())
-+ pr_info("MPTCP: too many orphaned sockets\n");
-+ tcp_set_state(meta_sk, TCP_CLOSE);
-+ meta_tp->ops->send_active_reset(meta_sk, GFP_ATOMIC);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPABORTONMEMORY);
-+ }
-+ }
-+
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ inet_csk_destroy_sock(meta_sk);
-+ /* Otherwise, socket is reprieved until protocol close. */
-+
-+out:
-+ bh_unlock_sock(meta_sk);
-+ local_bh_enable();
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk); /* Taken by sock_hold */
-+}
-+
-+void mptcp_disconnect(struct sock *sk)
-+{
-+ struct sock *subsk, *tmpsk;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ mptcp_delete_synack_timer(sk);
-+
-+ __skb_queue_purge(&tp->mpcb->reinject_queue);
-+
-+ if (tp->inside_tk_table) {
-+ mptcp_hash_remove_bh(tp);
-+ reqsk_queue_destroy(&inet_csk(tp->meta_sk)->icsk_accept_queue);
-+ }
-+
-+ local_bh_disable();
-+ mptcp_for_each_sk_safe(tp->mpcb, subsk, tmpsk) {
-+ /* The socket will get removed from the subsocket-list
-+ * and made non-mptcp by setting mpc to 0.
-+ *
-+ * This is necessary, because tcp_disconnect assumes
-+ * that the connection is completely dead afterwards.
-+ * Thus we need to do a mptcp_del_sock. Due to this call
-+ * we have to make it non-mptcp.
-+ *
-+ * We have to lock the socket, because we set mpc to 0.
-+ * An incoming packet would take the subsocket's lock
-+ * and go on into the receive-path.
-+ * This would be a race.
-+ */
-+
-+ bh_lock_sock(subsk);
-+ mptcp_del_sock(subsk);
-+ tcp_sk(subsk)->mpc = 0;
-+ tcp_sk(subsk)->ops = &tcp_specific;
-+ mptcp_sub_force_close(subsk);
-+ bh_unlock_sock(subsk);
-+ }
-+ local_bh_enable();
-+
-+ tp->was_meta_sk = 1;
-+ tp->mpc = 0;
-+ tp->ops = &tcp_specific;
-+}
-+
-+
-+/* Returns 1 if we should enable MPTCP for that socket. */
-+int mptcp_doit(struct sock *sk)
-+{
-+ /* Do not allow MPTCP enabling if the MPTCP initialization failed */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tcp_sk(sk)->mptcp_enabled)
-+ return 0;
-+
-+ /* Socket may already be established (e.g., called from tcp_recvmsg) */
-+ if (mptcp(tcp_sk(sk)) || tcp_sk(sk)->request_mptcp)
-+ return 1;
-+
-+ /* Don't do mptcp over loopback */
-+ if (sk->sk_family == AF_INET &&
-+ (ipv4_is_loopback(inet_sk(sk)->inet_daddr) ||
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr)))
-+ return 0;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (sk->sk_family == AF_INET6 &&
-+ (ipv6_addr_loopback(&sk->sk_v6_daddr) ||
-+ ipv6_addr_loopback(&inet6_sk(sk)->saddr)))
-+ return 0;
-+#endif
-+ if (mptcp_v6_is_v4_mapped(sk) &&
-+ ipv4_is_loopback(inet_sk(sk)->inet_saddr))
-+ return 0;
-+
-+#ifdef CONFIG_TCP_MD5SIG
-+ /* If TCP_MD5SIG is enabled, do not do MPTCP - there is no Option-Space */
-+ if (tcp_sk(sk)->af_specific->md5_lookup(sk, sk))
-+ return 0;
-+#endif
-+
-+ return 1;
-+}
-+
-+int mptcp_create_master_sk(struct sock *meta_sk, __u64 remote_key, u32 window)
-+{
-+ struct tcp_sock *master_tp;
-+ struct sock *master_sk;
-+
-+ if (mptcp_alloc_mpcb(meta_sk, remote_key, window))
-+ goto err_alloc_mpcb;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+ master_tp = tcp_sk(master_sk);
-+
-+ if (mptcp_add_sock(meta_sk, master_sk, 0, 0, GFP_ATOMIC))
-+ goto err_add_sock;
-+
-+ if (__inet_inherit_port(meta_sk, master_sk) < 0)
-+ goto err_add_sock;
-+
-+ meta_sk->sk_prot->unhash(meta_sk);
-+
-+ if (master_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(master_sk))
-+ __inet_hash_nolisten(master_sk, NULL);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else
-+ __inet6_hash(master_sk, NULL);
-+#endif
-+
-+ master_tp->mptcp->init_rcv_wnd = master_tp->rcv_wnd;
-+
-+ return 0;
-+
-+err_add_sock:
-+ mptcp_fallback_meta_sk(meta_sk);
-+
-+ inet_csk_prepare_forced_close(master_sk);
-+ tcp_done(master_sk);
-+ inet_csk_prepare_forced_close(meta_sk);
-+ tcp_done(meta_sk);
-+
-+err_alloc_mpcb:
-+ return -ENOBUFS;
-+}
-+
-+static int __mptcp_check_req_master(struct sock *child,
-+ struct request_sock *req)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct sock *meta_sk = child;
-+ struct mptcp_cb *mpcb;
-+ struct mptcp_request_sock *mtreq;
-+
-+ /* Never contained an MP_CAPABLE */
-+ if (!inet_rsk(req)->mptcp_rqsk)
-+ return 1;
-+
-+ if (!inet_rsk(req)->saw_mpc) {
-+ /* Fallback to regular TCP, because we saw one SYN without
-+ * MP_CAPABLE. In tcp_check_req we continue the regular path.
-+ * But, the socket has been added to the reqsk_tk_htb, so we
-+ * must still remove it.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+ return 1;
-+ }
-+
-+ /* Just set these values to pass them to mptcp_alloc_mpcb */
-+ mtreq = mptcp_rsk(req);
-+ child_tp->mptcp_loc_key = mtreq->mptcp_loc_key;
-+ child_tp->mptcp_loc_token = mtreq->mptcp_loc_token;
-+
-+ if (mptcp_create_master_sk(meta_sk, mtreq->mptcp_rem_key,
-+ child_tp->snd_wnd))
-+ return -ENOBUFS;
-+
-+ child = tcp_sk(child)->mpcb->master_sk;
-+ child_tp = tcp_sk(child);
-+ mpcb = child_tp->mpcb;
-+
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+
-+ mpcb->dss_csum = mtreq->dss_csum;
-+ mpcb->server_side = 1;
-+
-+ /* Will be moved to ESTABLISHED by tcp_rcv_state_process() */
-+ mptcp_update_metasocket(child, meta_sk);
-+
-+ /* Needs to be done here additionally, because when accepting a
-+ * new connection we pass by __reqsk_free and not reqsk_free.
-+ */
-+ mptcp_reqsk_remove_tk(req);
-+
-+ /* Hold when creating the meta-sk in tcp_vX_syn_recv_sock. */
-+ sock_put(meta_sk);
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_fastopen(struct sock *child, struct request_sock *req)
-+{
-+ struct sock *meta_sk = child, *master_sk;
-+ struct sk_buff *skb;
-+ u32 new_mapping;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ master_sk = tcp_sk(meta_sk)->mpcb->master_sk;
-+
-+ /* We need to rewind copied_seq as it is set to IDSN + 1 and as we have
-+ * pre-MPTCP data in the receive queue.
-+ */
-+ tcp_sk(meta_sk)->copied_seq -= tcp_sk(master_sk)->rcv_nxt -
-+ tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* Map subflow sequence number to data sequence numbers. We need to map
-+ * these data to [IDSN - len - 1, IDSN[.
-+ */
-+ new_mapping = tcp_sk(meta_sk)->copied_seq - tcp_rsk(req)->rcv_isn - 1;
-+
-+ /* There should be only one skb: the SYN + data. */
-+ skb_queue_walk(&meta_sk->sk_receive_queue, skb) {
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* With fastopen we change the semantics of the relative subflow
-+ * sequence numbers to deal with middleboxes that could add/remove
-+ * multiple bytes in the SYN. We chose to start counting at rcv_nxt - 1
-+ * instead of the regular TCP ISN.
-+ */
-+ tcp_sk(master_sk)->mptcp->rcv_isn = tcp_sk(master_sk)->rcv_nxt - 1;
-+
-+ /* We need to update copied_seq of the master_sk to account for the
-+ * already moved data to the meta receive queue.
-+ */
-+ tcp_sk(master_sk)->copied_seq = tcp_sk(master_sk)->rcv_nxt;
-+
-+ /* Handled by the master_sk */
-+ tcp_sk(meta_sk)->fastopen_rsk = NULL;
-+
-+ return 0;
-+}
-+
-+int mptcp_check_req_master(struct sock *sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev)
-+{
-+ struct sock *meta_sk = child;
-+ int ret;
-+
-+ ret = __mptcp_check_req_master(child, req);
-+ if (ret)
-+ return ret;
-+
-+ inet_csk_reqsk_queue_unlink(sk, req, prev);
-+ inet_csk_reqsk_queue_removed(sk, req);
-+ inet_csk_reqsk_queue_add(sk, req, meta_sk);
-+
-+ return 0;
-+}
-+
-+struct sock *mptcp_check_req_child(struct sock *meta_sk, struct sock *child,
-+ struct request_sock *req,
-+ struct request_sock **prev,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *child_tp = tcp_sk(child);
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ u8 hash_mac_check[20];
-+
-+ child_tp->inside_tk_table = 0;
-+
-+ if (!mopt->join_ack)
-+ goto teardown;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mtreq->mptcp_rem_nonce,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+
-+ if (memcmp(hash_mac_check, (char *)&mopt->mptcp_recv_mac, 20))
-+ goto teardown;
-+
-+ /* Point it to the same struct socket and wq as the meta_sk */
-+ sk_set_socket(child, meta_sk->sk_socket);
-+ child->sk_wq = meta_sk->sk_wq;
-+
-+ if (mptcp_add_sock(meta_sk, child, mtreq->loc_id, mtreq->rem_id, GFP_ATOMIC)) {
-+ /* Has been inherited, but now child_tp->mptcp is NULL */
-+ child_tp->mpc = 0;
-+ child_tp->ops = &tcp_specific;
-+
-+ /* TODO when we support acking the third ack for new subflows,
-+ * we should silently discard this third ack, by returning NULL.
-+ *
-+ * Maybe, at the retransmission we will have enough memory to
-+ * fully add the socket to the meta-sk.
-+ */
-+ goto teardown;
-+ }
-+
-+ /* The child is a clone of the meta socket, we must now reset
-+ * some of the fields
-+ */
-+ child_tp->mptcp->rcv_low_prio = mtreq->rcv_low_prio;
-+
-+ /* We should allow proper increase of the snd/rcv-buffers. Thus, we
-+ * use the original values instead of the bloated up ones from the
-+ * clone.
-+ */
-+ child->sk_sndbuf = mpcb->orig_sk_sndbuf;
-+ child->sk_rcvbuf = mpcb->orig_sk_rcvbuf;
-+
-+ child_tp->mptcp->slave_sk = 1;
-+ child_tp->mptcp->snt_isn = tcp_rsk(req)->snt_isn;
-+ child_tp->mptcp->rcv_isn = tcp_rsk(req)->rcv_isn;
-+ child_tp->mptcp->init_rcv_wnd = req->rcv_wnd;
-+
-+ child_tp->tsq_flags = 0;
-+
-+ /* Subflows do not use the accept queue, as they
-+ * are attached immediately to the mpcb.
-+ */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ return child;
-+
-+teardown:
-+ /* Drop this request - sock creation failed. */
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue, req);
-+ reqsk_free(req);
-+ inet_csk_prepare_forced_close(child);
-+ tcp_done(child);
-+ return meta_sk;
-+}
-+
-+int mptcp_init_tw_sock(struct sock *sk, struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_tw *mptw;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ /* A subsocket in tw can only receive data. So, if we are in
-+ * infinite-receive, then we should not reply with a data-ack or act
-+ * upon general MPTCP-signaling. We prevent this by simply not creating
-+ * the mptcp_tw_sock.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tw->mptcp_tw = NULL;
-+ return 0;
-+ }
-+
-+ /* Alloc MPTCP-tw-sock */
-+ mptw = kmem_cache_alloc(mptcp_tw_cache, GFP_ATOMIC);
-+ if (!mptw)
-+ return -ENOBUFS;
-+
-+ atomic_inc(&mpcb->mpcb_refcnt);
-+
-+ tw->mptcp_tw = mptw;
-+ mptw->loc_key = mpcb->mptcp_loc_key;
-+ mptw->meta_tw = mpcb->in_time_wait;
-+ if (mptw->meta_tw) {
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(mptcp_meta_tp(tp));
-+ if (mpcb->mptw_state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_assign_pointer(mptw->mpcb, mpcb);
-+
-+ spin_lock(&mpcb->tw_lock);
-+ list_add_rcu(&mptw->list, &tp->mpcb->tw_list);
-+ mptw->in_list = 1;
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ return 0;
-+}
-+
-+void mptcp_twsk_destructor(struct tcp_timewait_sock *tw)
-+{
-+ struct mptcp_cb *mpcb;
-+
-+ rcu_read_lock();
-+ mpcb = rcu_dereference(tw->mptcp_tw->mpcb);
-+
-+ /* If we are still holding a ref to the mpcb, we have to remove ourself
-+ * from the list and drop the ref properly.
-+ */
-+ if (mpcb && atomic_inc_not_zero(&mpcb->mpcb_refcnt)) {
-+ spin_lock(&mpcb->tw_lock);
-+ if (tw->mptcp_tw->in_list) {
-+ list_del_rcu(&tw->mptcp_tw->list);
-+ tw->mptcp_tw->in_list = 0;
-+ }
-+ spin_unlock(&mpcb->tw_lock);
-+
-+ /* Twice, because we increased it above */
-+ mptcp_mpcb_put(mpcb);
-+ mptcp_mpcb_put(mpcb);
-+ }
-+
-+ rcu_read_unlock();
-+
-+ kmem_cache_free(mptcp_tw_cache, tw->mptcp_tw);
-+}
-+
-+/* Updates the rcv_nxt of the time-wait-socks and allows them to ack a
-+ * data-fin.
-+ */
-+void mptcp_time_wait(struct sock *sk, int state, int timeo)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_tw *mptw;
-+
-+ /* Used for sockets that go into tw after the meta
-+ * (see mptcp_init_tw_sock())
-+ */
-+ tp->mpcb->in_time_wait = 1;
-+ tp->mpcb->mptw_state = state;
-+
-+ /* Update the time-wait-sock's information */
-+ rcu_read_lock_bh();
-+ list_for_each_entry_rcu(mptw, &tp->mpcb->tw_list, list) {
-+ mptw->meta_tw = 1;
-+ mptw->rcv_nxt = mptcp_get_rcv_nxt_64(tp);
-+
-+ /* We want to ack a DATA_FIN, but are still in FIN_WAIT_2 -
-+ * pretend as if the DATA_FIN has already reached us, so that
-+ * the checks in tcp_timewait_state_process will succeed when
-+ * the DATA_FIN comes in.
-+ */
-+ if (state != TCP_TIME_WAIT)
-+ mptw->rcv_nxt++;
-+ }
-+ rcu_read_unlock_bh();
-+
-+ tcp_done(sk);
-+}
-+
-+void mptcp_tsq_flags(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ /* It will be handled as a regular deferred-call */
-+ if (is_meta_sk(sk))
-+ return;
-+
-+ if (hlist_unhashed(&tp->mptcp->cb_list)) {
-+ hlist_add_head(&tp->mptcp->cb_list, &tp->mpcb->callback_list);
-+ /* We need to hold it here, as the sock_hold is not assured
-+ * by the release_sock as it is done in regular TCP.
-+ *
-+ * The subsocket may get inet_csk_destroy'd while it is inside
-+ * the callback_list.
-+ */
-+ sock_hold(sk);
-+ }
-+
-+ if (!test_and_set_bit(MPTCP_SUB_DEFERRED, &tcp_sk(meta_sk)->tsq_flags))
-+ sock_hold(meta_sk);
-+}
-+
-+void mptcp_tsq_sub_deferred(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_tcp_sock *mptcp;
-+ struct hlist_node *tmp;
-+
-+ BUG_ON(!is_meta_sk(meta_sk) && !meta_tp->was_meta_sk);
-+
-+ __sock_put(meta_sk);
-+ hlist_for_each_entry_safe(mptcp, tmp, &meta_tp->mpcb->callback_list, cb_list) {
-+ struct tcp_sock *tp = mptcp->tp;
-+ struct sock *sk = (struct sock *)tp;
-+
-+ hlist_del_init(&mptcp->cb_list);
-+ sk->sk_prot->release_cb(sk);
-+ /* Final sock_put (cfr. mptcp_tsq_flags) */
-+ sock_put(sk);
-+ }
-+}
-+
-+void mptcp_join_reqsk_init(struct mptcp_cb *mpcb, const struct request_sock *req,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_options_received mopt;
-+ u8 mptcp_hash_mac[20];
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mtreq = mptcp_rsk(req);
-+ mtreq->mptcp_mpcb = mpcb;
-+ mtreq->is_sub = 1;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+
-+ mtreq->mptcp_rem_nonce = mopt.mptcp_recv_nonce;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mtreq->mptcp_loc_nonce,
-+ (u8 *)&mtreq->mptcp_rem_nonce, (u32 *)mptcp_hash_mac);
-+ mtreq->mptcp_hash_tmac = *(u64 *)mptcp_hash_mac;
-+
-+ mtreq->rem_id = mopt.rem_id;
-+ mtreq->rcv_low_prio = mopt.low_prio;
-+ inet_rsk(req)->saw_mpc = 1;
-+}
-+
-+void mptcp_reqsk_init(struct request_sock *req, const struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ struct mptcp_request_sock *mreq = mptcp_rsk(req);
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ mreq->is_sub = 0;
-+ inet_rsk(req)->mptcp_rqsk = 1;
-+ mreq->dss_csum = mopt.dss_csum;
-+ mreq->hash_entry.pprev = NULL;
-+
-+ mptcp_reqsk_new_mptcp(req, &mopt, skb);
-+}
-+
-+int mptcp_conn_request(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct mptcp_options_received mopt;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ __u32 isn = TCP_SKB_CB(skb)->when;
-+ bool want_cookie = false;
-+
-+ if ((sysctl_tcp_syncookies == 2 ||
-+ inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-+ want_cookie = tcp_syn_flood_action(sk, skb,
-+ mptcp_request_sock_ops.slab_name);
-+ if (!want_cookie)
-+ goto drop;
-+ }
-+
-+ mptcp_init_mp_opt(&mopt);
-+ tcp_parse_mptcp_options(skb, &mopt);
-+
-+ if (mopt.is_mp_join)
-+ return mptcp_do_join_short(skb, &mopt, sock_net(sk));
-+ if (mopt.drop_me)
-+ goto drop;
-+
-+ if (sysctl_mptcp_enabled == MPTCP_APP && !tp->mptcp_enabled)
-+ mopt.saw_mpc = 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (skb_rtable(skb)->rt_flags &
-+ (RTCF_BROADCAST | RTCF_MULTICAST))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_request_sock_ipv4_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v4_conn_request(sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ if (mopt.saw_mpc && !want_cookie) {
-+ if (!ipv6_unicast_destination(skb))
-+ goto drop;
-+
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_request_sock_ipv6_ops,
-+ sk, skb);
-+ }
-+
-+ return tcp_v6_conn_request(sk, skb);
-+#endif
-+ }
-+drop:
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
-+ return 0;
-+}
-+
-+struct workqueue_struct *mptcp_wq;
-+EXPORT_SYMBOL(mptcp_wq);
-+
-+/* Output /proc/net/mptcp */
-+static int mptcp_pm_seq_show(struct seq_file *seq, void *v)
-+{
-+ struct tcp_sock *meta_tp;
-+ const struct net *net = seq->private;
-+ int i, n = 0;
-+
-+ seq_printf(seq, " sl loc_tok rem_tok v6 local_address remote_address st ns tx_queue rx_queue inode");
-+ seq_putc(seq, '\n');
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ struct hlist_nulls_node *node;
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node,
-+ &tk_hashtable[i], tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp;
-+ struct inet_sock *isk = inet_sk(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !net_eq(net, sock_net(meta_sk)))
-+ continue;
-+
-+ if (capable(CAP_NET_ADMIN)) {
-+ seq_printf(seq, "%4d: %04X %04X ", n++,
-+ mpcb->mptcp_loc_token,
-+ mpcb->mptcp_rem_token);
-+ } else {
-+ seq_printf(seq, "%4d: %04X %04X ", n++, -1, -1);
-+ }
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ seq_printf(seq, " 0 %08X:%04X %08X:%04X ",
-+ isk->inet_rcv_saddr,
-+ ntohs(isk->inet_sport),
-+ isk->inet_daddr,
-+ ntohs(isk->inet_dport));
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (meta_sk->sk_family == AF_INET6) {
-+ struct in6_addr *src = &meta_sk->sk_v6_rcv_saddr;
-+ struct in6_addr *dst = &meta_sk->sk_v6_daddr;
-+ seq_printf(seq, " 1 %08X%08X%08X%08X:%04X %08X%08X%08X%08X:%04X",
-+ src->s6_addr32[0], src->s6_addr32[1],
-+ src->s6_addr32[2], src->s6_addr32[3],
-+ ntohs(isk->inet_sport),
-+ dst->s6_addr32[0], dst->s6_addr32[1],
-+ dst->s6_addr32[2], dst->s6_addr32[3],
-+ ntohs(isk->inet_dport));
-+#endif
-+ }
-+ seq_printf(seq, " %02X %02X %08X:%08X %lu",
-+ meta_sk->sk_state, mpcb->cnt_subflows,
-+ meta_tp->write_seq - meta_tp->snd_una,
-+ max_t(int, meta_tp->rcv_nxt -
-+ meta_tp->copied_seq, 0),
-+ sock_i_ino(meta_sk));
-+ seq_putc(seq, '\n');
-+ }
-+
-+ rcu_read_unlock_bh();
-+ }
-+
-+ return 0;
-+}
-+
-+static int mptcp_pm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_pm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_pm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_pm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_pm_init_net(struct net *net)
-+{
-+ if (!proc_create("mptcp", S_IRUGO, net->proc_net, &mptcp_pm_seq_fops))
-+ return -ENOMEM;
-+
-+ return 0;
-+}
-+
-+static void mptcp_pm_exit_net(struct net *net)
-+{
-+ remove_proc_entry("mptcp", net->proc_net);
-+}
-+
-+static struct pernet_operations mptcp_pm_proc_ops = {
-+ .init = mptcp_pm_init_net,
-+ .exit = mptcp_pm_exit_net,
-+};
-+
-+/* General initialization of mptcp */
-+void __init mptcp_init(void)
-+{
-+ int i;
-+ struct ctl_table_header *mptcp_sysctl;
-+
-+ mptcp_sock_cache = kmem_cache_create("mptcp_sock",
-+ sizeof(struct mptcp_tcp_sock),
-+ 0, SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_sock_cache)
-+ goto mptcp_sock_cache_failed;
-+
-+ mptcp_cb_cache = kmem_cache_create("mptcp_cb", sizeof(struct mptcp_cb),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_cb_cache)
-+ goto mptcp_cb_cache_failed;
-+
-+ mptcp_tw_cache = kmem_cache_create("mptcp_tw", sizeof(struct mptcp_tw),
-+ 0, SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+ if (!mptcp_tw_cache)
-+ goto mptcp_tw_cache_failed;
-+
-+ get_random_bytes(mptcp_secret, sizeof(mptcp_secret));
-+
-+ mptcp_wq = alloc_workqueue("mptcp_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 8);
-+ if (!mptcp_wq)
-+ goto alloc_workqueue_failed;
-+
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ INIT_HLIST_NULLS_HEAD(&tk_hashtable[i], i);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_htb[i],
-+ i + MPTCP_REQSK_NULLS_BASE);
-+ INIT_HLIST_NULLS_HEAD(&mptcp_reqsk_tk_htb[i], i);
-+ }
-+
-+ spin_lock_init(&mptcp_reqsk_hlock);
-+ spin_lock_init(&mptcp_tk_hashlock);
-+
-+ if (register_pernet_subsys(&mptcp_pm_proc_ops))
-+ goto pernet_failed;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ if (mptcp_pm_v6_init())
-+ goto mptcp_pm_v6_failed;
-+#endif
-+ if (mptcp_pm_v4_init())
-+ goto mptcp_pm_v4_failed;
-+
-+ mptcp_sysctl = register_net_sysctl(&init_net, "net/mptcp", mptcp_table);
-+ if (!mptcp_sysctl)
-+ goto register_sysctl_failed;
-+
-+ if (mptcp_register_path_manager(&mptcp_pm_default))
-+ goto register_pm_failed;
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_default))
-+ goto register_sched_failed;
-+
-+ pr_info("MPTCP: Stable release v0.89.0-rc");
-+
-+ mptcp_init_failed = false;
-+
-+ return;
-+
-+register_sched_failed:
-+ mptcp_unregister_path_manager(&mptcp_pm_default);
-+register_pm_failed:
-+ unregister_net_sysctl_table(mptcp_sysctl);
-+register_sysctl_failed:
-+ mptcp_pm_v4_undo();
-+mptcp_pm_v4_failed:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_pm_v6_undo();
-+mptcp_pm_v6_failed:
-+#endif
-+ unregister_pernet_subsys(&mptcp_pm_proc_ops);
-+pernet_failed:
-+ destroy_workqueue(mptcp_wq);
-+alloc_workqueue_failed:
-+ kmem_cache_destroy(mptcp_tw_cache);
-+mptcp_tw_cache_failed:
-+ kmem_cache_destroy(mptcp_cb_cache);
-+mptcp_cb_cache_failed:
-+ kmem_cache_destroy(mptcp_sock_cache);
-+mptcp_sock_cache_failed:
-+ mptcp_init_failed = true;
-+}
-diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
-new file mode 100644
-index 000000000000..3a54413ce25b
---- /dev/null
-+++ b/net/mptcp/mptcp_fullmesh.c
-@@ -0,0 +1,1722 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#include <net/addrconf.h>
-+#endif
-+
-+enum {
-+ MPTCP_EVENT_ADD = 1,
-+ MPTCP_EVENT_DEL,
-+ MPTCP_EVENT_MOD,
-+};
-+
-+#define MPTCP_SUBFLOW_RETRY_DELAY 1000
-+
-+/* Max number of local or remote addresses we can store.
-+ * When changing, see the bitfield below in fullmesh_rem4/6.
-+ */
-+#define MPTCP_MAX_ADDR 8
-+
-+struct fullmesh_rem4 {
-+ u8 rem4_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in_addr addr;
-+};
-+
-+struct fullmesh_rem6 {
-+ u8 rem6_id;
-+ u8 bitfield;
-+ u8 retry_bitfield;
-+ __be16 port;
-+ struct in6_addr addr;
-+};
-+
-+struct mptcp_loc_addr {
-+ struct mptcp_loc4 locaddr4[MPTCP_MAX_ADDR];
-+ u8 loc4_bits;
-+ u8 next_v4_index;
-+
-+ struct mptcp_loc6 locaddr6[MPTCP_MAX_ADDR];
-+ u8 loc6_bits;
-+ u8 next_v6_index;
-+};
-+
-+struct mptcp_addr_event {
-+ struct list_head list;
-+ unsigned short family;
-+ u8 code:7,
-+ low_prio:1;
-+ union inet_addr addr;
-+};
-+
-+struct fullmesh_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+ /* Delayed worker, when the routing-tables are not yet ready. */
-+ struct delayed_work subflow_retry_work;
-+
-+ /* Remote addresses */
-+ struct fullmesh_rem4 remaddr4[MPTCP_MAX_ADDR];
-+ struct fullmesh_rem6 remaddr6[MPTCP_MAX_ADDR];
-+
-+ struct mptcp_cb *mpcb;
-+
-+ u16 remove_addrs; /* Addresses to remove */
-+ u8 announced_addrs_v4; /* IPv4 Addresses we did announce */
-+ u8 announced_addrs_v6; /* IPv6 Addresses we did announce */
-+
-+ u8 add_addr; /* Are we sending an add_addr? */
-+
-+ u8 rem4_bits;
-+ u8 rem6_bits;
-+};
-+
-+struct mptcp_fm_ns {
-+ struct mptcp_loc_addr __rcu *local;
-+ spinlock_t local_lock; /* Protecting the above pointer */
-+ struct list_head events;
-+ struct delayed_work address_worker;
-+
-+ struct net *net;
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly;
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk);
-+
-+static struct mptcp_fm_ns *fm_get_ns(const struct net *net)
-+{
-+ return (struct mptcp_fm_ns *)net->mptcp.path_managers[MPTCP_PM_FULLMESH];
-+}
-+
-+static struct fullmesh_priv *fullmesh_get_priv(const struct mptcp_cb *mpcb)
-+{
-+ return (struct fullmesh_priv *)&mpcb->mptcp_pm[0];
-+}
-+
-+/* Find the first free index in the bitfield */
-+static int __mptcp_find_free_index(u8 bitfield, u8 base)
-+{
-+ int i;
-+
-+ /* There are no free bits anyway... */
-+ if (bitfield == 0xff)
-+ goto exit;
-+
-+ i = ffs(~(bitfield >> base)) - 1;
-+ if (i < 0)
-+ goto exit;
-+
-+ /* No free bits when starting at base, try from 0 on */
-+ if (i + base >= sizeof(bitfield) * 8)
-+ return __mptcp_find_free_index(bitfield, 0);
-+
-+ return i + base;
-+exit:
-+ return -1;
-+}
-+
-+static int mptcp_find_free_index(u8 bitfield)
-+{
-+ return __mptcp_find_free_index(bitfield, 0);
-+}
-+
-+static void mptcp_addv4_raddr(struct mptcp_cb *mpcb,
-+ const struct in_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem4 *rem4;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem4->rem4_id == id &&
-+ rem4->addr.s_addr == addr->s_addr && rem4->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * our box sees it.
-+ */
-+ if (rem4->rem4_id == id && rem4->addr.s_addr != addr->s_addr) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr:%pI4 to addr %pI4 with id:%d\n",
-+ __func__, &rem4->addr.s_addr,
-+ &addr->s_addr, id);
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem4_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI4\n",
-+ __func__, MPTCP_MAX_ADDR, &addr->s_addr);
-+ return;
-+ }
-+
-+ rem4 = &fmp->remaddr4[i];
-+
-+ /* Address is not known yet, store it */
-+ rem4->addr.s_addr = addr->s_addr;
-+ rem4->port = port;
-+ rem4->bitfield = 0;
-+ rem4->retry_bitfield = 0;
-+ rem4->rem4_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem4_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_addv6_raddr(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr,
-+ __be16 port, u8 id)
-+{
-+ int i;
-+ struct fullmesh_rem6 *rem6;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is already in the list --- continue */
-+ if (rem6->rem6_id == id &&
-+ ipv6_addr_equal(&rem6->addr, addr) && rem6->port == port)
-+ return;
-+
-+ /* This may be the case when the peer is behind a NAT. It is
-+ * trying to JOIN, thus sending the JOIN with a certain ID.
-+ * However, the src_addr of the IP packet has been changed. We
-+ * update the addr in the list, because this is the address as
-+ * our box sees it.
-+ */
-+ if (rem6->rem6_id == id) {
-+ /* update the address */
-+ mptcp_debug("%s: updating old addr: %pI6 to addr %pI6 with id:%d\n",
-+ __func__, &rem6->addr, addr, id);
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ mpcb->list_rcvd = 1;
-+ return;
-+ }
-+ }
-+
-+ i = mptcp_find_free_index(fmp->rem6_bits);
-+ /* Do we already have the maximum number of local/remote addresses? */
-+ if (i < 0) {
-+ mptcp_debug("%s: At max num of remote addresses: %d --- not adding address: %pI6\n",
-+ __func__, MPTCP_MAX_ADDR, addr);
-+ return;
-+ }
-+
-+ rem6 = &fmp->remaddr6[i];
-+
-+ /* Address is not known yet, store it */
-+ rem6->addr = *addr;
-+ rem6->port = port;
-+ rem6->bitfield = 0;
-+ rem6->retry_bitfield = 0;
-+ rem6->rem6_id = id;
-+ mpcb->list_rcvd = 1;
-+ fmp->rem6_bits |= (1 << i);
-+
-+ return;
-+}
-+
-+static void mptcp_v4_rem_raddress(struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].rem4_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem4_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+static void mptcp_v6_rem_raddress(const struct mptcp_cb *mpcb, u8 id)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (fmp->remaddr6[i].rem6_id == id) {
-+ /* remove address from bitfield */
-+ fmp->rem6_bits &= ~(1 << i);
-+
-+ break;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v4_set_init_addr_bit(const struct mptcp_cb *mpcb,
-+ const struct in_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ if (fmp->remaddr4[i].addr.s_addr == addr->s_addr) {
-+ fmp->remaddr4[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+/* Sets the bitfield of the remote-address field */
-+static void mptcp_v6_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const struct in6_addr *addr, u8 index)
-+{
-+ int i;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ if (ipv6_addr_equal(&fmp->remaddr6[i].addr, addr)) {
-+ fmp->remaddr6[i].bitfield |= (1 << index);
-+ return;
-+ }
-+ }
-+}
-+
-+static void mptcp_set_init_addr_bit(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_v4_set_init_addr_bit(mpcb, &addr->in, id);
-+ else
-+ mptcp_v6_set_init_addr_bit(mpcb, &addr->in6, id);
-+}
-+
-+static void retry_subflow_worker(struct work_struct *work)
-+{
-+ struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct fullmesh_priv *fmp = container_of(delayed_work,
-+ struct fullmesh_priv,
-+ subflow_retry_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem = &fmp->remaddr4[i];
-+ /* Do we need to retry establishing a subflow ? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i], &rem4);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem = &fmp->remaddr6[i];
-+
-+ /* Do we need to retry establishing a subflow ? */
-+ if (rem->retry_bitfield) {
-+ int i = mptcp_find_free_index(~rem->retry_bitfield);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+ rem->retry_bitfield &= ~(1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i], &rem6);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+/**
-+ * Create all new subflows by calling mptcp_initX_subsockets
-+ *
-+ * This function uses a goto to next_subflow to allow releasing the lock
-+ * between new subflows, giving other processes a chance to do some work
-+ * on the socket and potentially finish the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ struct fullmesh_priv *fmp = container_of(work, struct fullmesh_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = fmp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int iter = 0, retry = 0;
-+ int i;
-+
-+ /* We need a local (stable) copy of the address-list. Really, it is not
-+ * such a big deal if the address-list is not 100% up-to-date.
-+ */
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local), GFP_ATOMIC);
-+ rcu_read_unlock_bh();
-+
-+ if (!mptcp_local)
-+ return;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ struct fullmesh_rem4 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr4[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc4_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem4 rem4;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem4.addr = rem->addr;
-+ rem4.port = rem->port;
-+ rem4.rem4_id = rem->rem4_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init4_subsockets(meta_sk, &mptcp_local->locaddr4[i],
-+ &rem4) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ struct fullmesh_rem6 *rem;
-+ u8 remaining_bits;
-+
-+ rem = &fmp->remaddr6[i];
-+ remaining_bits = ~(rem->bitfield) & mptcp_local->loc6_bits;
-+
-+ /* Are there still combinations to handle? */
-+ if (remaining_bits) {
-+ int i = mptcp_find_free_index(~remaining_bits);
-+ struct mptcp_rem6 rem6;
-+
-+ rem->bitfield |= (1 << i);
-+
-+ rem6.addr = rem->addr;
-+ rem6.port = rem->port;
-+ rem6.rem6_id = rem->rem6_id;
-+
-+ /* If a route is not yet available then retry once */
-+ if (mptcp_init6_subsockets(meta_sk, &mptcp_local->locaddr6[i],
-+ &rem6) == -ENETUNREACH)
-+ retry = rem->retry_bitfield |= (1 << i);
-+ goto next_subflow;
-+ }
-+ }
-+#endif
-+
-+ if (retry && !delayed_work_pending(&fmp->subflow_retry_work)) {
-+ sock_hold(meta_sk);
-+ queue_delayed_work(mptcp_wq, &fmp->subflow_retry_work,
-+ msecs_to_jiffies(MPTCP_SUBFLOW_RETRY_DELAY));
-+ }
-+
-+exit:
-+ kfree(mptcp_local);
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void announce_remove_addr(u8 addr_id, struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct sock *sk = mptcp_select_ack_sock(meta_sk);
-+
-+ fmp->remove_addrs |= (1 << addr_id);
-+ mpcb->addr_signal = 1;
-+
-+ if (sk)
-+ tcp_send_ack(sk);
-+}
-+
-+static void update_addr_bitfields(struct sock *meta_sk,
-+ const struct mptcp_loc_addr *mptcp_local)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ int i;
-+
-+ /* The bits in announced_addrs_* always match with loc*_bits. So, a
-+ * simple & operation unsets the correct bits, because these go from
-+ * announced to non-announced.
-+ */
-+ fmp->announced_addrs_v4 &= mptcp_local->loc4_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem4_bits, i) {
-+ fmp->remaddr4[i].bitfield &= mptcp_local->loc4_bits;
-+ fmp->remaddr4[i].retry_bitfield &= mptcp_local->loc4_bits;
-+ }
-+
-+ fmp->announced_addrs_v6 &= mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(fmp->rem6_bits, i) {
-+ fmp->remaddr6[i].bitfield &= mptcp_local->loc6_bits;
-+ fmp->remaddr6[i].retry_bitfield &= mptcp_local->loc6_bits;
-+ }
-+}
-+
-+static int mptcp_find_address(const struct mptcp_loc_addr *mptcp_local,
-+ sa_family_t family, const union inet_addr *addr)
-+{
-+ int i;
-+ u8 loc_bits;
-+ bool found = false;
-+
-+ if (family == AF_INET)
-+ loc_bits = mptcp_local->loc4_bits;
-+ else
-+ loc_bits = mptcp_local->loc6_bits;
-+
-+ mptcp_for_each_bit_set(loc_bits, i) {
-+ if (family == AF_INET &&
-+ mptcp_local->locaddr4[i].addr.s_addr == addr->in.s_addr) {
-+ found = true;
-+ break;
-+ }
-+ if (family == AF_INET6 &&
-+ ipv6_addr_equal(&mptcp_local->locaddr6[i].addr,
-+ &addr->in6)) {
-+ found = true;
-+ break;
-+ }
-+ }
-+
-+ if (!found)
-+ return -1;
-+
-+ return i;
-+}
-+
-+static void mptcp_address_worker(struct work_struct *work)
-+{
-+ const struct delayed_work *delayed_work = container_of(work,
-+ struct delayed_work,
-+ work);
-+ struct mptcp_fm_ns *fm_ns = container_of(delayed_work,
-+ struct mptcp_fm_ns,
-+ address_worker);
-+ struct net *net = fm_ns->net;
-+ struct mptcp_addr_event *event = NULL;
-+ struct mptcp_loc_addr *mptcp_local, *old;
-+ int i, id = -1; /* id is used in the socket-code on a delete-event */
-+ bool success; /* Used to indicate if we succeeded handling the event */
-+
-+next_event:
-+ success = false;
-+ kfree(event);
-+
-+ /* First, let's dequeue an event from our event-list */
-+ rcu_read_lock_bh();
-+ spin_lock(&fm_ns->local_lock);
-+
-+ event = list_first_entry_or_null(&fm_ns->events,
-+ struct mptcp_addr_event, list);
-+ if (!event) {
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+ return;
-+ }
-+
-+ list_del(&event->list);
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ id = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+
-+ /* Not in the list - so we don't care */
-+ if (id < 0) {
-+ mptcp_debug("%s could not find id\n", __func__);
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET)
-+ mptcp_local->loc4_bits &= ~(1 << id);
-+ else
-+ mptcp_local->loc6_bits &= ~(1 << id);
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ } else {
-+ int i = mptcp_find_address(mptcp_local, event->family, &event->addr);
-+ int j = i;
-+
-+ if (j < 0) {
-+ /* Not in the list, so we have to find an empty slot */
-+ if (event->family == AF_INET)
-+ i = __mptcp_find_free_index(mptcp_local->loc4_bits,
-+ mptcp_local->next_v4_index);
-+ if (event->family == AF_INET6)
-+ i = __mptcp_find_free_index(mptcp_local->loc6_bits,
-+ mptcp_local->next_v6_index);
-+
-+ if (i < 0) {
-+ mptcp_debug("%s no more space\n", __func__);
-+ goto duno;
-+ }
-+
-+ /* It might have been a MOD-event. */
-+ event->code = MPTCP_EVENT_ADD;
-+ } else {
-+ /* Let's check if anything changes */
-+ if (event->family == AF_INET &&
-+ event->low_prio == mptcp_local->locaddr4[i].low_prio)
-+ goto duno;
-+
-+ if (event->family == AF_INET6 &&
-+ event->low_prio == mptcp_local->locaddr6[i].low_prio)
-+ goto duno;
-+ }
-+
-+ old = mptcp_local;
-+ mptcp_local = kmemdup(mptcp_local, sizeof(*mptcp_local),
-+ GFP_ATOMIC);
-+ if (!mptcp_local)
-+ goto duno;
-+
-+ if (event->family == AF_INET) {
-+ mptcp_local->locaddr4[i].addr.s_addr = event->addr.in.s_addr;
-+ mptcp_local->locaddr4[i].loc4_id = i + 1;
-+ mptcp_local->locaddr4[i].low_prio = event->low_prio;
-+ } else {
-+ mptcp_local->locaddr6[i].addr = event->addr.in6;
-+ mptcp_local->locaddr6[i].loc6_id = i + MPTCP_MAX_ADDR;
-+ mptcp_local->locaddr6[i].low_prio = event->low_prio;
-+ }
-+
-+ if (j < 0) {
-+ if (event->family == AF_INET) {
-+ mptcp_local->loc4_bits |= (1 << i);
-+ mptcp_local->next_v4_index = i + 1;
-+ } else {
-+ mptcp_local->loc6_bits |= (1 << i);
-+ mptcp_local->next_v6_index = i + 1;
-+ }
-+ }
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ kfree(old);
-+ }
-+ success = true;
-+
-+duno:
-+ spin_unlock(&fm_ns->local_lock);
-+ rcu_read_unlock_bh();
-+
-+ if (!success)
-+ goto next_event;
-+
-+ /* Now we iterate over the MPTCP-sockets and apply the event. */
-+ for (i = 0; i < MPTCP_HASH_SIZE; i++) {
-+ const struct hlist_nulls_node *node;
-+ struct tcp_sock *meta_tp;
-+
-+ rcu_read_lock_bh();
-+ hlist_nulls_for_each_entry_rcu(meta_tp, node, &tk_hashtable[i],
-+ tk_table) {
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *meta_sk = (struct sock *)meta_tp, *sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ if (sock_net(meta_sk) != net)
-+ continue;
-+
-+ if (meta_v4) {
-+ /* skip IPv6 events if meta is IPv4 */
-+ if (event->family == AF_INET6)
-+ continue;
-+ }
-+ /* skip IPv4 events if IPV6_V6ONLY is set */
-+ else if (event->family == AF_INET &&
-+ inet6_sk(meta_sk)->ipv6only)
-+ continue;
-+
-+ if (unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ continue;
-+
-+ bh_lock_sock(meta_sk);
-+
-+ if (!mptcp(meta_tp) || !is_meta_sk(meta_sk) ||
-+ mpcb->infinite_mapping_snd ||
-+ mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping)
-+ goto next;
-+
-+ /* The pm may have changed in the meantime */
-+ if (mpcb->pm_ops != &full_mesh)
-+ goto next;
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ if (!test_and_set_bit(MPTCP_PATH_MANAGER,
-+ &meta_tp->tsq_flags))
-+ sock_hold(meta_sk);
-+
-+ goto next;
-+ }
-+
-+ if (event->code == MPTCP_EVENT_ADD) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_DEL) {
-+ struct sock *sk, *tmpsk;
-+ struct mptcp_loc_addr *mptcp_local;
-+ bool found = false;
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+
-+ /* In any case, we need to update our bitfields */
-+ if (id >= 0)
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ /* Look for the socket and remove it */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ if ((event->family == AF_INET6 &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))) ||
-+ (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))))
-+ continue;
-+
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr != event->addr.in.s_addr)
-+ continue;
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6))
-+ continue;
-+
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ /* We announce the removal of this id */
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id, meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ found = true;
-+ }
-+
-+ if (found)
-+ goto next;
-+
-+ /* The id may have been given by the event, matching on a
-+ * local address. It may not have matched any of the above
-+ * sockets because the client never created a subflow, so we
-+ * finally have to remove it here.
-+ */
-+ if (id > 0)
-+ announce_remove_addr(id, meta_sk);
-+ }
-+
-+ if (event->code == MPTCP_EVENT_MOD) {
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ if (event->family == AF_INET &&
-+ (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk)) &&
-+ inet_sk(sk)->inet_saddr == event->addr.in.s_addr) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (event->family == AF_INET6 &&
-+ sk->sk_family == AF_INET6 &&
-+ !ipv6_addr_equal(&inet6_sk(sk)->saddr, &event->addr.in6)) {
-+ if (event->low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = event->low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+ }
-+ }
-+next:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk);
-+ }
-+ rcu_read_unlock_bh();
-+ }
-+ goto next_event;
-+}
-+
-+static struct mptcp_addr_event *lookup_similar_event(const struct net *net,
-+ const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ list_for_each_entry(eventq, &fm_ns->events, list) {
-+ if (eventq->family != event->family)
-+ continue;
-+ if (event->family == AF_INET) {
-+ if (eventq->addr.in.s_addr == event->addr.in.s_addr)
-+ return eventq;
-+ } else {
-+ if (ipv6_addr_equal(&eventq->addr.in6, &event->addr.in6))
-+ return eventq;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+/* We already hold the net-namespace MPTCP-lock */
-+static void add_pm_event(struct net *net, const struct mptcp_addr_event *event)
-+{
-+ struct mptcp_addr_event *eventq = lookup_similar_event(net, event);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+
-+ if (eventq) {
-+ switch (event->code) {
-+ case MPTCP_EVENT_DEL:
-+ mptcp_debug("%s del old_code %u\n", __func__, eventq->code);
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ break;
-+ case MPTCP_EVENT_ADD:
-+ mptcp_debug("%s add old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_ADD;
-+ return;
-+ case MPTCP_EVENT_MOD:
-+ mptcp_debug("%s mod old_code %u\n", __func__, eventq->code);
-+ eventq->low_prio = event->low_prio;
-+ eventq->code = MPTCP_EVENT_MOD;
-+ return;
-+ }
-+ }
-+
-+ /* OK, we have to add the new address to the wait queue */
-+ eventq = kmemdup(event, sizeof(struct mptcp_addr_event), GFP_ATOMIC);
-+ if (!eventq)
-+ return;
-+
-+ list_add_tail(&eventq->list, &fm_ns->events);
-+
-+ /* Create work-queue */
-+ if (!delayed_work_pending(&fm_ns->address_worker))
-+ queue_delayed_work(mptcp_wq, &fm_ns->address_worker,
-+ msecs_to_jiffies(500));
-+}
-+
-+static void addr4_event_handler(const struct in_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->ifa_dev->dev;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->ifa_scope > RT_SCOPE_LINK ||
-+ ipv4_is_loopback(ifa->ifa_local))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET;
-+ mpevent.addr.in.s_addr = ifa->ifa_local;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI4, code %u prio %u\n", __func__,
-+ &ifa->ifa_local, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv4-addr add/rem-events */
-+static int mptcp_pm_inetaddr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr)
-+{
-+ const struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa->ifa_dev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ addr4_event_handler(ifa, event, net);
-+
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block mptcp_pm_inetaddr_notifier = {
-+ .notifier_call = mptcp_pm_inetaddr_event,
-+};
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+
-+/* IPV6-related address/interface watchers */
-+struct mptcp_dad_data {
-+ struct timer_list timer;
-+ struct inet6_ifaddr *ifa;
-+};
-+
-+static void dad_callback(unsigned long arg);
-+static int inet6_addr_event(struct notifier_block *this,
-+ unsigned long event, void *ptr);
-+
-+static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
-+{
-+ return (ifa->flags & IFA_F_TENTATIVE) &&
-+ ifa->state == INET6_IFADDR_STATE_DAD;
-+}
-+
-+static void dad_init_timer(struct mptcp_dad_data *data,
-+ struct inet6_ifaddr *ifa)
-+{
-+ data->ifa = ifa;
-+ data->timer.data = (unsigned long)data;
-+ data->timer.function = dad_callback;
-+ if (ifa->idev->cnf.rtr_solicit_delay)
-+ data->timer.expires = jiffies + ifa->idev->cnf.rtr_solicit_delay;
-+ else
-+ data->timer.expires = jiffies + (HZ/10);
-+}
-+
-+static void dad_callback(unsigned long arg)
-+{
-+ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
-+
-+ if (ipv6_is_in_dad_state(data->ifa)) {
-+ dad_init_timer(data, data->ifa);
-+ add_timer(&data->timer);
-+ } else {
-+ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
-+ in6_ifa_put(data->ifa);
-+ kfree(data);
-+ }
-+}
-+
-+static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
-+{
-+ struct mptcp_dad_data *data;
-+
-+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
-+
-+ if (!data)
-+ return;
-+
-+ init_timer(&data->timer);
-+ dad_init_timer(data, ifa);
-+ add_timer(&data->timer);
-+ in6_ifa_hold(ifa);
-+}
-+
-+static void addr6_event_handler(const struct inet6_ifaddr *ifa, unsigned long event,
-+ struct net *net)
-+{
-+ const struct net_device *netdev = ifa->idev->dev;
-+ int addr_type = ipv6_addr_type(&ifa->addr);
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ struct mptcp_addr_event mpevent;
-+
-+ if (ifa->scope > RT_SCOPE_LINK ||
-+ addr_type == IPV6_ADDR_ANY ||
-+ (addr_type & IPV6_ADDR_LOOPBACK) ||
-+ (addr_type & IPV6_ADDR_LINKLOCAL))
-+ return;
-+
-+ spin_lock_bh(&fm_ns->local_lock);
-+
-+ mpevent.family = AF_INET6;
-+ mpevent.addr.in6 = ifa->addr;
-+ mpevent.low_prio = (netdev->flags & IFF_MPBACKUP) ? 1 : 0;
-+
-+ if (event == NETDEV_DOWN || !netif_running(netdev) ||
-+ (netdev->flags & IFF_NOMULTIPATH) || !(netdev->flags & IFF_UP))
-+ mpevent.code = MPTCP_EVENT_DEL;
-+ else if (event == NETDEV_UP)
-+ mpevent.code = MPTCP_EVENT_ADD;
-+ else if (event == NETDEV_CHANGE)
-+ mpevent.code = MPTCP_EVENT_MOD;
-+
-+ mptcp_debug("%s created event for %pI6, code %u prio %u\n", __func__,
-+ &ifa->addr, mpevent.code, mpevent.low_prio);
-+ add_pm_event(net, &mpevent);
-+
-+ spin_unlock_bh(&fm_ns->local_lock);
-+ return;
-+}
-+
-+/* React on IPv6-addr add/rem-events */
-+static int inet6_addr_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ struct inet6_ifaddr *ifa6 = (struct inet6_ifaddr *)ptr;
-+ struct net *net = dev_net(ifa6->idev->dev);
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ if (ipv6_is_in_dad_state(ifa6))
-+ dad_setup_timer(ifa6);
-+ else
-+ addr6_event_handler(ifa6, event, net);
-+
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block inet6_addr_notifier = {
-+ .notifier_call = inet6_addr_event,
-+};
-+
-+#endif
-+
-+/* React on ifup/down-events */
-+static int netdev_event(struct notifier_block *this, unsigned long event,
-+ void *ptr)
-+{
-+ const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-+ struct in_device *in_dev;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct inet6_dev *in6_dev;
-+#endif
-+
-+ if (!(event == NETDEV_UP || event == NETDEV_DOWN ||
-+ event == NETDEV_CHANGE))
-+ return NOTIFY_DONE;
-+
-+ rcu_read_lock();
-+ in_dev = __in_dev_get_rtnl(dev);
-+
-+ if (in_dev) {
-+ for_ifa(in_dev) {
-+ mptcp_pm_inetaddr_event(NULL, event, ifa);
-+ } endfor_ifa(in_dev);
-+ }
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ in6_dev = __in6_dev_get(dev);
-+
-+ if (in6_dev) {
-+ struct inet6_ifaddr *ifa6;
-+ list_for_each_entry(ifa6, &in6_dev->addr_list, if_list)
-+ inet6_addr_event(NULL, event, ifa6);
-+ }
-+#endif
-+
-+ rcu_read_unlock();
-+ return NOTIFY_DONE;
-+}
-+
-+static struct notifier_block mptcp_pm_netdev_notifier = {
-+ .notifier_call = netdev_event,
-+};
-+
-+static void full_mesh_add_raddr(struct mptcp_cb *mpcb,
-+ const union inet_addr *addr,
-+ sa_family_t family, __be16 port, u8 id)
-+{
-+ if (family == AF_INET)
-+ mptcp_addv4_raddr(mpcb, &addr->in, port, id);
-+ else
-+ mptcp_addv6_raddr(mpcb, &addr->in6, port, id);
-+}
-+
-+static void full_mesh_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ int i, index;
-+ union inet_addr saddr, daddr;
-+ sa_family_t family;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ /* Init local variables necessary for the rest */
-+ if (meta_sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(meta_sk)) {
-+ saddr.ip = inet_sk(meta_sk)->inet_saddr;
-+ daddr.ip = inet_sk(meta_sk)->inet_daddr;
-+ family = AF_INET;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ saddr.in6 = inet6_sk(meta_sk)->saddr;
-+ daddr.in6 = meta_sk->sk_v6_daddr;
-+ family = AF_INET6;
-+#endif
-+ }
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, &saddr);
-+ if (index < 0)
-+ goto fallback;
-+
-+ full_mesh_add_raddr(mpcb, &daddr, family, 0, 0);
-+ mptcp_set_init_addr_bit(mpcb, &daddr, family, index);
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ INIT_DELAYED_WORK(&fmp->subflow_retry_work, retry_subflow_worker);
-+ fmp->mpcb = mpcb;
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* Look for the address among the local addresses */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ __be32 ifa_address = mptcp_local->locaddr4[i].addr.s_addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET && saddr.ip == ifa_address)
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ const struct in6_addr *ifa6 = &mptcp_local->locaddr6[i].addr;
-+
-+ /* We do not need to announce the initial subflow's address again */
-+ if (family == AF_INET6 && ipv6_addr_equal(&saddr.in6, ifa6))
-+ continue;
-+
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+ }
-+
-+skip_ipv6:
-+#endif
-+
-+ rcu_read_unlock();
-+
-+ if (family == AF_INET)
-+ fmp->announced_addrs_v4 |= (1 << index);
-+ else
-+ fmp->announced_addrs_v6 |= (1 << index);
-+
-+ for (i = fmp->add_addr; i && fmp->add_addr; i--)
-+ tcp_send_ack(mpcb->master_sk);
-+
-+ return;
-+
-+fallback:
-+ rcu_read_unlock();
-+ mptcp_fallback_default(mpcb);
-+ return;
-+}
-+
-+static void full_mesh_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ return;
-+
-+ if (!work_pending(&fmp->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &fmp->subflow_work);
-+ }
-+}
-+
-+/* Called upon release_sock, if the socket was owned by the user during
-+ * a path-management event.
-+ */
-+static void full_mesh_release_sock(struct sock *meta_sk)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(meta_sk));
-+ struct sock *sk, *tmpsk;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+ int i;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* First, detect modifications or additions */
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct in_addr ifa = mptcp_local->locaddr4[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET6 &&
-+ !mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (inet_sk(sk)->inet_saddr != ifa.s_addr)
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr4[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr4[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+skip_ipv4:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ /* skip IPv6 addresses if meta-socket is IPv4 */
-+ if (meta_v4)
-+ goto removal;
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct in6_addr ifa = mptcp_local->locaddr6[i].addr;
-+ bool found = false;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(sk))
-+ continue;
-+
-+ if (!ipv6_addr_equal(&inet6_sk(sk)->saddr, &ifa))
-+ continue;
-+
-+ found = true;
-+
-+ if (mptcp_local->locaddr6[i].low_prio != tp->mptcp->low_prio) {
-+ tp->mptcp->send_mp_prio = 1;
-+ tp->mptcp->low_prio = mptcp_local->locaddr6[i].low_prio;
-+
-+ tcp_send_ack(sk);
-+ }
-+ }
-+
-+ if (!found) {
-+ fmp->add_addr++;
-+ mpcb->addr_signal = 1;
-+
-+ sk = mptcp_select_ack_sock(meta_sk);
-+ if (sk)
-+ tcp_send_ack(sk);
-+ full_mesh_create_subflows(meta_sk);
-+ }
-+ }
-+
-+removal:
-+#endif
-+
-+ /* Now, detect address-removals */
-+ mptcp_for_each_sk_safe(mpcb, sk, tmpsk) {
-+ bool shall_remove = true;
-+
-+ if (sk->sk_family == AF_INET || mptcp_v6_is_v4_mapped(sk)) {
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ if (inet_sk(sk)->inet_saddr == mptcp_local->locaddr4[i].addr.s_addr) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ } else {
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ if (ipv6_addr_equal(&inet6_sk(sk)->saddr, &mptcp_local->locaddr6[i].addr)) {
-+ shall_remove = false;
-+ break;
-+ }
-+ }
-+ }
-+
-+ if (shall_remove) {
-+ /* Reinject, so that pf = 1 and so we
-+ * won't select this one as the
-+ * ack-sock.
-+ */
-+ mptcp_reinject_data(sk, 0);
-+
-+ announce_remove_addr(tcp_sk(sk)->mptcp->loc_id,
-+ meta_sk);
-+
-+ mptcp_sub_force_close(sk);
-+ }
-+ }
-+
-+ /* Just call it optimistically. It actually cannot do any harm */
-+ update_addr_bitfields(meta_sk, mptcp_local);
-+
-+ rcu_read_unlock();
-+}
-+
-+static int full_mesh_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int index, id = -1;
-+
-+ /* Handle the backup-flows */
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ index = mptcp_find_address(mptcp_local, family, addr);
-+
-+ if (index != -1) {
-+ if (family == AF_INET) {
-+ id = mptcp_local->locaddr4[index].loc4_id;
-+ *low_prio = mptcp_local->locaddr4[index].low_prio;
-+ } else {
-+ id = mptcp_local->locaddr6[index].loc6_id;
-+ *low_prio = mptcp_local->locaddr6[index].low_prio;
-+ }
-+ }
-+
-+
-+ rcu_read_unlock();
-+
-+ return id;
-+}
-+
-+static void full_mesh_addr_signal(struct sock *sk, unsigned *size,
-+ struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ struct fullmesh_priv *fmp = fullmesh_get_priv(mpcb);
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns = fm_get_ns(sock_net(sk));
-+ int remove_addr_len;
-+ u8 unannouncedv4 = 0, unannouncedv6 = 0;
-+ bool meta_v4 = meta_sk->sk_family == AF_INET;
-+
-+ mpcb->addr_signal = 0;
-+
-+ if (likely(!fmp->add_addr))
-+ goto remove_addr;
-+
-+ rcu_read_lock();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ if (!meta_v4 && inet6_sk(meta_sk)->ipv6only)
-+ goto skip_ipv4;
-+
-+ /* IPv4 */
-+ unannouncedv4 = (~fmp->announced_addrs_v4) & mptcp_local->loc4_bits;
-+ if (unannouncedv4 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR4_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv4);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr4.addr_id = mptcp_local->locaddr4[ind].loc4_id;
-+ opts->add_addr4.addr = mptcp_local->locaddr4[ind].addr;
-+ opts->add_addr_v4 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v4 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN;
-+ }
-+
-+ if (meta_v4)
-+ goto skip_ipv6;
-+
-+skip_ipv4:
-+ /* IPv6 */
-+ unannouncedv6 = (~fmp->announced_addrs_v6) & mptcp_local->loc6_bits;
-+ if (unannouncedv6 &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_ADD_ADDR6_ALIGN) {
-+ int ind = mptcp_find_free_index(~unannouncedv6);
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_ADD_ADDR;
-+ opts->add_addr6.addr_id = mptcp_local->locaddr6[ind].loc6_id;
-+ opts->add_addr6.addr = mptcp_local->locaddr6[ind].addr;
-+ opts->add_addr_v6 = 1;
-+
-+ if (skb) {
-+ fmp->announced_addrs_v6 |= (1 << ind);
-+ fmp->add_addr--;
-+ }
-+ *size += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN;
-+ }
-+
-+skip_ipv6:
-+ rcu_read_unlock();
-+
-+ if (!unannouncedv4 && !unannouncedv6 && skb)
-+ fmp->add_addr--;
-+
-+remove_addr:
-+ if (likely(!fmp->remove_addrs))
-+ goto exit;
-+
-+ remove_addr_len = mptcp_sub_len_remove_addr_align(fmp->remove_addrs);
-+ if (MAX_TCP_OPTION_SPACE - *size < remove_addr_len)
-+ goto exit;
-+
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_REMOVE_ADDR;
-+ opts->remove_addrs = fmp->remove_addrs;
-+ *size += remove_addr_len;
-+ if (skb)
-+ fmp->remove_addrs = 0;
-+
-+exit:
-+ mpcb->addr_signal = !!(fmp->add_addr || fmp->remove_addrs);
-+}
-+
-+static void full_mesh_rem_raddr(struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ mptcp_v4_rem_raddress(mpcb, rem_id);
-+ mptcp_v6_rem_raddress(mpcb, rem_id);
-+}
-+
-+/* Output /proc/net/mptcp_fullmesh */
-+static int mptcp_fm_seq_show(struct seq_file *seq, void *v)
-+{
-+ const struct net *net = seq->private;
-+ struct mptcp_loc_addr *mptcp_local;
-+ const struct mptcp_fm_ns *fm_ns = fm_get_ns(net);
-+ int i;
-+
-+ seq_printf(seq, "Index, Address-ID, Backup, IP-address\n");
-+
-+ rcu_read_lock_bh();
-+ mptcp_local = rcu_dereference(fm_ns->local);
-+
-+ seq_printf(seq, "IPv4, next v4-index: %u\n", mptcp_local->next_v4_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc4_bits, i) {
-+ struct mptcp_loc4 *loc4 = &mptcp_local->locaddr4[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI4\n", i, loc4->loc4_id,
-+ loc4->low_prio, &loc4->addr);
-+ }
-+
-+ seq_printf(seq, "IPv6, next v6-index: %u\n", mptcp_local->next_v6_index);
-+
-+ mptcp_for_each_bit_set(mptcp_local->loc6_bits, i) {
-+ struct mptcp_loc6 *loc6 = &mptcp_local->locaddr6[i];
-+
-+ seq_printf(seq, "%u, %u, %u, %pI6\n", i, loc6->loc6_id,
-+ loc6->low_prio, &loc6->addr);
-+ }
-+ rcu_read_unlock_bh();
-+
-+ return 0;
-+}
-+
-+static int mptcp_fm_seq_open(struct inode *inode, struct file *file)
-+{
-+ return single_open_net(inode, file, mptcp_fm_seq_show);
-+}
-+
-+static const struct file_operations mptcp_fm_seq_fops = {
-+ .owner = THIS_MODULE,
-+ .open = mptcp_fm_seq_open,
-+ .read = seq_read,
-+ .llseek = seq_lseek,
-+ .release = single_release_net,
-+};
-+
-+static int mptcp_fm_init_net(struct net *net)
-+{
-+ struct mptcp_loc_addr *mptcp_local;
-+ struct mptcp_fm_ns *fm_ns;
-+ int err = 0;
-+
-+ fm_ns = kzalloc(sizeof(*fm_ns), GFP_KERNEL);
-+ if (!fm_ns)
-+ return -ENOBUFS;
-+
-+ mptcp_local = kzalloc(sizeof(*mptcp_local), GFP_KERNEL);
-+ if (!mptcp_local) {
-+ err = -ENOBUFS;
-+ goto err_mptcp_local;
-+ }
-+
-+ if (!proc_create("mptcp_fullmesh", S_IRUGO, net->proc_net,
-+ &mptcp_fm_seq_fops)) {
-+ err = -ENOMEM;
-+ goto err_seq_fops;
-+ }
-+
-+ mptcp_local->next_v4_index = 1;
-+
-+ rcu_assign_pointer(fm_ns->local, mptcp_local);
-+ INIT_DELAYED_WORK(&fm_ns->address_worker, mptcp_address_worker);
-+ INIT_LIST_HEAD(&fm_ns->events);
-+ spin_lock_init(&fm_ns->local_lock);
-+ fm_ns->net = net;
-+ net->mptcp.path_managers[MPTCP_PM_FULLMESH] = fm_ns;
-+
-+ return 0;
-+err_seq_fops:
-+ kfree(mptcp_local);
-+err_mptcp_local:
-+ kfree(fm_ns);
-+ return err;
-+}
-+
-+static void mptcp_fm_exit_net(struct net *net)
-+{
-+ struct mptcp_addr_event *eventq, *tmp;
-+ struct mptcp_fm_ns *fm_ns;
-+ struct mptcp_loc_addr *mptcp_local;
-+
-+ fm_ns = fm_get_ns(net);
-+ cancel_delayed_work_sync(&fm_ns->address_worker);
-+
-+ rcu_read_lock_bh();
-+
-+ mptcp_local = rcu_dereference_bh(fm_ns->local);
-+ kfree(mptcp_local);
-+
-+ spin_lock(&fm_ns->local_lock);
-+ list_for_each_entry_safe(eventq, tmp, &fm_ns->events, list) {
-+ list_del(&eventq->list);
-+ kfree(eventq);
-+ }
-+ spin_unlock(&fm_ns->local_lock);
-+
-+ rcu_read_unlock_bh();
-+
-+ remove_proc_entry("mptcp_fullmesh", net->proc_net);
-+
-+ kfree(fm_ns);
-+}
-+
-+static struct pernet_operations full_mesh_net_ops = {
-+ .init = mptcp_fm_init_net,
-+ .exit = mptcp_fm_exit_net,
-+};
-+
-+static struct mptcp_pm_ops full_mesh __read_mostly = {
-+ .new_session = full_mesh_new_session,
-+ .release_sock = full_mesh_release_sock,
-+ .fully_established = full_mesh_create_subflows,
-+ .new_remote_address = full_mesh_create_subflows,
-+ .get_local_id = full_mesh_get_local_id,
-+ .addr_signal = full_mesh_addr_signal,
-+ .add_raddr = full_mesh_add_raddr,
-+ .rem_raddr = full_mesh_rem_raddr,
-+ .name = "fullmesh",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init full_mesh_register(void)
-+{
-+ int ret;
-+
-+ BUILD_BUG_ON(sizeof(struct fullmesh_priv) > MPTCP_PM_SIZE);
-+
-+ ret = register_pernet_subsys(&full_mesh_net_ops);
-+ if (ret)
-+ goto out;
-+
-+ ret = register_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ if (ret)
-+ goto err_reg_inetaddr;
-+ ret = register_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ if (ret)
-+ goto err_reg_netdev;
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+ ret = register_inet6addr_notifier(&inet6_addr_notifier);
-+ if (ret)
-+ goto err_reg_inet6addr;
-+#endif
-+
-+ ret = mptcp_register_path_manager(&full_mesh);
-+ if (ret)
-+ goto err_reg_pm;
-+
-+out:
-+ return ret;
-+
-+
-+err_reg_pm:
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+err_reg_inet6addr:
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+err_reg_netdev:
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+err_reg_inetaddr:
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ goto out;
-+}
-+
-+static void full_mesh_unregister(void)
-+{
-+#if IS_ENABLED(CONFIG_IPV6)
-+ unregister_inet6addr_notifier(&inet6_addr_notifier);
-+#endif
-+ unregister_netdevice_notifier(&mptcp_pm_netdev_notifier);
-+ unregister_inetaddr_notifier(&mptcp_pm_inetaddr_notifier);
-+ unregister_pernet_subsys(&full_mesh_net_ops);
-+ mptcp_unregister_path_manager(&full_mesh);
-+}
-+
-+module_init(full_mesh_register);
-+module_exit(full_mesh_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("Full-Mesh MPTCP");
-+MODULE_VERSION("0.88");
-diff --git a/net/mptcp/mptcp_input.c b/net/mptcp/mptcp_input.c
-new file mode 100644
-index 000000000000..43704ccb639e
---- /dev/null
-+++ b/net/mptcp/mptcp_input.c
-@@ -0,0 +1,2405 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <asm/unaligned.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+
-+#include <linux/kconfig.h>
-+
-+/* is seq1 < seq2 ? */
-+static inline bool before64(const u64 seq1, const u64 seq2)
-+{
-+ return (s64)(seq1 - seq2) < 0;
-+}
-+
-+/* is seq1 > seq2 ? */
-+#define after64(seq1, seq2) before64(seq2, seq1)
-+
-+static inline void mptcp_become_fully_estab(struct sock *sk)
-+{
-+ tcp_sk(sk)->mptcp->fully_established = 1;
-+
-+ if (is_master_tp(tcp_sk(sk)) &&
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established)
-+ tcp_sk(sk)->mpcb->pm_ops->fully_established(mptcp_meta_sk(sk));
-+}
-+
-+/* Similar to tcp_tso_acked without any memory accounting */
-+static inline int mptcp_tso_acked_reinject(const struct sock *meta_sk,
-+ struct sk_buff *skb)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 packets_acked, len;
-+
-+ BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una));
-+
-+ packets_acked = tcp_skb_pcount(skb);
-+
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ return 0;
-+
-+ len = meta_tp->snd_una - TCP_SKB_CB(skb)->seq;
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+ skb->truesize -= len;
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(meta_sk, skb, tcp_skb_mss(skb));
-+ packets_acked -= tcp_skb_pcount(skb);
-+
-+ if (packets_acked) {
-+ BUG_ON(tcp_skb_pcount(skb) == 0);
-+ BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
-+ }
-+
-+ return packets_acked;
-+}
-+
-+/**
-+ * Cleans the meta-socket retransmission queue and the reinject-queue.
-+ * @sk must be the metasocket.
-+ */
-+static void mptcp_clean_rtx_queue(struct sock *meta_sk, u32 prior_snd_una)
-+{
-+ struct sk_buff *skb, *tmp;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ bool acked = false;
-+ u32 acked_pcount;
-+
-+ while ((skb = tcp_write_queue_head(meta_sk)) &&
-+ skb != tcp_send_head(meta_sk)) {
-+ bool fully_acked = true;
-+
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ acked_pcount = tcp_tso_acked(meta_sk, skb);
-+ if (!acked_pcount)
-+ break;
-+
-+ fully_acked = false;
-+ } else {
-+ acked_pcount = tcp_skb_pcount(skb);
-+ }
-+
-+ acked = true;
-+ meta_tp->packets_out -= acked_pcount;
-+ meta_tp->retrans_stamp = 0;
-+
-+ if (!fully_acked)
-+ break;
-+
-+ tcp_unlink_write_queue(skb, meta_sk);
-+
-+ if (mptcp_is_data_fin(skb)) {
-+ struct sock *sk_it;
-+
-+ /* DATA_FIN has been acknowledged - now we can close
-+ * the subflows
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ unsigned long delay = 0;
-+
-+ /* If we are the passive closer, don't trigger
-+ * subflow-fin until the subflow has been finned
-+ * by the peer - thus we add a delay.
-+ */
-+ if (mpcb->passive_close &&
-+ sk_it->sk_state == TCP_ESTABLISHED)
-+ delay = inet_csk(sk_it)->icsk_rto << 3;
-+
-+ mptcp_sub_close(sk_it, delay);
-+ }
-+ }
-+ sk_wmem_free_skb(meta_sk, skb);
-+ }
-+ /* Remove acknowledged data from the reinject queue */
-+ skb_queue_walk_safe(&mpcb->reinject_queue, skb, tmp) {
-+ if (before(meta_tp->snd_una, TCP_SKB_CB(skb)->end_seq)) {
-+ if (tcp_skb_pcount(skb) == 1 ||
-+ !after(meta_tp->snd_una, TCP_SKB_CB(skb)->seq))
-+ break;
-+
-+ mptcp_tso_acked_reinject(meta_sk, skb);
-+ break;
-+ }
-+
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ }
-+
-+ if (likely(between(meta_tp->snd_up, prior_snd_una, meta_tp->snd_una)))
-+ meta_tp->snd_up = meta_tp->snd_una;
-+
-+ if (acked) {
-+ tcp_rearm_rto(meta_sk);
-+ /* Normally this is done in tcp_try_undo_loss - but MPTCP
-+ * does not call this function.
-+ */
-+ inet_csk(meta_sk)->icsk_retransmits = 0;
-+ }
-+}
-+
-+/* Inspired by tcp_rcv_state_process */
-+static int mptcp_rcv_state_process(struct sock *meta_sk, struct sock *sk,
-+ const struct sk_buff *skb, u32 data_seq,
-+ u16 data_len)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ const struct tcphdr *th = tcp_hdr(skb);
-+
-+ /* State-machine handling if FIN has been enqueued and he has
-+ * been acked (snd_una == write_seq) - it's important that this
-+ * here is after sk_wmem_free_skb because otherwise
-+ * sk_forward_alloc is wrong upon inet_csk_destroy_sock()
-+ */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1: {
-+ struct dst_entry *dst;
-+ int tmo;
-+
-+ if (meta_tp->snd_una != meta_tp->write_seq)
-+ break;
-+
-+ tcp_set_state(meta_sk, TCP_FIN_WAIT2);
-+ meta_sk->sk_shutdown |= SEND_SHUTDOWN;
-+
-+ dst = __sk_dst_get(sk);
-+ if (dst)
-+ dst_confirm(dst);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ /* Wake up lingering close() */
-+ meta_sk->sk_state_change(meta_sk);
-+ break;
-+ }
-+
-+ if (meta_tp->linger2 < 0 ||
-+ (data_len &&
-+ after(data_seq + data_len - (mptcp_is_data_fin2(skb, tp) ? 1 : 0),
-+ meta_tp->rcv_nxt))) {
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_done(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ return 1;
-+ }
-+
-+ tmo = tcp_fin_time(meta_sk);
-+ if (tmo > TCP_TIMEWAIT_LEN) {
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo - TCP_TIMEWAIT_LEN);
-+ } else if (mptcp_is_data_fin2(skb, tp) || sock_owned_by_user(meta_sk)) {
-+ /* Bad case. We could lose such FIN otherwise.
-+ * It is not a big problem, but it looks confusing
-+ * and not so rare event. We still can lose it now,
-+ * if it spins in bh_lock_sock(), but it is really
-+ * marginal case.
-+ */
-+ inet_csk_reset_keepalive_timer(meta_sk, tmo);
-+ } else {
-+ meta_tp->ops->time_wait(meta_sk, TCP_FIN_WAIT2, tmo);
-+ }
-+ break;
-+ }
-+ case TCP_CLOSING:
-+ case TCP_LAST_ACK:
-+ if (meta_tp->snd_una == meta_tp->write_seq) {
-+ tcp_done(meta_sk);
-+ return 1;
-+ }
-+ break;
-+ }
-+
-+ /* step 7: process the segment text */
-+ switch (meta_sk->sk_state) {
-+ case TCP_FIN_WAIT1:
-+ case TCP_FIN_WAIT2:
-+ /* RFC 793 says to queue data in these states,
-+ * RFC 1122 says we MUST send a reset.
-+ * BSD 4.4 also does reset.
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-+ after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt) &&
-+ !mptcp_is_data_fin2(skb, tp)) {
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPABORTONDATA);
-+ mptcp_send_active_reset(meta_sk, GFP_ATOMIC);
-+ tcp_reset(meta_sk);
-+ return 1;
-+ }
-+ }
-+ break;
-+ }
-+
-+ return 0;
-+}
-+
-+/**
-+ * @return:
-+ * i) 1: Everything's fine.
-+ * ii) -1: A reset has been sent on the subflow - csum-failure
-+ * iii) 0: csum-failure but no reset sent, because it's the last subflow.
-+ * Last packet should not be destroyed by the caller because it has
-+ * been done here.
-+ */
-+static int mptcp_verif_dss_csum(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1, *last = NULL;
-+ __wsum csum_tcp = 0; /* cumulative checksum of pld + mptcp-header */
-+ int ans = 1, overflowed = 0, offset = 0, dss_csum_added = 0;
-+ int iter = 0;
-+
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp, tmp1) {
-+ unsigned int csum_len;
-+
-+ if (before(tp->mptcp->map_subseq + tp->mptcp->map_data_len, TCP_SKB_CB(tmp)->end_seq))
-+ /* Mapping ends in the middle of the packet -
-+ * csum only these bytes
-+ */
-+ csum_len = tp->mptcp->map_subseq + tp->mptcp->map_data_len - TCP_SKB_CB(tmp)->seq;
-+ else
-+ csum_len = tmp->len;
-+
-+ offset = 0;
-+ if (overflowed) {
-+ char first_word[4];
-+ first_word[0] = 0;
-+ first_word[1] = 0;
-+ first_word[2] = 0;
-+ first_word[3] = *(tmp->data);
-+ csum_tcp = csum_partial(first_word, 4, csum_tcp);
-+ offset = 1;
-+ csum_len--;
-+ overflowed = 0;
-+ }
-+
-+ csum_tcp = skb_checksum(tmp, offset, csum_len, csum_tcp);
-+
-+ /* Was it an odd length? Then we have to merge the next byte
-+ * correctly (see above)
-+ */
-+ if (csum_len != (csum_len & (~1)))
-+ overflowed = 1;
-+
-+ if (mptcp_is_data_seq(tmp) && !dss_csum_added) {
-+ __be32 data_seq = htonl((u32)(tp->mptcp->map_data_seq >> 32));
-+
-+ /* If a 64-bit dss is present, we increase the offset
-+ * by 4 bytes, as the high-order 64-bits will be added
-+ * in the final csum_partial-call.
-+ */
-+ u32 offset = skb_transport_offset(tmp) +
-+ TCP_SKB_CB(tmp)->dss_off;
-+ if (TCP_SKB_CB(tmp)->mptcp_flags & MPTCPHDR_SEQ64_SET)
-+ offset += 4;
-+
-+ csum_tcp = skb_checksum(tmp, offset,
-+ MPTCP_SUB_LEN_SEQ_CSUM,
-+ csum_tcp);
-+
-+ csum_tcp = csum_partial(&data_seq,
-+ sizeof(data_seq), csum_tcp);
-+
-+ dss_csum_added = 1; /* Just do it once */
-+ }
-+ last = tmp;
-+ iter++;
-+
-+ if (!skb_queue_is_last(&sk->sk_receive_queue, tmp) &&
-+ !before(TCP_SKB_CB(tmp1)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ /* Now, checksum must be 0 */
-+ if (unlikely(csum_fold(csum_tcp))) {
-+ pr_err("%s csum is wrong: %#x data_seq %u dss_csum_added %d overflowed %d iterations %d\n",
-+ __func__, csum_fold(csum_tcp), TCP_SKB_CB(last)->seq,
-+ dss_csum_added, overflowed, iter);
-+
-+ tp->mptcp->send_mp_fail = 1;
-+
-+ /* map_data_seq is the data-seq number of the
-+ * mapping we are currently checking
-+ */
-+ tp->mpcb->csum_cutoff_seq = tp->mptcp->map_data_seq;
-+
-+ if (tp->mpcb->cnt_subflows > 1) {
-+ mptcp_send_reset(sk);
-+ ans = -1;
-+ } else {
-+ tp->mpcb->send_infinite_mapping = 1;
-+
-+ /* Need to purge the rcv-queue as it is no longer valid */
-+ while ((tmp = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
-+ tp->copied_seq = TCP_SKB_CB(tmp)->end_seq;
-+ kfree_skb(tmp);
-+ }
-+
-+ ans = 0;
-+ }
-+ }
-+
-+ return ans;
-+}
-+
-+static inline void mptcp_prepare_skb(struct sk_buff *skb,
-+ const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 inc = 0;
-+
-+ /* If skb is the end of this mapping (end is always at mapping-boundary
-+ * thanks to the splitting/trimming), then we need to increase
-+ * data-end-seq by 1 if this here is a data-fin.
-+ *
-+ * We need to do -1 because end_seq includes the subflow-FIN.
-+ */
-+ if (tp->mptcp->map_data_fin &&
-+ (tcb->end_seq - (tcp_hdr(skb)->fin ? 1 : 0)) ==
-+ (tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ inc = 1;
-+
-+ /* We manually set the fin-flag if it is a data-fin, for easy
-+ * processing in tcp_recvmsg.
-+ */
-+ tcp_hdr(skb)->fin = 1;
-+ } else {
-+ /* We may have a subflow-fin with data but without data-fin */
-+ tcp_hdr(skb)->fin = 0;
-+ }
-+
-+ /* Adapt the data-seqs to the packet itself. We essentially transform
-+ * the dss-mapping to a per-packet granularity. This is necessary to
-+ * correctly handle overlapping mappings coming from different
-+ * subflows. Otherwise it would be a complete mess.
-+ */
-+ tcb->seq = ((u32)tp->mptcp->map_data_seq) + tcb->seq - tp->mptcp->map_subseq;
-+ tcb->end_seq = tcb->seq + skb->len + inc;
-+}
-+
-+/**
-+ * @return: 1 if the segment has been eaten and can be suppressed,
-+ * otherwise 0.
-+ */
-+static inline int mptcp_direct_copy(const struct sk_buff *skb,
-+ struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int chunk = min_t(unsigned int, skb->len, meta_tp->ucopy.len);
-+ int eaten = 0;
-+
-+ __set_current_state(TASK_RUNNING);
-+
-+ local_bh_enable();
-+ if (!skb_copy_datagram_iovec(skb, 0, meta_tp->ucopy.iov, chunk)) {
-+ meta_tp->ucopy.len -= chunk;
-+ meta_tp->copied_seq += chunk;
-+ eaten = (chunk == skb->len);
-+ tcp_rcv_space_adjust(meta_sk);
-+ }
-+ local_bh_disable();
-+ return eaten;
-+}
-+
-+static inline void mptcp_reset_mapping(struct tcp_sock *tp)
-+{
-+ tp->mptcp->map_data_len = 0;
-+ tp->mptcp->map_data_seq = 0;
-+ tp->mptcp->map_subseq = 0;
-+ tp->mptcp->map_data_fin = 0;
-+ tp->mptcp->mapping_present = 0;
-+}
-+
-+/* The DSS-mapping received on the sk only covers the second half of the skb
-+ * (cut at seq). We trim the head from the skb.
-+ * Data will be freed upon kfree().
-+ *
-+ * Inspired by tcp_trim_head().
-+ */
-+static void mptcp_skb_trim_head(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ int len = seq - TCP_SKB_CB(skb)->seq;
-+ u32 new_seq = TCP_SKB_CB(skb)->seq + len;
-+
-+ if (len < skb_headlen(skb))
-+ __skb_pull(skb, len);
-+ else
-+ __pskb_trim_head(skb, len - skb_headlen(skb));
-+
-+ TCP_SKB_CB(skb)->seq = new_seq;
-+
-+ skb->truesize -= len;
-+ atomic_sub(len, &sk->sk_rmem_alloc);
-+ sk_mem_uncharge(sk, len);
-+}
-+
-+/* The DSS-mapping received on the sk only covers the first half of the skb
-+ * (cut at seq). We create a second skb (@return), and queue it in the rcv-queue
-+ * as further packets may resolve the mapping of the second half of data.
-+ *
-+ * Inspired by tcp_fragment().
-+ */
-+static int mptcp_skb_split_tail(struct sk_buff *skb, struct sock *sk, u32 seq)
-+{
-+ struct sk_buff *buff;
-+ int nsize;
-+ int nlen, len;
-+
-+ len = seq - TCP_SKB_CB(skb)->seq;
-+ nsize = skb_headlen(skb) - len + tcp_sk(sk)->tcp_header_len;
-+ if (nsize < 0)
-+ nsize = 0;
-+
-+ /* Get a new skb... force flag on. */
-+ buff = alloc_skb(nsize, GFP_ATOMIC);
-+ if (buff == NULL)
-+ return -ENOMEM;
-+
-+ skb_reserve(buff, tcp_sk(sk)->tcp_header_len);
-+ skb_reset_transport_header(buff);
-+
-+ tcp_hdr(buff)->fin = tcp_hdr(skb)->fin;
-+ tcp_hdr(skb)->fin = 0;
-+
-+ /* We absolutely need to call skb_set_owner_r before refreshing the
-+ * truesize of buff, otherwise the moved data will be accounted twice.
-+ */
-+ skb_set_owner_r(buff, sk);
-+ nlen = skb->len - len - nsize;
-+ buff->truesize += nlen;
-+ skb->truesize -= nlen;
-+
-+ /* Correct the sequence numbers. */
-+ TCP_SKB_CB(buff)->seq = TCP_SKB_CB(skb)->seq + len;
-+ TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
-+ TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
-+
-+ skb_split(skb, buff, len);
-+
-+ __skb_queue_after(&sk->sk_receive_queue, skb, buff);
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_prevalidate_skb(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ /* If we are in infinite mode, the subflow-fin is in fact a data-fin. */
-+ if (!skb->len && tcp_hdr(skb)->fin && !mptcp_is_data_fin(skb) &&
-+ !tp->mpcb->infinite_mapping_rcv) {
-+ /* Remove a pure subflow-fin from the queue and increase
-+ * copied_seq.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* If we are not yet fully established and do not know the mapping for
-+ * this segment, this path has to fallback to infinite or be torn down.
-+ */
-+ if (!tp->mptcp->fully_established && !mptcp_is_data_seq(skb) &&
-+ !tp->mptcp->mapping_present && !tp->mpcb->infinite_mapping_rcv) {
-+ pr_err("%s %#x will fallback - pi %d from %pS, seq %u\n",
-+ __func__, tp->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, __builtin_return_address(0),
-+ TCP_SKB_CB(skb)->seq);
-+
-+ if (!is_master_tp(tp)) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mpcb->infinite_mapping_rcv = 1;
-+ /* We do a seamless fallback and should not send an infinite mapping. */
-+ tp->mpcb->send_infinite_mapping = 0;
-+ tp->mptcp->fully_established = 1;
-+ }
-+
-+ /* Receiver-side becomes fully established when a whole rcv-window has
-+ * been received without the need to fallback due to the previous
-+ * condition.
-+ */
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->init_rcv_wnd -= skb->len;
-+ if (tp->mptcp->init_rcv_wnd < 0)
-+ mptcp_become_fully_estab(sk);
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_detect_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 *ptr;
-+ u32 data_seq, sub_seq, data_len, tcp_end_seq;
-+
-+ /* If we are in infinite-mapping-mode, the subflow is guaranteed to be
-+ * in-order at the data-level. Thus data-seq-numbers can be inferred
-+ * from what is expected at the data-level.
-+ */
-+ if (mpcb->infinite_mapping_rcv) {
-+ tp->mptcp->map_data_seq = mptcp_get_rcv_nxt_64(meta_tp);
-+ tp->mptcp->map_subseq = tcb->seq;
-+ tp->mptcp->map_data_len = skb->len;
-+ tp->mptcp->map_data_fin = tcp_hdr(skb)->fin;
-+ tp->mptcp->mapping_present = 1;
-+ return 0;
-+ }
-+
-+ /* No mapping here? Exit - it is either already set or still on its way */
-+ if (!mptcp_is_data_seq(skb)) {
-+ /* Too many packets without a mapping - this subflow is broken */
-+ if (!tp->mptcp->mapping_present &&
-+ tp->rcv_nxt - tp->copied_seq > 65536) {
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ return 0;
-+ }
-+
-+ ptr = mptcp_skb_set_data_seq(skb, &data_seq, mpcb);
-+ ptr++;
-+ sub_seq = get_unaligned_be32(ptr) + tp->mptcp->rcv_isn;
-+ ptr++;
-+ data_len = get_unaligned_be16(ptr);
-+
-+ /* If it's an empty skb with DATA_FIN, sub_seq must get fixed.
-+ * The draft sets it to 0, but we really would like to have the
-+ * real value, to have an easy handling afterwards here in this
-+ * function.
-+ */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ sub_seq = TCP_SKB_CB(skb)->seq;
-+
-+ /* If there is already a mapping - we check if it maps with the current
-+ * one. If not - we reset.
-+ */
-+ if (tp->mptcp->mapping_present &&
-+ (data_seq != (u32)tp->mptcp->map_data_seq ||
-+ sub_seq != tp->mptcp->map_subseq ||
-+ data_len != tp->mptcp->map_data_len + tp->mptcp->map_data_fin ||
-+ mptcp_is_data_fin(skb) != tp->mptcp->map_data_fin)) {
-+ /* Mapping in packet is different from what we want */
-+ pr_err("%s Mappings do not match!\n", __func__);
-+ pr_err("%s dseq %u mdseq %u, sseq %u msseq %u dlen %u mdlen %u dfin %d mdfin %d\n",
-+ __func__, data_seq, (u32)tp->mptcp->map_data_seq,
-+ sub_seq, tp->mptcp->map_subseq, data_len,
-+ tp->mptcp->map_data_len, mptcp_is_data_fin(skb),
-+ tp->mptcp->map_data_fin);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* If the previous check was good, the current mapping is valid and we exit. */
-+ if (tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* Mapping not yet set on this subflow - we set it here! */
-+
-+ if (!data_len) {
-+ mpcb->infinite_mapping_rcv = 1;
-+ tp->mptcp->fully_established = 1;
-+ /* We need to repeat mp_fail's until the sender fell
-+ * back to infinite-mapping - here we stop repeating it.
-+ */
-+ tp->mptcp->send_mp_fail = 0;
-+
-+ /* We have to fixup data_len - it must be the same as skb->len */
-+ data_len = skb->len + (mptcp_is_data_fin(skb) ? 1 : 0);
-+ sub_seq = tcb->seq;
-+
-+ /* TODO kill all other subflows than this one */
-+ /* data_seq and so on are set correctly */
-+
-+ /* At this point, the meta-ofo-queue has to be emptied,
-+ * as the following data is guaranteed to be in-order at
-+ * the data and subflow-level
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ }
-+
-+ /* We are sending mp-fail's and thus are in fallback mode.
-+ * Ignore packets which do not announce the fallback and still
-+ * want to provide a mapping.
-+ */
-+ if (tp->mptcp->send_mp_fail) {
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+
-+ /* FIN increased the mapping-length by 1 */
-+ if (mptcp_is_data_fin(skb))
-+ data_len--;
-+
-+ /* The subflow-sequences of the packet must be
-+ * (at least partially) part of the DSS-mapping's
-+ * subflow-sequence-space.
-+ *
-+ * Basically the mapping is not valid, if either of the
-+ * following conditions is true:
-+ *
-+ * 1. It's not a data_fin and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * 2. It's a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq >= TCP-end_seq
-+ *
-+ * The previous two can be merged into:
-+ * TCP-end_seq > TCP-seq and MPTCP-sub_seq >= TCP-end_seq
-+ * Because if it's not a data-fin, TCP-end_seq > TCP-seq
-+ *
-+ * 3. It's a data_fin and skb->len == 0 and
-+ * MPTCP-sub_seq > TCP-end_seq
-+ *
-+ * 4. It's not a data_fin and TCP-end_seq > TCP-seq and
-+ * MPTCP-sub_seq + MPTCP-data_len <= TCP-seq
-+ *
-+ * 5. MPTCP-sub_seq is prior to what we already copied (copied_seq)
-+ */
-+
-+ /* subflow-fin is not part of the mapping - ignore it here ! */
-+ tcp_end_seq = tcb->end_seq - tcp_hdr(skb)->fin;
-+ if ((!before(sub_seq, tcb->end_seq) && after(tcp_end_seq, tcb->seq)) ||
-+ (mptcp_is_data_fin(skb) && skb->len == 0 && after(sub_seq, tcb->end_seq)) ||
-+ (!after(sub_seq + data_len, tcb->seq) && after(tcp_end_seq, tcb->seq)) ||
-+ before(sub_seq, tp->copied_seq)) {
-+ /* Subflow-sequences of packet is different from what is in the
-+ * packet's dss-mapping. The peer is misbehaving - reset
-+ */
-+ pr_err("%s Packet's mapping does not map to the DSS sub_seq %u "
-+ "end_seq %u, tcp_end_seq %u seq %u dfin %u len %u data_len %u "
-+ "copied_seq %u\n", __func__, sub_seq, tcb->end_seq, tcp_end_seq, tcb->seq, mptcp_is_data_fin(skb),
-+ skb->len, data_len, tp->copied_seq);
-+ mptcp_send_reset(sk);
-+ return 1;
-+ }
-+
-+ /* Did the DSS have 64-bit seqnums? */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_SEQ64_SET)) {
-+ /* Wrapped around? */
-+ if (unlikely(after(data_seq, meta_tp->rcv_nxt) && data_seq < meta_tp->rcv_nxt)) {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, !mpcb->rcv_hiseq_index, data_seq);
-+ } else {
-+ /* Else, access the default high-order bits */
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index, data_seq);
-+ }
-+ } else {
-+ tp->mptcp->map_data_seq = mptcp_get_data_seq_64(mpcb, (tcb->mptcp_flags & MPTCPHDR_SEQ64_INDEX) ? 1 : 0, data_seq);
-+
-+ if (unlikely(tcb->mptcp_flags & MPTCPHDR_SEQ64_OFO)) {
-+ /* We make sure that the data_seq is invalid.
-+ * It will be dropped later.
-+ */
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ tp->mptcp->map_data_seq += 0xFFFFFFFF;
-+ }
-+ }
-+
-+ tp->mptcp->map_data_len = data_len;
-+ tp->mptcp->map_subseq = sub_seq;
-+ tp->mptcp->map_data_fin = mptcp_is_data_fin(skb) ? 1 : 0;
-+ tp->mptcp->mapping_present = 1;
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_sequence(...) */
-+static inline bool mptcp_sequence(const struct tcp_sock *meta_tp,
-+ u64 data_seq, u64 end_data_seq)
-+{
-+ const struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u64 rcv_wup64;
-+
-+ /* Wrap-around? */
-+ if (meta_tp->rcv_wup > meta_tp->rcv_nxt) {
-+ rcv_wup64 = ((u64)(mpcb->rcv_high_order[mpcb->rcv_hiseq_index] - 1) << 32) |
-+ meta_tp->rcv_wup;
-+ } else {
-+ rcv_wup64 = mptcp_get_data_seq_64(mpcb, mpcb->rcv_hiseq_index,
-+ meta_tp->rcv_wup);
-+ }
-+
-+ return !before64(end_data_seq, rcv_wup64) &&
-+ !after64(data_seq, mptcp_get_rcv_nxt_64(meta_tp) + tcp_receive_window(meta_tp));
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * -1 this packet was broken - continue with the next one.
-+ */
-+static int mptcp_validate_mapping(struct sock *sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sk_buff *tmp, *tmp1;
-+ u32 tcp_end_seq;
-+
-+ if (!tp->mptcp->mapping_present)
-+ return 0;
-+
-+ /* either, the new skb gave us the mapping and the first segment
-+ * in the sub-rcv-queue has to be trimmed ...
-+ */
-+ tmp = skb_peek(&sk->sk_receive_queue);
-+ if (before(TCP_SKB_CB(tmp)->seq, tp->mptcp->map_subseq) &&
-+ after(TCP_SKB_CB(tmp)->end_seq, tp->mptcp->map_subseq))
-+ mptcp_skb_trim_head(tmp, sk, tp->mptcp->map_subseq);
-+
-+ /* ... or the new skb (tail) has to be split at the end. */
-+ tcp_end_seq = TCP_SKB_CB(skb)->end_seq - (tcp_hdr(skb)->fin ? 1 : 0);
-+ if (after(tcp_end_seq, tp->mptcp->map_subseq + tp->mptcp->map_data_len)) {
-+ u32 seq = tp->mptcp->map_subseq + tp->mptcp->map_data_len;
-+ if (mptcp_skb_split_tail(skb, sk, seq)) { /* Allocation failed */
-+ /* TODO : maybe handle this here better.
-+ * We now just force meta-retransmission.
-+ */
-+ tp->copied_seq = TCP_SKB_CB(skb)->end_seq;
-+ __skb_unlink(skb, &sk->sk_receive_queue);
-+ __kfree_skb(skb);
-+ return -1;
-+ }
-+ }
-+
-+ /* Now, remove old sk_buff's from the receive-queue.
-+ * This may happen if the mapping has been lost for these segments and
-+ * the next mapping has already been received.
-+ */
-+ if (before(TCP_SKB_CB(skb_peek(&sk->sk_receive_queue))->seq, tp->mptcp->map_subseq)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ if (!before(TCP_SKB_CB(tmp1)->seq, tp->mptcp->map_subseq))
-+ break;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+
-+ /* Impossible that we could free the skb here, because its
-+ * mapping is known to be valid from the previous checks
-+ */
-+ __kfree_skb(tmp1);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* @return: 0 everything is fine. Just continue processing
-+ * 1 subflow is broken stop everything
-+ * -1 this mapping has been put in the meta-receive-queue
-+ * -2 this mapping has been eaten by the application
-+ */
-+static int mptcp_queue_skb(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sk_buff *tmp, *tmp1;
-+ u64 rcv_nxt64 = mptcp_get_rcv_nxt_64(meta_tp);
-+ bool data_queued = false;
-+
-+ /* Have we not yet received the full mapping? */
-+ if (!tp->mptcp->mapping_present ||
-+ before(tp->rcv_nxt, tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ return 0;
-+
-+ /* Is this an overlapping mapping? rcv_nxt >= end_data_seq
-+ * OR
-+ * This mapping is out of window
-+ */
-+ if (!before64(rcv_nxt64, tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin) ||
-+ !mptcp_sequence(meta_tp, tp->mptcp->map_data_seq,
-+ tp->mptcp->map_data_seq + tp->mptcp->map_data_len + tp->mptcp->map_data_fin)) {
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+
-+ mptcp_reset_mapping(tp);
-+
-+ return -1;
-+ }
-+
-+ /* Record it, because we want to send our data_fin on the same path */
-+ if (tp->mptcp->map_data_fin) {
-+ mpcb->dfin_path_index = tp->mptcp->path_index;
-+ mpcb->dfin_combined = !!(sk->sk_shutdown & RCV_SHUTDOWN);
-+ }
-+
-+ /* Verify the checksum */
-+ if (mpcb->dss_csum && !mpcb->infinite_mapping_rcv) {
-+ int ret = mptcp_verif_dss_csum(sk);
-+
-+ if (ret <= 0) {
-+ mptcp_reset_mapping(tp);
-+ return 1;
-+ }
-+ }
-+
-+ if (before64(rcv_nxt64, tp->mptcp->map_data_seq)) {
-+ /* Seg's have to go to the meta-ofo-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true later.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ if (!mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ mptcp_add_meta_ofo_queue(meta_sk, tmp1, sk);
-+ else
-+ __kfree_skb(tmp1);
-+
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ tcp_enter_quickack_mode(sk);
-+ } else {
-+ /* Ready for the meta-rcv-queue */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, tmp1, tmp) {
-+ int eaten = 0;
-+ bool copied_early = false;
-+ bool fragstolen = false;
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+
-+ tp->copied_seq = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_prepare_skb(tmp1, sk);
-+ __skb_unlink(tmp1, &sk->sk_receive_queue);
-+ /* MUST be done here, because fragstolen may be true.
-+ * Then, kfree_skb_partial will not account the memory.
-+ */
-+ skb_orphan(tmp1);
-+
-+ /* This segment has already been received */
-+ if (!after(TCP_SKB_CB(tmp1)->end_seq, meta_tp->rcv_nxt)) {
-+ __kfree_skb(tmp1);
-+ goto next;
-+ }
-+
-+#ifdef CONFIG_NET_DMA
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ tmp1->len <= meta_tp->ucopy.len &&
-+ sock_owned_by_user(meta_sk) &&
-+ tcp_dma_try_early_copy(meta_sk, tmp1, 0)) {
-+ copied_early = true;
-+ eaten = 1;
-+ }
-+#endif
-+
-+ /* Is direct copy possible ? */
-+ if (TCP_SKB_CB(tmp1)->seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.task == current &&
-+ meta_tp->copied_seq == meta_tp->rcv_nxt &&
-+ meta_tp->ucopy.len && sock_owned_by_user(meta_sk) &&
-+ !copied_early)
-+ eaten = mptcp_direct_copy(tmp1, meta_sk);
-+
-+ if (mpcb->in_time_wait) /* In time-wait, do not receive data */
-+ eaten = 1;
-+
-+ if (!eaten)
-+ eaten = tcp_queue_rcv(meta_sk, tmp1, 0, &fragstolen);
-+
-+ meta_tp->rcv_nxt = TCP_SKB_CB(tmp1)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ meta_tp->cleanup_rbuf(meta_sk, tmp1->len);
-+#endif
-+
-+ if (tcp_hdr(tmp1)->fin && !mpcb->in_time_wait)
-+ mptcp_fin(meta_sk);
-+
-+ /* Check if this fills a gap in the ofo queue */
-+ if (!skb_queue_empty(&meta_tp->out_of_order_queue))
-+ mptcp_ofo_queue(meta_sk);
-+
-+#ifdef CONFIG_NET_DMA
-+ if (copied_early)
-+ __skb_queue_tail(&meta_sk->sk_async_wait_queue,
-+ tmp1);
-+ else
-+#endif
-+ if (eaten)
-+ kfree_skb_partial(tmp1, fragstolen);
-+
-+ data_queued = true;
-+next:
-+ if (!skb_queue_empty(&sk->sk_receive_queue) &&
-+ !before(TCP_SKB_CB(tmp)->seq,
-+ tp->mptcp->map_subseq + tp->mptcp->map_data_len))
-+ break;
-+ }
-+ }
-+
-+ inet_csk(meta_sk)->icsk_ack.lrcvtime = tcp_time_stamp;
-+ mptcp_reset_mapping(tp);
-+
-+ return data_queued ? -1 : -2;
-+}
-+
-+void mptcp_data_ready(struct sock *sk)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct sk_buff *skb, *tmp;
-+ int queued = 0;
-+
-+ /* restart before the check, because mptcp_fin might have changed the
-+ * state.
-+ */
-+restart:
-+ /* If the meta cannot receive data, there is no point in pushing data.
-+ * If we are in time-wait, we may still be waiting for the final FIN.
-+ * So, we should proceed with the processing.
-+ */
-+ if (!mptcp_sk_can_recv(meta_sk) && !tcp_sk(sk)->mpcb->in_time_wait) {
-+ skb_queue_purge(&sk->sk_receive_queue);
-+ tcp_sk(sk)->copied_seq = tcp_sk(sk)->rcv_nxt;
-+ goto exit;
-+ }
-+
-+ /* Iterate over all segments, detect their mapping (if we don't have
-+ * one yet), validate them and push everything one level higher.
-+ */
-+ skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp) {
-+ int ret;
-+ /* Pre-validation - e.g., early fallback */
-+ ret = mptcp_prevalidate_skb(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Set the current mapping */
-+ ret = mptcp_detect_mapping(sk, skb);
-+ if (ret < 0)
-+ goto restart;
-+ else if (ret > 0)
-+ break;
-+
-+ /* Validation */
-+ if (mptcp_validate_mapping(sk, skb) < 0)
-+ goto restart;
-+
-+ /* Push a level higher */
-+ ret = mptcp_queue_skb(sk);
-+ if (ret < 0) {
-+ if (ret == -1)
-+ queued = ret;
-+ goto restart;
-+ } else if (ret == 0) {
-+ continue;
-+ } else { /* ret == 1 */
-+ break;
-+ }
-+ }
-+
-+exit:
-+ if (tcp_sk(sk)->close_it) {
-+ tcp_send_ack(sk);
-+ tcp_sk(sk)->ops->time_wait(sk, TCP_TIME_WAIT, 0);
-+ }
-+
-+ if (queued == -1 && !sock_flag(meta_sk, SOCK_DEAD))
-+ meta_sk->sk_data_ready(meta_sk);
-+}
-+
-+
-+int mptcp_check_req(struct sk_buff *skb, struct net *net)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct sock *meta_sk = NULL;
-+
-+ /* MPTCP structures not initialized */
-+ if (mptcp_init_failed)
-+ return 0;
-+
-+ if (skb->protocol == htons(ETH_P_IP))
-+ meta_sk = mptcp_v4_search_req(th->source, ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr, net);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else /* IPv6 */
-+ meta_sk = mptcp_v6_search_req(th->source, &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, net);
-+#endif /* CONFIG_IPV6 */
-+
-+ if (!meta_sk)
-+ return 0;
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_search_req */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_vX_search_req */
-+ return 1;
-+}
-+
-+struct mp_join *mptcp_find_join(const struct sk_buff *skb)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether JOIN is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return NULL;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return NULL;
-+ if (opsize > length)
-+ return NULL; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)(ptr - 2))->sub == MPTCP_SUB_JOIN) {
-+ return (struct mp_join *)(ptr - 2);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return NULL;
-+}
-+
-+int mptcp_lookup_join(struct sk_buff *skb, struct inet_timewait_sock *tw)
-+{
-+ const struct mptcp_cb *mpcb;
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+ struct mp_join *join_opt = mptcp_find_join(skb);
-+ if (!join_opt)
-+ return 0;
-+
-+ /* MPTCP structures were not initialized, so return error */
-+ if (mptcp_init_failed)
-+ return -1;
-+
-+ token = join_opt->u.syn.token;
-+ meta_sk = mptcp_hash_find(dev_net(skb_dst(skb)->dev), token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ mpcb = tcp_sk(meta_sk)->mpcb;
-+ if (mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping) {
-+ /* We are in fallback-mode on the reception-side -
-+ * no new subflows!
-+ */
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ /* Coming from time-wait-sock processing in tcp_v4_rcv.
-+ * We have to deschedule it before continuing, because otherwise
-+ * mptcp_v4_do_rcv will hit again on it inside tcp_v4_hnd_req.
-+ */
-+ if (tw) {
-+ inet_twsk_deschedule(tw, &tcp_death_row);
-+ inet_twsk_put(tw);
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock_nested(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf))) {
-+ bh_unlock_sock(meta_sk);
-+ NET_INC_STATS_BH(sock_net(meta_sk),
-+ LINUX_MIB_TCPBACKLOGDROP);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ kfree_skb(skb);
-+ return 1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else {
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 1;
-+}
-+
-+int mptcp_do_join_short(struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt,
-+ struct net *net)
-+{
-+ struct sock *meta_sk;
-+ u32 token;
-+ bool meta_v4;
-+
-+ token = mopt->mptcp_rem_token;
-+ meta_sk = mptcp_hash_find(net, token);
-+ if (!meta_sk) {
-+ mptcp_debug("%s:mpcb not found:%x\n", __func__, token);
-+ return -1;
-+ }
-+
-+ meta_v4 = meta_sk->sk_family == AF_INET;
-+ if (meta_v4) {
-+ if (skb->protocol == htons(ETH_P_IPV6)) {
-+ mptcp_debug("SYN+MP_JOIN with IPV6 address on pure IPV4 meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+ } else if (skb->protocol == htons(ETH_P_IP) &&
-+ inet6_sk(meta_sk)->ipv6only) {
-+ mptcp_debug("SYN+MP_JOIN with IPV4 address on IPV6_V6ONLY meta\n");
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_JOIN;
-+
-+ /* OK, this is a new syn/join, let's create a new open request and
-+ * send syn+ack
-+ */
-+ bh_lock_sock(meta_sk);
-+
-+ /* This check is also done in mptcp_vX_do_rcv. But there we cannot
-+ * call tcp_vX_send_reset, because we already hold two socket-locks
-+ * (the listener and the meta from above)
-+ *
-+ * And the send-reset will try to take yet another one (ip_send_reply).
-+ * Thus, we propagate the reset up to tcp_rcv_state_process.
-+ */
-+ if (tcp_sk(meta_sk)->mpcb->infinite_mapping_rcv ||
-+ tcp_sk(meta_sk)->mpcb->send_infinite_mapping ||
-+ meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table) {
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return -1;
-+ }
-+
-+ if (sock_owned_by_user(meta_sk)) {
-+ skb->sk = meta_sk;
-+ if (unlikely(sk_add_backlog(meta_sk, skb,
-+ meta_sk->sk_rcvbuf + meta_sk->sk_sndbuf)))
-+ NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
-+ else
-+ /* Must make sure that upper layers won't free the
-+ * skb if it is added to the backlog-queue.
-+ */
-+ skb_get(skb);
-+ } else {
-+ /* mptcp_v4_do_rcv tries to free the skb - we prevent this, as
-+ * the skb will finally be freed by tcp_v4_do_rcv (where we are
-+ * coming from)
-+ */
-+ skb_get(skb);
-+ if (skb->protocol == htons(ETH_P_IP)) {
-+ tcp_v4_do_rcv(meta_sk, skb);
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else { /* IPv6 */
-+ tcp_v6_do_rcv(meta_sk, skb);
-+#endif /* CONFIG_IPV6 */
-+ }
-+ }
-+
-+ bh_unlock_sock(meta_sk);
-+ sock_put(meta_sk); /* Taken by mptcp_hash_find */
-+ return 0;
-+}
-+
-+/**
-+ * The MPTCP equivalent of tcp_fin(). May only be called when the FIN
-+ * is a valid part of the data seqnum space; not before, while there
-+ * are still holes.
-+ */
-+void mptcp_fin(struct sock *meta_sk)
-+{
-+ struct sock *sk = NULL, *sk_it;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (tcp_sk(sk_it)->mptcp->path_index == mpcb->dfin_path_index) {
-+ sk = sk_it;
-+ break;
-+ }
-+ }
-+
-+ if (!sk || sk->sk_state == TCP_CLOSE)
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ inet_csk_schedule_ack(sk);
-+
-+ meta_sk->sk_shutdown |= RCV_SHUTDOWN;
-+ sock_set_flag(meta_sk, SOCK_DONE);
-+
-+ switch (meta_sk->sk_state) {
-+ case TCP_SYN_RECV:
-+ case TCP_ESTABLISHED:
-+ /* Move to CLOSE_WAIT */
-+ tcp_set_state(meta_sk, TCP_CLOSE_WAIT);
-+ inet_csk(sk)->icsk_ack.pingpong = 1;
-+ break;
-+
-+ case TCP_CLOSE_WAIT:
-+ case TCP_CLOSING:
-+ /* Received a retransmission of the FIN, do
-+ * nothing.
-+ */
-+ break;
-+ case TCP_LAST_ACK:
-+ /* RFC793: Remain in the LAST-ACK state. */
-+ break;
-+
-+ case TCP_FIN_WAIT1:
-+ /* This case occurs when a simultaneous close
-+ * happens, we must ack the received FIN and
-+ * enter the CLOSING state.
-+ */
-+ tcp_send_ack(sk);
-+ tcp_set_state(meta_sk, TCP_CLOSING);
-+ break;
-+ case TCP_FIN_WAIT2:
-+ /* Received a FIN -- send ACK and enter TIME_WAIT. */
-+ tcp_send_ack(sk);
-+ meta_tp->ops->time_wait(meta_sk, TCP_TIME_WAIT, 0);
-+ break;
-+ default:
-+ /* Only TCP_LISTEN and TCP_CLOSE are left, in these
-+ * cases we should never reach this piece of code.
-+ */
-+ pr_err("%s: Impossible, meta_sk->sk_state=%d\n", __func__,
-+ meta_sk->sk_state);
-+ break;
-+ }
-+
-+ /* It _is_ possible, that we have something out-of-order _after_ FIN.
-+ * Probably, we should reset in this case. For now drop them.
-+ */
-+ mptcp_purge_ofo_queue(meta_tp);
-+ sk_mem_reclaim(meta_sk);
-+
-+ if (!sock_flag(meta_sk, SOCK_DEAD)) {
-+ meta_sk->sk_state_change(meta_sk);
-+
-+ /* Do not send POLL_HUP for half duplex close. */
-+ if (meta_sk->sk_shutdown == SHUTDOWN_MASK ||
-+ meta_sk->sk_state == TCP_CLOSE)
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_HUP);
-+ else
-+ sk_wake_async(meta_sk, SOCK_WAKE_WAITD, POLL_IN);
-+ }
-+
-+ return;
-+}
-+
-+static void mptcp_xmit_retransmit_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ if (!meta_tp->packets_out)
-+ return;
-+
-+ tcp_for_write_queue(skb, meta_sk) {
-+ if (skb == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (mptcp_retransmit_skb(meta_sk, skb))
-+ return;
-+
-+ if (skb == tcp_write_queue_head(meta_sk))
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ inet_csk(meta_sk)->icsk_rto,
-+ TCP_RTO_MAX);
-+ }
-+}
-+
-+/* Handle the DATA_ACK */
-+static void mptcp_data_ack(struct sock *sk, const struct sk_buff *skb)
-+{
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *tp = tcp_sk(sk);
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ u32 prior_snd_una = meta_tp->snd_una;
-+ int prior_packets;
-+ u32 nwin, data_ack, data_seq;
-+ u16 data_len = 0;
-+
-+ /* A valid packet came in - subflow is operational again */
-+ tp->pf = 0;
-+
-+ /* Even if there is no data-ack, we stop retransmitting.
-+ * Except if this is a SYN/ACK. Then it is just a retransmission
-+ */
-+ if (tp->mptcp->pre_established && !tcp_hdr(skb)->syn) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ }
-+
-+ /* If we are in infinite mapping mode, rx_opt.data_ack has been
-+ * set by mptcp_clean_rtx_infinite.
-+ */
-+ if (!(tcb->mptcp_flags & MPTCPHDR_ACK) && !tp->mpcb->infinite_mapping_snd)
-+ goto exit;
-+
-+ data_ack = tp->mptcp->rx_opt.data_ack;
-+
-+ if (unlikely(!tp->mptcp->fully_established) &&
-+ tp->mptcp->snt_isn + 1 != TCP_SKB_CB(skb)->ack_seq)
-+ /* As soon as a subflow-data-ack (not acking syn, thus snt_isn + 1)
-+ * includes a data-ack, we are fully established
-+ */
-+ mptcp_become_fully_estab(sk);
-+
-+ /* Get the data_seq */
-+ if (mptcp_is_data_seq(skb)) {
-+ data_seq = tp->mptcp->rx_opt.data_seq;
-+ data_len = tp->mptcp->rx_opt.data_len;
-+ } else {
-+ data_seq = meta_tp->snd_wl1;
-+ }
-+
-+ /* If the ack is older than previous acks
-+ * then we can probably ignore it.
-+ */
-+ if (before(data_ack, prior_snd_una))
-+ goto exit;
-+
-+ /* If the ack includes data we haven't sent yet, discard
-+ * this segment (RFC793 Section 3.9).
-+ */
-+ if (after(data_ack, meta_tp->snd_nxt))
-+ goto exit;
-+
-+ /*** Now, update the window - inspired by tcp_ack_update_window ***/
-+ nwin = ntohs(tcp_hdr(skb)->window);
-+
-+ if (likely(!tcp_hdr(skb)->syn))
-+ nwin <<= tp->rx_opt.snd_wscale;
-+
-+ if (tcp_may_update_window(meta_tp, data_ack, data_seq, nwin)) {
-+ tcp_update_wl(meta_tp, data_seq);
-+
-+ /* Draft v09, Section 3.3.5:
-+ * [...] It should only update its local receive window values
-+ * when the largest sequence number allowed (i.e. DATA_ACK +
-+ * receive window) increases. [...]
-+ */
-+ if (meta_tp->snd_wnd != nwin &&
-+ !before(data_ack + nwin, tcp_wnd_end(meta_tp))) {
-+ meta_tp->snd_wnd = nwin;
-+
-+ if (nwin > meta_tp->max_window)
-+ meta_tp->max_window = nwin;
-+ }
-+ }
-+ /*** Done, update the window ***/
-+
-+ /* We passed data and got it acked, remove any soft error
-+ * log. Something worked...
-+ */
-+ sk->sk_err_soft = 0;
-+ inet_csk(meta_sk)->icsk_probes_out = 0;
-+ meta_tp->rcv_tstamp = tcp_time_stamp;
-+ prior_packets = meta_tp->packets_out;
-+ if (!prior_packets)
-+ goto no_queue;
-+
-+ meta_tp->snd_una = data_ack;
-+
-+ mptcp_clean_rtx_queue(meta_sk, prior_snd_una);
-+
-+ /* We are in loss-state, and something got acked, retransmit the whole
-+ * queue now!
-+ */
-+ if (inet_csk(meta_sk)->icsk_ca_state == TCP_CA_Loss &&
-+ after(data_ack, prior_snd_una)) {
-+ mptcp_xmit_retransmit_queue(meta_sk);
-+ inet_csk(meta_sk)->icsk_ca_state = TCP_CA_Open;
-+ }
-+
-+ /* Simplified version of tcp_new_space, because the snd-buffer
-+ * is handled by all the subflows.
-+ */
-+ if (sock_flag(meta_sk, SOCK_QUEUE_SHRUNK)) {
-+ sock_reset_flag(meta_sk, SOCK_QUEUE_SHRUNK);
-+ if (meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+
-+ if (meta_sk->sk_state != TCP_ESTABLISHED &&
-+ mptcp_rcv_state_process(meta_sk, sk, skb, data_seq, data_len))
-+ return;
-+
-+exit:
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+
-+no_queue:
-+ if (tcp_send_head(meta_sk))
-+ tcp_ack_probe(meta_sk);
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ return;
-+}
-+
-+void mptcp_clean_rtx_infinite(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = tcp_sk(mptcp_meta_sk(sk));
-+
-+ if (!tp->mpcb->infinite_mapping_snd)
-+ return;
-+
-+ /* The difference between both write_seq's represents the offset between
-+ * data-sequence and subflow-sequence. As we are infinite, this must
-+ * match.
-+ *
-+ * Thus, from this difference we can infer the meta snd_una.
-+ */
-+ tp->mptcp->rx_opt.data_ack = meta_tp->snd_nxt - tp->snd_nxt +
-+ tp->snd_una;
-+
-+ mptcp_data_ack(sk, skb);
-+}
-+
-+/**** static functions used by mptcp_parse_options */
-+
-+static void mptcp_send_reset_rem_id(const struct mptcp_cb *mpcb, u8 rem_id)
-+{
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->mptcp->rem_id == rem_id) {
-+ mptcp_reinject_data(sk_it, 0);
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_sk(sk_it)->ops->send_active_reset(sk_it,
-+ GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+}
-+
-+void mptcp_parse_options(const uint8_t *ptr, int opsize,
-+ struct mptcp_options_received *mopt,
-+ const struct sk_buff *skb)
-+{
-+ const struct mptcp_option *mp_opt = (struct mptcp_option *)ptr;
-+
-+ /* If the socket is mp-capable we would have a mopt. */
-+ if (!mopt)
-+ return;
-+
-+ switch (mp_opt->sub) {
-+ case MPTCP_SUB_CAPABLE:
-+ {
-+ const struct mp_capable *mpcapable = (struct mp_capable *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_CAPABLE_SYN &&
-+ opsize != MPTCP_SUB_LEN_CAPABLE_ACK) {
-+ mptcp_debug("%s: mp_capable: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (!sysctl_mptcp_enabled)
-+ break;
-+
-+ /* We only support MPTCP version 0 */
-+ if (mpcapable->ver != 0)
-+ break;
-+
-+ /* MPTCP-RFC 6824:
-+ * "If receiving a message with the 'B' flag set to 1, and this
-+ * is not understood, then this SYN MUST be silently ignored;
-+ */
-+ if (mpcapable->b) {
-+ mopt->drop_me = 1;
-+ break;
-+ }
-+
-+ /* MPTCP-RFC 6824:
-+ * "An implementation that only supports this method MUST set
-+ * bit "H" to 1, and bits "C" through "G" to 0."
-+ */
-+ if (!mpcapable->h)
-+ break;
-+
-+ mopt->saw_mpc = 1;
-+ mopt->dss_csum = sysctl_mptcp_checksum || mpcapable->a;
-+
-+ if (opsize >= MPTCP_SUB_LEN_CAPABLE_SYN)
-+ mopt->mptcp_key = mpcapable->sender_key;
-+
-+ break;
-+ }
-+ case MPTCP_SUB_JOIN:
-+ {
-+ const struct mp_join *mpjoin = (struct mp_join *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_JOIN_SYN &&
-+ opsize != MPTCP_SUB_LEN_JOIN_SYNACK &&
-+ opsize != MPTCP_SUB_LEN_JOIN_ACK) {
-+ mptcp_debug("%s: mp_join: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* saw_mpc must be set, because in tcp_check_req we assume that
-+ * it is set to support falling back to reg. TCP if a rexmitted
-+ * SYN has no MP_CAPABLE or MP_JOIN
-+ */
-+ switch (opsize) {
-+ case MPTCP_SUB_LEN_JOIN_SYN:
-+ mopt->is_mp_join = 1;
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_rem_token = mpjoin->u.syn.token;
-+ mopt->mptcp_recv_nonce = mpjoin->u.syn.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_SYNACK:
-+ mopt->saw_mpc = 1;
-+ mopt->low_prio = mpjoin->b;
-+ mopt->rem_id = mpjoin->addr_id;
-+ mopt->mptcp_recv_tmac = mpjoin->u.synack.mac;
-+ mopt->mptcp_recv_nonce = mpjoin->u.synack.nonce;
-+ break;
-+ case MPTCP_SUB_LEN_JOIN_ACK:
-+ mopt->saw_mpc = 1;
-+ mopt->join_ack = 1;
-+ memcpy(mopt->mptcp_recv_mac, mpjoin->u.ack.mac, 20);
-+ break;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_DSS:
-+ {
-+ const struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+
-+ /* We check opsize for the csum and non-csum case. We do this,
-+ * because the draft says that the csum SHOULD be ignored if
-+ * it has not been negotiated in the MP_CAPABLE but still is
-+ * present in the data.
-+ *
-+ * It will get ignored later in mptcp_queue_skb.
-+ */
-+ if (opsize != mptcp_sub_len_dss(mdss, 0) &&
-+ opsize != mptcp_sub_len_dss(mdss, 1)) {
-+ mptcp_debug("%s: mp_dss: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ ptr += 4;
-+
-+ if (mdss->A) {
-+ tcb->mptcp_flags |= MPTCPHDR_ACK;
-+
-+ if (mdss->a) {
-+ mopt->data_ack = (u32) get_unaligned_be64(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK_64;
-+ } else {
-+ mopt->data_ack = get_unaligned_be32(ptr);
-+ ptr += MPTCP_SUB_LEN_ACK;
-+ }
-+ }
-+
-+ tcb->dss_off = (ptr - skb_transport_header(skb));
-+
-+ if (mdss->M) {
-+ if (mdss->m) {
-+ u64 data_seq64 = get_unaligned_be64(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ64_SET;
-+ mopt->data_seq = (u32) data_seq64;
-+
-+ ptr += 12; /* 64-bit dseq + subseq */
-+ } else {
-+ mopt->data_seq = get_unaligned_be32(ptr);
-+ ptr += 8; /* 32-bit dseq + subseq */
-+ }
-+ mopt->data_len = get_unaligned_be16(ptr);
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ /* Is a check-sum present? */
-+ if (opsize == mptcp_sub_len_dss(mdss, 1))
-+ tcb->mptcp_flags |= MPTCPHDR_DSS_CSUM;
-+
-+ /* DATA_FIN only possible with DSS-mapping */
-+ if (mdss->F)
-+ tcb->mptcp_flags |= MPTCPHDR_FIN;
-+ }
-+
-+ break;
-+ }
-+ case MPTCP_SUB_ADD_ADDR:
-+ {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ const struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2)) {
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) {
-+#endif /* CONFIG_IPV6 */
-+ mptcp_debug("%s: mp_add_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ /* We have to manually parse the options if we got two of them. */
-+ if (mopt->saw_add_addr) {
-+ mopt->more_add_addr = 1;
-+ break;
-+ }
-+ mopt->saw_add_addr = 1;
-+ mopt->add_addr_ptr = ptr;
-+ break;
-+ }
-+ case MPTCP_SUB_REMOVE_ADDR:
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0) {
-+ mptcp_debug("%s: mp_remove_addr: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ if (mopt->saw_rem_addr) {
-+ mopt->more_rem_addr = 1;
-+ break;
-+ }
-+ mopt->saw_rem_addr = 1;
-+ mopt->rem_addr_ptr = ptr;
-+ break;
-+ case MPTCP_SUB_PRIO:
-+ {
-+ const struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ if (opsize != MPTCP_SUB_LEN_PRIO &&
-+ opsize != MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mptcp_debug("%s: mp_prio: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->saw_low_prio = 1;
-+ mopt->low_prio = mpprio->b;
-+
-+ if (opsize == MPTCP_SUB_LEN_PRIO_ADDR) {
-+ mopt->saw_low_prio = 2;
-+ mopt->prio_addr_id = mpprio->addr_id;
-+ }
-+ break;
-+ }
-+ case MPTCP_SUB_FAIL:
-+ if (opsize != MPTCP_SUB_LEN_FAIL) {
-+ mptcp_debug("%s: mp_fail: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+ mopt->mp_fail = 1;
-+ break;
-+ case MPTCP_SUB_FCLOSE:
-+ if (opsize != MPTCP_SUB_LEN_FCLOSE) {
-+ mptcp_debug("%s: mp_fclose: bad option size %d\n",
-+ __func__, opsize);
-+ break;
-+ }
-+
-+ mopt->mp_fclose = 1;
-+ mopt->mptcp_key = ((struct mp_fclose *)ptr)->key;
-+
-+ break;
-+ default:
-+ mptcp_debug("%s: Received unknown subtype: %d\n",
-+ __func__, mp_opt->sub);
-+ break;
-+ }
-+}
-+
-+/** Parse only MPTCP options */
-+void tcp_parse_mptcp_options(const struct sk_buff *skb,
-+ struct mptcp_options_received *mopt)
-+{
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+ const unsigned char *ptr = (const unsigned char *)(th + 1);
-+
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2) /* "silly options" */
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP)
-+ mptcp_parse_options(ptr - 2, opsize, mopt, skb);
-+ }
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+}
-+
-+int mptcp_check_rtt(const struct tcp_sock *tp, int time)
-+{
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ struct sock *sk;
-+ u32 rtt_max = 0;
-+
-+ /* In MPTCP, we take the max delay across all flows,
-+ * in order to take into account meta-reordering buffers.
-+ */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (!mptcp_sk_can_recv(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->rcv_rtt_est.rtt)
-+ rtt_max = tcp_sk(sk)->rcv_rtt_est.rtt;
-+ }
-+ if (time < (rtt_max >> 3) || !rtt_max)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+static void mptcp_handle_add_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ __be16 port = 0;
-+ union inet_addr addr;
-+ sa_family_t family;
-+
-+ if (mpadd->ipver == 4) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+ port = mpadd->u.v4.port;
-+ family = AF_INET;
-+ addr.in = mpadd->u.v4.addr;
-+#if IS_ENABLED(CONFIG_IPV6)
-+ } else if (mpadd->ipver == 6) {
-+ if (mpadd->len == MPTCP_SUB_LEN_ADD_ADDR6 + 2)
-+ port = mpadd->u.v6.port;
-+ family = AF_INET6;
-+ addr.in6 = mpadd->u.v6.addr;
-+#endif /* CONFIG_IPV6 */
-+ } else {
-+ return;
-+ }
-+
-+ if (mpcb->pm_ops->add_raddr)
-+ mpcb->pm_ops->add_raddr(mpcb, &addr, family, port, mpadd->addr_id);
-+}
-+
-+static void mptcp_handle_rem_addr(const unsigned char *ptr, struct sock *sk)
-+{
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ int i;
-+ u8 rem_id;
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ for (i = 0; i <= mprem->len - MPTCP_SUB_LEN_REMOVE_ADDR; i++) {
-+ rem_id = (&mprem->addrs_id)[i];
-+
-+ if (mpcb->pm_ops->rem_raddr)
-+ mpcb->pm_ops->rem_raddr(mpcb, rem_id);
-+ mptcp_send_reset_rem_id(mpcb, rem_id);
-+ }
-+}
-+
-+static void mptcp_parse_addropt(const struct sk_buff *skb, struct sock *sk)
-+{
-+ struct tcphdr *th = tcp_hdr(skb);
-+ unsigned char *ptr;
-+ int length = (th->doff * 4) - sizeof(struct tcphdr);
-+
-+ /* Jump through the options to check whether ADD_ADDR is there */
-+ ptr = (unsigned char *)(th + 1);
-+ while (length > 0) {
-+ int opcode = *ptr++;
-+ int opsize;
-+
-+ switch (opcode) {
-+ case TCPOPT_EOL:
-+ return;
-+ case TCPOPT_NOP:
-+ length--;
-+ continue;
-+ default:
-+ opsize = *ptr++;
-+ if (opsize < 2)
-+ return;
-+ if (opsize > length)
-+ return; /* don't parse partial options */
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_ADD_ADDR) {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+ if ((mpadd->ipver == 4 && opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2) ||
-+ (mpadd->ipver == 6 && opsize != MPTCP_SUB_LEN_ADD_ADDR6 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR6 + 2))
-+#else
-+ if (opsize != MPTCP_SUB_LEN_ADD_ADDR4 &&
-+ opsize != MPTCP_SUB_LEN_ADD_ADDR4 + 2)
-+#endif /* CONFIG_IPV6 */
-+ goto cont;
-+
-+ mptcp_handle_add_addr(ptr, sk);
-+ }
-+ if (opcode == TCPOPT_MPTCP &&
-+ ((struct mptcp_option *)ptr)->sub == MPTCP_SUB_REMOVE_ADDR) {
-+ if ((opsize - MPTCP_SUB_LEN_REMOVE_ADDR) < 0)
-+ goto cont;
-+
-+ mptcp_handle_rem_addr(ptr, sk);
-+ }
-+cont:
-+ ptr += opsize - 2;
-+ length -= opsize;
-+ }
-+ }
-+ return;
-+}
-+
-+static inline int mptcp_mp_fail_rcvd(struct sock *sk, const struct tcphdr *th)
-+{
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk)->mptcp;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ if (unlikely(mptcp->rx_opt.mp_fail)) {
-+ mptcp->rx_opt.mp_fail = 0;
-+
-+ if (!th->rst && !mpcb->infinite_mapping_snd) {
-+ struct sock *sk_it;
-+
-+ mpcb->send_infinite_mapping = 1;
-+ /* We resend everything that has not been acknowledged */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+
-+ /* We artificially restart the whole send-queue. Thus,
-+ * it is as if no packets are in flight
-+ */
-+ tcp_sk(meta_sk)->packets_out = 0;
-+
-+ /* If the snd_nxt already wrapped around, we have to
-+ * undo the wrapping, as we are restarting from snd_una
-+ * on.
-+ */
-+ if (tcp_sk(meta_sk)->snd_nxt < tcp_sk(meta_sk)->snd_una) {
-+ mpcb->snd_high_order[mpcb->snd_hiseq_index] -= 2;
-+ mpcb->snd_hiseq_index = mpcb->snd_hiseq_index ? 0 : 1;
-+ }
-+ tcp_sk(meta_sk)->snd_nxt = tcp_sk(meta_sk)->snd_una;
-+
-+ /* Trigger a sending on the meta. */
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (sk != sk_it)
-+ mptcp_sub_force_close(sk_it);
-+ }
-+ }
-+
-+ return 0;
-+ }
-+
-+ if (unlikely(mptcp->rx_opt.mp_fclose)) {
-+ struct sock *sk_it, *tmpsk;
-+
-+ mptcp->rx_opt.mp_fclose = 0;
-+ if (mptcp->rx_opt.mptcp_key != mpcb->mptcp_loc_key)
-+ return 0;
-+
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_sk(sk)->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk)
-+ mptcp_sub_force_close(sk_it);
-+
-+ tcp_reset(meta_sk);
-+
-+ return 1;
-+ }
-+
-+ return 0;
-+}
-+
-+static inline void mptcp_path_array_check(struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+
-+ if (unlikely(mpcb->list_rcvd)) {
-+ mpcb->list_rcvd = 0;
-+ if (mpcb->pm_ops->new_remote_address)
-+ mpcb->pm_ops->new_remote_address(meta_sk);
-+ }
-+}
-+
-+int mptcp_handle_options(struct sock *sk, const struct tcphdr *th,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_options_received *mopt = &tp->mptcp->rx_opt;
-+
-+ if (tp->mpcb->infinite_mapping_rcv || tp->mpcb->infinite_mapping_snd)
-+ return 0;
-+
-+ if (mptcp_mp_fail_rcvd(sk, th))
-+ return 1;
-+
-+ /* RFC 6824, Section 3.3:
-+ * If a checksum is not present when its use has been negotiated, the
-+ * receiver MUST close the subflow with a RST as it is considered broken.
-+ */
-+ if (mptcp_is_data_seq(skb) && tp->mpcb->dss_csum &&
-+ !(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_DSS_CSUM)) {
-+ if (tcp_need_reset(sk->sk_state))
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* We have to acknowledge retransmissions of the third
-+ * ack.
-+ */
-+ if (mopt->join_ack) {
-+ tcp_send_delayed_ack(sk);
-+ mopt->join_ack = 0;
-+ }
-+
-+ if (mopt->saw_add_addr || mopt->saw_rem_addr) {
-+ if (mopt->more_add_addr || mopt->more_rem_addr) {
-+ mptcp_parse_addropt(skb, sk);
-+ } else {
-+ if (mopt->saw_add_addr)
-+ mptcp_handle_add_addr(mopt->add_addr_ptr, sk);
-+ if (mopt->saw_rem_addr)
-+ mptcp_handle_rem_addr(mopt->rem_addr_ptr, sk);
-+ }
-+
-+ mopt->more_add_addr = 0;
-+ mopt->saw_add_addr = 0;
-+ mopt->more_rem_addr = 0;
-+ mopt->saw_rem_addr = 0;
-+ }
-+ if (mopt->saw_low_prio) {
-+ if (mopt->saw_low_prio == 1) {
-+ tp->mptcp->rcv_low_prio = mopt->low_prio;
-+ } else {
-+ struct sock *sk_it;
-+ mptcp_for_each_sk(tp->mpcb, sk_it) {
-+ struct mptcp_tcp_sock *mptcp = tcp_sk(sk_it)->mptcp;
-+ if (mptcp->rem_id == mopt->prio_addr_id)
-+ mptcp->rcv_low_prio = mopt->low_prio;
-+ }
-+ }
-+ mopt->saw_low_prio = 0;
-+ }
-+
-+ mptcp_data_ack(sk, skb);
-+
-+ mptcp_path_array_check(mptcp_meta_sk(sk));
-+ /* Socket may have been mp_killed by a REMOVE_ADDR */
-+ if (tp->mp_killed)
-+ return 1;
-+
-+ return 0;
-+}
-+
-+/* In case of fastopen, some data can already be in the write queue.
-+ * We need to update the sequence number of the segments as they
-+ * were initially TCP sequence numbers.
-+ */
-+static void mptcp_rcv_synsent_fastopen(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct tcp_sock *master_tp = tcp_sk(meta_tp->mpcb->master_sk);
-+ struct sk_buff *skb;
-+ u32 new_mapping = meta_tp->write_seq - master_tp->snd_una;
-+
-+ /* There should only be one skb in write queue: the data not
-+ * acknowledged in the SYN+ACK. In this case, we need to map
-+ * this data to data sequence numbers.
-+ */
-+ skb_queue_walk(&meta_sk->sk_write_queue, skb) {
-+ /* If the server only acknowledges partially the data sent in
-+ * the SYN, we need to trim the acknowledged part because
-+ * we don't want to retransmit this already received data.
-+ * When we reach this point, tcp_ack() has already cleaned up
-+ * fully acked segments. However, tcp trims partially acked
-+ * segments only when retransmitting. Since MPTCP comes into
-+ * play only now, we will fake an initial transmit, and
-+ * retransmit_skb() will not be called. The following fragment
-+ * comes from __tcp_retransmit_skb().
-+ */
-+ if (before(TCP_SKB_CB(skb)->seq, master_tp->snd_una)) {
-+ BUG_ON(before(TCP_SKB_CB(skb)->end_seq,
-+ master_tp->snd_una));
-+ /* tcp_trim_head can only return ENOMEM if skb is
-+ * cloned. It is not the case here (see
-+ * tcp_send_syn_data).
-+ */
-+ BUG_ON(tcp_trim_head(meta_sk, skb, master_tp->snd_una -
-+ TCP_SKB_CB(skb)->seq));
-+ }
-+
-+ TCP_SKB_CB(skb)->seq += new_mapping;
-+ TCP_SKB_CB(skb)->end_seq += new_mapping;
-+ }
-+
-+ /* We can advance write_seq by the number of bytes unacknowledged
-+ * and that were mapped in the previous loop.
-+ */
-+ meta_tp->write_seq += master_tp->write_seq - master_tp->snd_una;
-+
-+ /* The packets from the master_sk will be entailed to it later
-+ * Until that time, its write queue is empty, and
-+ * write_seq must align with snd_una
-+ */
-+ master_tp->snd_nxt = master_tp->write_seq = master_tp->snd_una;
-+ master_tp->packets_out = 0;
-+
-+ /* Although these data have been sent already over the subsk,
-+ * They have never been sent over the meta_sk, so we rewind
-+ * the send_head so that tcp considers it as an initial send
-+ * (instead of retransmit).
-+ */
-+ meta_sk->sk_send_head = tcp_write_queue_head(meta_sk);
-+}
-+
-+/* The skptr is needed, because if we become MPTCP-capable, we have to switch
-+ * from meta-socket to master-socket.
-+ *
-+ * @return: 1 - we want to reset this connection
-+ * 2 - we want to discard the received syn/ack
-+ * 0 - everything is fine - continue
-+ */
-+int mptcp_rcv_synsent_state_process(struct sock *sk, struct sock **skptr,
-+ const struct sk_buff *skb,
-+ const struct mptcp_options_received *mopt)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (mptcp(tp)) {
-+ u8 hash_mac_check[20];
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u32 *)hash_mac_check);
-+ if (memcmp(hash_mac_check,
-+ (char *)&tp->mptcp->rx_opt.mptcp_recv_tmac, 8)) {
-+ mptcp_sub_force_close(sk);
-+ return 1;
-+ }
-+
-+ /* Set this flag in order to postpone data sending
-+ * until the 4th ack arrives.
-+ */
-+ tp->mptcp->pre_established = 1;
-+ tp->mptcp->rcv_low_prio = tp->mptcp->rx_opt.low_prio;
-+
-+ mptcp_hmac_sha1((u8 *)&mpcb->mptcp_loc_key,
-+ (u8 *)&mpcb->mptcp_rem_key,
-+ (u8 *)&tp->mptcp->mptcp_loc_nonce,
-+ (u8 *)&tp->mptcp->rx_opt.mptcp_recv_nonce,
-+ (u32 *)&tp->mptcp->sender_mac[0]);
-+
-+ } else if (mopt->saw_mpc) {
-+ struct sock *meta_sk = sk;
-+
-+ if (mptcp_create_master_sk(sk, mopt->mptcp_key,
-+ ntohs(tcp_hdr(skb)->window)))
-+ return 2;
-+
-+ sk = tcp_sk(sk)->mpcb->master_sk;
-+ *skptr = sk;
-+ tp = tcp_sk(sk);
-+
-+ /* If fastopen was used data might be in the send queue. We
-+ * need to update their sequence number to MPTCP-level seqno.
-+ * Note that it can happen in rare cases that fastopen_req is
-+ * NULL and syn_data is 0 but fastopen indeed occurred and
-+ * data has been queued in the write queue (but not sent).
-+ * Example of such rare cases: connect is non-blocking and
-+ * TFO is configured to work without cookies.
-+ */
-+ if (!skb_queue_empty(&meta_sk->sk_write_queue))
-+ mptcp_rcv_synsent_fastopen(meta_sk);
-+
-+ /* -1, because the SYN consumed 1 byte. In case of TFO, we
-+ * start the subflow-sequence number as if the data of the SYN
-+ * is not part of any mapping.
-+ */
-+ tp->mptcp->snt_isn = tp->snd_una - 1;
-+ tp->mpcb->dss_csum = mopt->dss_csum;
-+ tp->mptcp->include_mpc = 1;
-+
-+ /* Ensure that fastopen is handled at the meta-level. */
-+ tp->fastopen_req = NULL;
-+
-+ sk_set_socket(sk, mptcp_meta_sk(sk)->sk_socket);
-+ sk->sk_wq = mptcp_meta_sk(sk)->sk_wq;
-+
-+ /* hold in sk_clone_lock due to initialization to 2 */
-+ sock_put(sk);
-+ } else {
-+ tp->request_mptcp = 0;
-+
-+ if (tp->inside_tk_table)
-+ mptcp_hash_remove(tp);
-+ }
-+
-+ if (mptcp(tp))
-+ tp->mptcp->rcv_isn = TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
-+
-+bool mptcp_should_expand_sndbuf(const struct sock *sk)
-+{
-+ const struct sock *sk_it;
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int cnt_backups = 0;
-+ int backup_available = 0;
-+
-+ /* We circumvent this check in tcp_check_space, because we want to
-+ * always call sk_write_space. So, we reproduce the check here.
-+ */
-+ if (!meta_sk->sk_socket ||
-+ !test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags))
-+ return false;
-+
-+ /* If the user specified a specific send buffer setting, do
-+ * not modify it.
-+ */
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return false;
-+
-+ /* If we are under global TCP memory pressure, do not expand. */
-+ if (sk_under_memory_pressure(meta_sk))
-+ return false;
-+
-+ /* If we are under soft global TCP memory pressure, do not expand. */
-+ if (sk_memory_allocated(meta_sk) >= sk_prot_mem_limits(meta_sk, 0))
-+ return false;
-+
-+
-+ /* For MPTCP we look for a subsocket that could send data.
-+ * If we found one, then we update the send-buffer.
-+ */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ /* Backup-flows have to be counted - if there is no other
-+ * subflow we take the backup-flow into account.
-+ */
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if (tp_it->packets_out < tp_it->snd_cwnd) {
-+ if (tp_it->mptcp->rcv_low_prio || tp_it->mptcp->low_prio) {
-+ backup_available = 1;
-+ continue;
-+ }
-+ return true;
-+ }
-+ }
-+
-+ /* Backup-flow is available for sending - update send-buffer */
-+ if (meta_tp->mpcb->cnt_established == cnt_backups && backup_available)
-+ return true;
-+ return false;
-+}
-+
-+void mptcp_init_buffer_space(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ int space;
-+
-+ tcp_init_buffer_space(sk);
-+
-+ if (is_master_tp(tp)) {
-+ meta_tp->rcvq_space.space = meta_tp->rcv_wnd;
-+ meta_tp->rcvq_space.time = tcp_time_stamp;
-+ meta_tp->rcvq_space.seq = meta_tp->copied_seq;
-+
-+ /* If there is only one subflow, we just use regular TCP
-+ * autotuning. User-locks are handled already by
-+ * tcp_init_buffer_space
-+ */
-+ meta_tp->window_clamp = tp->window_clamp;
-+ meta_tp->rcv_ssthresh = tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = sk->sk_rcvbuf;
-+ meta_sk->sk_sndbuf = sk->sk_sndbuf;
-+
-+ return;
-+ }
-+
-+ if (meta_sk->sk_userlocks & SOCK_RCVBUF_LOCK)
-+ goto snd_buf;
-+
-+ /* Adding a new subflow to the rcv-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_rcvbuf + sk->sk_rcvbuf, sysctl_tcp_rmem[2]);
-+ if (space > meta_sk->sk_rcvbuf) {
-+ meta_tp->window_clamp += tp->window_clamp;
-+ meta_tp->rcv_ssthresh += tp->rcv_ssthresh;
-+ meta_sk->sk_rcvbuf = space;
-+ }
-+
-+snd_buf:
-+ if (meta_sk->sk_userlocks & SOCK_SNDBUF_LOCK)
-+ return;
-+
-+ /* Adding a new subflow to the send-buffer space. We make a simple
-+ * addition, to give some space to allow traffic on the new subflow.
-+ * Autotuning will increase it further later on.
-+ */
-+ space = min(meta_sk->sk_sndbuf + sk->sk_sndbuf, sysctl_tcp_wmem[2]);
-+ if (space > meta_sk->sk_sndbuf) {
-+ meta_sk->sk_sndbuf = space;
-+ meta_sk->sk_write_space(meta_sk);
-+ }
-+}
-+
-+void mptcp_tcp_set_rto(struct sock *sk)
-+{
-+ tcp_set_rto(sk);
-+ mptcp_set_rto(sk);
-+}
-diff --git a/net/mptcp/mptcp_ipv4.c b/net/mptcp/mptcp_ipv4.c
-new file mode 100644
-index 000000000000..1183d1305d35
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv4.c
-@@ -0,0 +1,483 @@
-+/*
-+ * MPTCP implementation - IPv4-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/ip.h>
-+#include <linux/list.h>
-+#include <linux/skbuff.h>
-+#include <linux/spinlock.h>
-+#include <linux/tcp.h>
-+
-+#include <net/inet_common.h>
-+#include <net/inet_connection_sock.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/request_sock.h>
-+#include <net/tcp.h>
-+
-+u32 mptcp_v4_get_nonce(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v4_get_key(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport)
-+{
-+ u32 hash[MD5_DIGEST_WORDS];
-+
-+ hash[0] = (__force u32)saddr;
-+ hash[1] = (__force u32)daddr;
-+ hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-+ hash[3] = mptcp_seed++;
-+
-+ md5_transform(hash, mptcp_secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+
-+static void mptcp_v4_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v4_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v4_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v4_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv4_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v4_get_nonce(ip_hdr(skb)->saddr,
-+ ip_hdr(skb)->daddr,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.ip = inet_rsk(req)->ir_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp_request_sock_ops */
-+struct request_sock_ops mptcp_request_sock_ops __read_mostly = {
-+ .family = PF_INET,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_rtx_synack,
-+ .send_ack = tcp_v4_reqsk_send_ack,
-+ .destructor = mptcp_v4_reqsk_destructor,
-+ .send_reset = tcp_v4_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v4_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet_synq_hash(inet_rsk(req)->ir_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+/* Similar to tcp_v4_conn_request */
-+static int mptcp_v4_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp_request_sock_ops,
-+ &mptcp_join_request_sock_ipv4_ops,
-+ meta_sk, skb);
-+}
-+
-+/* We only process join requests here. (either the SYN or the final ACK) */
-+int mptcp_v4_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = inet_lookup_established(sock_net(meta_sk), &tcp_hashinfo,
-+ iph->saddr, th->source, iph->daddr,
-+ th->dest, inet_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+ WARN("%s Did not find a sub-sk - did found the meta!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v4_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v4_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we hold
-+ * already the meta-sk-lock and are sure that it is not owned
-+ * by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v4_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ const struct iphdr *iph = ip_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet_csk_search_req(meta_sk, &prev, th->source,
-+ iph->saddr, iph->daddr);
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v4_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v4_search_req(const __be16 rport, const __be32 raddr,
-+ const __be32 laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *ireq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (ireq->ir_rmt_port == rport &&
-+ ireq->ir_rmt_addr == raddr &&
-+ ireq->ir_loc_addr == laddr &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
-+
-+/* Create a new IPv4 subflow.
-+ *
-+ * We are in user-context and meta-sock-lock is hold.
-+ */
-+int mptcp_init4_subsockets(struct sock *meta_sk, const struct mptcp_loc4 *loc,
-+ struct mptcp_rem4 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc4_id, rem->rem4_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin_family = AF_INET;
-+ rem_in.sin_family = AF_INET;
-+ loc_in.sin_port = 0;
-+ if (rem->port)
-+ rem_in.sin_port = rem->port;
-+ else
-+ rem_in.sin_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin_addr = loc->addr;
-+ rem_in.sin_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI4:%d dst_addr:%pI4:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin_addr,
-+ ntohs(loc_in.sin_port), &rem_in.sin_addr,
-+ ntohs(rem_in.sin_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v4(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init4_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v4_specific = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v4_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ip_setsockopt,
-+ .getsockopt = ip_getsockopt,
-+ .addr2sockaddr = inet_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in),
-+ .bind_conflict = inet_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ip_setsockopt,
-+ .compat_getsockopt = compat_ip_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv4_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv4_ops;
-+
-+/* General initialization of IPv4 for MPTCP */
-+int mptcp_pm_v4_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp_request_sock_ops;
-+
-+ mptcp_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_request_sock_ipv4_ops.init_req = mptcp_v4_init_req;
-+
-+ mptcp_join_request_sock_ipv4_ops = tcp_request_sock_ipv4_ops;
-+ mptcp_join_request_sock_ipv4_ops.init_req = mptcp_v4_join_init_req;
-+ mptcp_join_request_sock_ipv4_ops.queue_hash_add = mptcp_v4_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v4_undo(void)
-+{
-+ kmem_cache_destroy(mptcp_request_sock_ops.slab);
-+ kfree(mptcp_request_sock_ops.slab_name);
-+}
-diff --git a/net/mptcp/mptcp_ipv6.c b/net/mptcp/mptcp_ipv6.c
-new file mode 100644
-index 000000000000..1036973aa855
---- /dev/null
-+++ b/net/mptcp/mptcp_ipv6.c
-@@ -0,0 +1,518 @@
-+/*
-+ * MPTCP implementation - IPv6-specific functions
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/export.h>
-+#include <linux/in6.h>
-+#include <linux/kernel.h>
-+
-+#include <net/addrconf.h>
-+#include <net/flow.h>
-+#include <net/inet6_connection_sock.h>
-+#include <net/inet6_hashtables.h>
-+#include <net/inet_common.h>
-+#include <net/ipv6.h>
-+#include <net/ip6_checksum.h>
-+#include <net/ip6_route.h>
-+#include <net/mptcp.h>
-+#include <net/mptcp_v6.h>
-+#include <net/tcp.h>
-+#include <net/transp_v6.h>
-+
-+__u32 mptcp_v6_get_nonce(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return hash[0];
-+}
-+
-+u64 mptcp_v6_get_key(const __be32 *saddr, const __be32 *daddr,
-+ __be16 sport, __be16 dport)
-+{
-+ u32 secret[MD5_MESSAGE_BYTES / 4];
-+ u32 hash[MD5_DIGEST_WORDS];
-+ u32 i;
-+
-+ memcpy(hash, saddr, 16);
-+ for (i = 0; i < 4; i++)
-+ secret[i] = mptcp_secret[i] + (__force u32)daddr[i];
-+ secret[4] = mptcp_secret[4] +
-+ (((__force u16)sport << 16) + (__force u16)dport);
-+ secret[5] = mptcp_seed++;
-+ for (i = 6; i < MD5_MESSAGE_BYTES / 4; i++)
-+ secret[i] = mptcp_secret[i];
-+
-+ md5_transform(hash, secret);
-+
-+ return *((u64 *)hash);
-+}
-+
-+static void mptcp_v6_reqsk_destructor(struct request_sock *req)
-+{
-+ mptcp_reqsk_destructor(req);
-+
-+ tcp_v6_reqsk_destructor(req);
-+}
-+
-+static int mptcp_v6_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+ mptcp_reqsk_init(req, skb);
-+
-+ return 0;
-+}
-+
-+static int mptcp_v6_join_init_req(struct request_sock *req, struct sock *sk,
-+ struct sk_buff *skb)
-+{
-+ struct mptcp_request_sock *mtreq = mptcp_rsk(req);
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+ union inet_addr addr;
-+ int loc_id;
-+ bool low_prio = false;
-+
-+ /* We need to do this as early as possible. Because, if we fail later
-+ * (e.g., get_local_id), then reqsk_free tries to remove the
-+ * request-socket from the htb in mptcp_hash_request_remove as pprev
-+ * may be different from NULL.
-+ */
-+ mtreq->hash_entry.pprev = NULL;
-+
-+ tcp_request_sock_ipv6_ops.init_req(req, sk, skb);
-+
-+ mtreq->mptcp_loc_nonce = mptcp_v6_get_nonce(ipv6_hdr(skb)->saddr.s6_addr32,
-+ ipv6_hdr(skb)->daddr.s6_addr32,
-+ tcp_hdr(skb)->source,
-+ tcp_hdr(skb)->dest);
-+ addr.in6 = inet_rsk(req)->ir_v6_loc_addr;
-+ loc_id = mpcb->pm_ops->get_local_id(AF_INET6, &addr, sock_net(sk), &low_prio);
-+ if (loc_id == -1)
-+ return -1;
-+ mtreq->loc_id = loc_id;
-+ mtreq->low_prio = low_prio;
-+
-+ mptcp_join_reqsk_init(mpcb, req, skb);
-+
-+ return 0;
-+}
-+
-+/* Similar to tcp6_request_sock_ops */
-+struct request_sock_ops mptcp6_request_sock_ops __read_mostly = {
-+ .family = AF_INET6,
-+ .obj_size = sizeof(struct mptcp_request_sock),
-+ .rtx_syn_ack = tcp_v6_rtx_synack,
-+ .send_ack = tcp_v6_reqsk_send_ack,
-+ .destructor = mptcp_v6_reqsk_destructor,
-+ .send_reset = tcp_v6_send_reset,
-+ .syn_ack_timeout = tcp_syn_ack_timeout,
-+};
-+
-+static void mptcp_v6_reqsk_queue_hash_add(struct sock *meta_sk,
-+ struct request_sock *req,
-+ const unsigned long timeout)
-+{
-+ const u32 h1 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ 0, MPTCP_HASH_SIZE);
-+ /* We cannot call inet6_csk_reqsk_queue_hash_add(), because we do not
-+ * want to reset the keepalive-timer (responsible for retransmitting
-+ * SYN/ACKs). We do not retransmit SYN/ACKs+MP_JOINs, because we cannot
-+ * overload the keepalive timer. Also, it's not a big deal, because the
-+ * third ACK of the MP_JOIN-handshake is sent in a reliable manner. So,
-+ * if the third ACK gets lost, the client will handle the retransmission
-+ * anyways. If our SYN/ACK gets lost, the client will retransmit the
-+ * SYN.
-+ */
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ struct listen_sock *lopt = meta_icsk->icsk_accept_queue.listen_opt;
-+ const u32 h2 = inet6_synq_hash(&inet_rsk(req)->ir_v6_rmt_addr,
-+ inet_rsk(req)->ir_rmt_port,
-+ lopt->hash_rnd, lopt->nr_table_entries);
-+
-+ reqsk_queue_hash_req(&meta_icsk->icsk_accept_queue, h2, req, timeout);
-+ if (reqsk_queue_added(&meta_icsk->icsk_accept_queue) == 0)
-+ mptcp_reset_synack_timer(meta_sk, timeout);
-+
-+ rcu_read_lock();
-+ spin_lock(&mptcp_reqsk_hlock);
-+ hlist_nulls_add_head_rcu(&mptcp_rsk(req)->hash_entry, &mptcp_reqsk_htb[h1]);
-+ spin_unlock(&mptcp_reqsk_hlock);
-+ rcu_read_unlock();
-+}
-+
-+static int mptcp_v6_join_request(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ return tcp_conn_request(&mptcp6_request_sock_ops,
-+ &mptcp_join_request_sock_ipv6_ops,
-+ meta_sk, skb);
-+}
-+
-+int mptcp_v6_do_rcv(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *child, *rsk = NULL;
-+ int ret;
-+
-+ if (!(TCP_SKB_CB(skb)->mptcp_flags & MPTCPHDR_JOIN)) {
-+ struct tcphdr *th = tcp_hdr(skb);
-+ const struct ipv6hdr *ip6h = ipv6_hdr(skb);
-+ struct sock *sk;
-+
-+ sk = __inet6_lookup_established(sock_net(meta_sk),
-+ &tcp_hashinfo,
-+ &ip6h->saddr, th->source,
-+ &ip6h->daddr, ntohs(th->dest),
-+ inet6_iif(skb));
-+
-+ if (!sk) {
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+ if (is_meta_sk(sk)) {
-+ WARN("%s Did not find a sub-sk!\n", __func__);
-+ kfree_skb(skb);
-+ sock_put(sk);
-+ return 0;
-+ }
-+
-+ if (sk->sk_state == TCP_TIME_WAIT) {
-+ inet_twsk_put(inet_twsk(sk));
-+ kfree_skb(skb);
-+ return 0;
-+ }
-+
-+ ret = tcp_v6_do_rcv(sk, skb);
-+ sock_put(sk);
-+
-+ return ret;
-+ }
-+ TCP_SKB_CB(skb)->mptcp_flags = 0;
-+
-+ /* Has been removed from the tk-table. Thus, no new subflows.
-+ *
-+ * Check for close-state is necessary, because we may have been closed
-+ * without passing by mptcp_close().
-+ *
-+ * When falling back, no new subflows are allowed either.
-+ */
-+ if (meta_sk->sk_state == TCP_CLOSE || !tcp_sk(meta_sk)->inside_tk_table ||
-+ mpcb->infinite_mapping_rcv || mpcb->send_infinite_mapping)
-+ goto reset_and_discard;
-+
-+ child = tcp_v6_hnd_req(meta_sk, skb);
-+
-+ if (!child)
-+ goto discard;
-+
-+ if (child != meta_sk) {
-+ sock_rps_save_rxhash(child, skb);
-+ /* We don't call tcp_child_process here, because we hold
-+ * already the meta-sk-lock and are sure that it is not owned
-+ * by the user.
-+ */
-+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb), skb->len);
-+ bh_unlock_sock(child);
-+ sock_put(child);
-+ if (ret) {
-+ rsk = child;
-+ goto reset_and_discard;
-+ }
-+ } else {
-+ if (tcp_hdr(skb)->syn) {
-+ mptcp_v6_join_request(meta_sk, skb);
-+ goto discard;
-+ }
-+ goto reset_and_discard;
-+ }
-+ return 0;
-+
-+reset_and_discard:
-+ if (reqsk_queue_len(&inet_csk(meta_sk)->icsk_accept_queue)) {
-+ const struct tcphdr *th = tcp_hdr(skb);
-+ struct request_sock **prev, *req;
-+ /* If we end up here, it means we should not have matched on the
-+ * request-socket. But, because the request-sock queue is only
-+ * destroyed in mptcp_close, the socket may actually already be
-+ * in close-state (e.g., through shutdown()) while still having
-+ * pending request sockets.
-+ */
-+ req = inet6_csk_search_req(meta_sk, &prev, th->source,
-+ &ipv6_hdr(skb)->saddr,
-+ &ipv6_hdr(skb)->daddr, inet6_iif(skb));
-+ if (req) {
-+ inet_csk_reqsk_queue_unlink(meta_sk, req, prev);
-+ reqsk_queue_removed(&inet_csk(meta_sk)->icsk_accept_queue,
-+ req);
-+ reqsk_free(req);
-+ }
-+ }
-+
-+ tcp_v6_send_reset(rsk, skb);
-+discard:
-+ kfree_skb(skb);
-+ return 0;
-+}
-+
-+/* After this, the ref count of the meta_sk associated with the request_sock
-+ * is incremented. Thus it is the responsibility of the caller
-+ * to call sock_put() when the reference is not needed anymore.
-+ */
-+struct sock *mptcp_v6_search_req(const __be16 rport, const struct in6_addr *raddr,
-+ const struct in6_addr *laddr, const struct net *net)
-+{
-+ const struct mptcp_request_sock *mtreq;
-+ struct sock *meta_sk = NULL;
-+ const struct hlist_nulls_node *node;
-+ const u32 hash = inet6_synq_hash(raddr, rport, 0, MPTCP_HASH_SIZE);
-+
-+ rcu_read_lock();
-+begin:
-+ hlist_nulls_for_each_entry_rcu(mtreq, node, &mptcp_reqsk_htb[hash],
-+ hash_entry) {
-+ struct inet_request_sock *treq = inet_rsk(rev_mptcp_rsk(mtreq));
-+ meta_sk = mtreq->mptcp_mpcb->meta_sk;
-+
-+ if (inet_rsk(rev_mptcp_rsk(mtreq))->ir_rmt_port == rport &&
-+ rev_mptcp_rsk(mtreq)->rsk_ops->family == AF_INET6 &&
-+ ipv6_addr_equal(&treq->ir_v6_rmt_addr, raddr) &&
-+ ipv6_addr_equal(&treq->ir_v6_loc_addr, laddr) &&
-+ net_eq(net, sock_net(meta_sk)))
-+ goto found;
-+ meta_sk = NULL;
-+ }
-+ /* A request-socket is destroyed by RCU. So, it might have been recycled
-+ * and put into another hash-table list. So, after the lookup we may
-+ * end up in a different list. So, we may need to restart.
-+ *
-+ * See also the comment in __inet_lookup_established.
-+ */
-+ if (get_nulls_value(node) != hash + MPTCP_REQSK_NULLS_BASE)
-+ goto begin;
-+
-+found:
-+ if (meta_sk && unlikely(!atomic_inc_not_zero(&meta_sk->sk_refcnt)))
-+ meta_sk = NULL;
-+ rcu_read_unlock();
-+
-+ return meta_sk;
-+}
-+
-+/* Create a new IPv6 subflow.
-+ *
-+ * We are in user-context and meta-sock-lock is hold.
-+ */
-+int mptcp_init6_subsockets(struct sock *meta_sk, const struct mptcp_loc6 *loc,
-+ struct mptcp_rem6 *rem)
-+{
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ struct sockaddr_in6 loc_in, rem_in;
-+ struct socket sock;
-+ int ret;
-+
-+ /** First, create and prepare the new socket */
-+
-+ sock.type = meta_sk->sk_socket->type;
-+ sock.state = SS_UNCONNECTED;
-+ sock.wq = meta_sk->sk_socket->wq;
-+ sock.file = meta_sk->sk_socket->file;
-+ sock.ops = NULL;
-+
-+ ret = inet6_create(sock_net(meta_sk), &sock, IPPROTO_TCP, 1);
-+ if (unlikely(ret < 0)) {
-+ mptcp_debug("%s inet6_create failed ret: %d\n", __func__, ret);
-+ return ret;
-+ }
-+
-+ sk = sock.sk;
-+ tp = tcp_sk(sk);
-+
-+ /* All subsockets need the MPTCP-lock-class */
-+ lockdep_set_class_and_name(&(sk)->sk_lock.slock, &meta_slock_key, "slock-AF_INET-MPTCP");
-+ lockdep_init_map(&(sk)->sk_lock.dep_map, "sk_lock-AF_INET-MPTCP", &meta_key, 0);
-+
-+ if (mptcp_add_sock(meta_sk, sk, loc->loc6_id, rem->rem6_id, GFP_KERNEL))
-+ goto error;
-+
-+ tp->mptcp->slave_sk = 1;
-+ tp->mptcp->low_prio = loc->low_prio;
-+
-+ /* Initializing the timer for an MPTCP subflow */
-+ setup_timer(&tp->mptcp->mptcp_ack_timer, mptcp_ack_handler, (unsigned long)sk);
-+
-+ /** Then, connect the socket to the peer */
-+ loc_in.sin6_family = AF_INET6;
-+ rem_in.sin6_family = AF_INET6;
-+ loc_in.sin6_port = 0;
-+ if (rem->port)
-+ rem_in.sin6_port = rem->port;
-+ else
-+ rem_in.sin6_port = inet_sk(meta_sk)->inet_dport;
-+ loc_in.sin6_addr = loc->addr;
-+ rem_in.sin6_addr = rem->addr;
-+
-+ ret = sock.ops->bind(&sock, (struct sockaddr *)&loc_in, sizeof(struct sockaddr_in6));
-+ if (ret < 0) {
-+ mptcp_debug("%s: MPTCP subsocket bind()failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ mptcp_debug("%s: token %#x pi %d src_addr:%pI6:%d dst_addr:%pI6:%d\n",
-+ __func__, tcp_sk(meta_sk)->mpcb->mptcp_loc_token,
-+ tp->mptcp->path_index, &loc_in.sin6_addr,
-+ ntohs(loc_in.sin6_port), &rem_in.sin6_addr,
-+ ntohs(rem_in.sin6_port));
-+
-+ if (tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6)
-+ tcp_sk(meta_sk)->mpcb->pm_ops->init_subsocket_v6(sk, rem->addr);
-+
-+ ret = sock.ops->connect(&sock, (struct sockaddr *)&rem_in,
-+ sizeof(struct sockaddr_in6), O_NONBLOCK);
-+ if (ret < 0 && ret != -EINPROGRESS) {
-+ mptcp_debug("%s: MPTCP subsocket connect() failed, error %d\n",
-+ __func__, ret);
-+ goto error;
-+ }
-+
-+ sk_set_socket(sk, meta_sk->sk_socket);
-+ sk->sk_wq = meta_sk->sk_wq;
-+
-+ return 0;
-+
-+error:
-+ /* May happen if mptcp_add_sock fails first */
-+ if (!mptcp(tp)) {
-+ tcp_close(sk, 0);
-+ } else {
-+ local_bh_disable();
-+ mptcp_sub_force_close(sk);
-+ local_bh_enable();
-+ }
-+ return ret;
-+}
-+EXPORT_SYMBOL(mptcp_init6_subsockets);
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_specific = {
-+ .queue_xmit = inet6_csk_xmit,
-+ .send_check = tcp_v6_send_check,
-+ .rebuild_header = inet6_sk_rebuild_header,
-+ .sk_rx_dst_set = inet6_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct ipv6hdr),
-+ .net_frag_header_len = sizeof(struct frag_hdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+const struct inet_connection_sock_af_ops mptcp_v6_mapped = {
-+ .queue_xmit = ip_queue_xmit,
-+ .send_check = tcp_v4_send_check,
-+ .rebuild_header = inet_sk_rebuild_header,
-+ .sk_rx_dst_set = inet_sk_rx_dst_set,
-+ .conn_request = mptcp_conn_request,
-+ .syn_recv_sock = tcp_v6_syn_recv_sock,
-+ .net_header_len = sizeof(struct iphdr),
-+ .setsockopt = ipv6_setsockopt,
-+ .getsockopt = ipv6_getsockopt,
-+ .addr2sockaddr = inet6_csk_addr2sockaddr,
-+ .sockaddr_len = sizeof(struct sockaddr_in6),
-+ .bind_conflict = inet6_csk_bind_conflict,
-+#ifdef CONFIG_COMPAT
-+ .compat_setsockopt = compat_ipv6_setsockopt,
-+ .compat_getsockopt = compat_ipv6_getsockopt,
-+#endif
-+};
-+
-+struct tcp_request_sock_ops mptcp_request_sock_ipv6_ops;
-+struct tcp_request_sock_ops mptcp_join_request_sock_ipv6_ops;
-+
-+int mptcp_pm_v6_init(void)
-+{
-+ int ret = 0;
-+ struct request_sock_ops *ops = &mptcp6_request_sock_ops;
-+
-+ mptcp_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_request_sock_ipv6_ops.init_req = mptcp_v6_init_req;
-+
-+ mptcp_join_request_sock_ipv6_ops = tcp_request_sock_ipv6_ops;
-+ mptcp_join_request_sock_ipv6_ops.init_req = mptcp_v6_join_init_req;
-+ mptcp_join_request_sock_ipv6_ops.queue_hash_add = mptcp_v6_reqsk_queue_hash_add;
-+
-+ ops->slab_name = kasprintf(GFP_KERNEL, "request_sock_%s", "MPTCP6");
-+ if (ops->slab_name == NULL) {
-+ ret = -ENOMEM;
-+ goto out;
-+ }
-+
-+ ops->slab = kmem_cache_create(ops->slab_name, ops->obj_size, 0,
-+ SLAB_DESTROY_BY_RCU|SLAB_HWCACHE_ALIGN,
-+ NULL);
-+
-+ if (ops->slab == NULL) {
-+ ret = -ENOMEM;
-+ goto err_reqsk_create;
-+ }
-+
-+out:
-+ return ret;
-+
-+err_reqsk_create:
-+ kfree(ops->slab_name);
-+ ops->slab_name = NULL;
-+ goto out;
-+}
-+
-+void mptcp_pm_v6_undo(void)
-+{
-+ kmem_cache_destroy(mptcp6_request_sock_ops.slab);
-+ kfree(mptcp6_request_sock_ops.slab_name);
-+}
-diff --git a/net/mptcp/mptcp_ndiffports.c b/net/mptcp/mptcp_ndiffports.c
-new file mode 100644
-index 000000000000..6f5087983175
---- /dev/null
-+++ b/net/mptcp/mptcp_ndiffports.c
-@@ -0,0 +1,161 @@
-+#include <linux/module.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+
-+#if IS_ENABLED(CONFIG_IPV6)
-+#include <net/mptcp_v6.h>
-+#endif
-+
-+struct ndiffports_priv {
-+ /* Worker struct for subflow establishment */
-+ struct work_struct subflow_work;
-+
-+ struct mptcp_cb *mpcb;
-+};
-+
-+static int num_subflows __read_mostly = 2;
-+module_param(num_subflows, int, 0644);
-+MODULE_PARM_DESC(num_subflows, "choose the number of subflows per MPTCP connection");
-+
-+/**
-+ * Create all new subflows, by doing calls to mptcp_initX_subsockets
-+ *
-+ * This function uses a goto next_subflow, to allow releasing the lock between
-+ * new subflows and giving other processes a chance to do some work on the
-+ * socket and potentially finishing the communication.
-+ **/
-+static void create_subflow_worker(struct work_struct *work)
-+{
-+ const struct ndiffports_priv *pm_priv = container_of(work,
-+ struct ndiffports_priv,
-+ subflow_work);
-+ struct mptcp_cb *mpcb = pm_priv->mpcb;
-+ struct sock *meta_sk = mpcb->meta_sk;
-+ int iter = 0;
-+
-+next_subflow:
-+ if (iter) {
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+
-+ cond_resched();
-+ }
-+ mutex_lock(&mpcb->mpcb_mutex);
-+ lock_sock_nested(meta_sk, SINGLE_DEPTH_NESTING);
-+
-+ iter++;
-+
-+ if (sock_flag(meta_sk, SOCK_DEAD))
-+ goto exit;
-+
-+ if (mpcb->master_sk &&
-+ !tcp_sk(mpcb->master_sk)->mptcp->fully_established)
-+ goto exit;
-+
-+ if (num_subflows > iter && num_subflows > mpcb->cnt_subflows) {
-+ if (meta_sk->sk_family == AF_INET ||
-+ mptcp_v6_is_v4_mapped(meta_sk)) {
-+ struct mptcp_loc4 loc;
-+ struct mptcp_rem4 rem;
-+
-+ loc.addr.s_addr = inet_sk(meta_sk)->inet_saddr;
-+ loc.loc4_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr.s_addr = inet_sk(meta_sk)->inet_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem4_id = 0; /* Default 0 */
-+
-+ mptcp_init4_subsockets(meta_sk, &loc, &rem);
-+ } else {
-+#if IS_ENABLED(CONFIG_IPV6)
-+ struct mptcp_loc6 loc;
-+ struct mptcp_rem6 rem;
-+
-+ loc.addr = inet6_sk(meta_sk)->saddr;
-+ loc.loc6_id = 0;
-+ loc.low_prio = 0;
-+
-+ rem.addr = meta_sk->sk_v6_daddr;
-+ rem.port = inet_sk(meta_sk)->inet_dport;
-+ rem.rem6_id = 0; /* Default 0 */
-+
-+ mptcp_init6_subsockets(meta_sk, &loc, &rem);
-+#endif
-+ }
-+ goto next_subflow;
-+ }
-+
-+exit:
-+ release_sock(meta_sk);
-+ mutex_unlock(&mpcb->mpcb_mutex);
-+ sock_put(meta_sk);
-+}
-+
-+static void ndiffports_new_session(const struct sock *meta_sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *fmp = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ /* Initialize workqueue-struct */
-+ INIT_WORK(&fmp->subflow_work, create_subflow_worker);
-+ fmp->mpcb = mpcb;
-+}
-+
-+static void ndiffports_create_subflows(struct sock *meta_sk)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct ndiffports_priv *pm_priv = (struct ndiffports_priv *)&mpcb->mptcp_pm[0];
-+
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv ||
-+ mpcb->send_infinite_mapping ||
-+ mpcb->server_side || sock_flag(meta_sk, SOCK_DEAD))
-+ return;
-+
-+ if (!work_pending(&pm_priv->subflow_work)) {
-+ sock_hold(meta_sk);
-+ queue_work(mptcp_wq, &pm_priv->subflow_work);
-+ }
-+}
-+
-+static int ndiffports_get_local_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+static struct mptcp_pm_ops ndiffports __read_mostly = {
-+ .new_session = ndiffports_new_session,
-+ .fully_established = ndiffports_create_subflows,
-+ .get_local_id = ndiffports_get_local_id,
-+ .name = "ndiffports",
-+ .owner = THIS_MODULE,
-+};
-+
-+/* General initialization of MPTCP_PM */
-+static int __init ndiffports_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct ndiffports_priv) > MPTCP_PM_SIZE);
-+
-+ if (mptcp_register_path_manager(&ndiffports))
-+ goto exit;
-+
-+ return 0;
-+
-+exit:
-+ return -1;
-+}
-+
-+static void ndiffports_unregister(void)
-+{
-+ mptcp_unregister_path_manager(&ndiffports);
-+}
-+
-+module_init(ndiffports_register);
-+module_exit(ndiffports_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("NDIFF-PORTS MPTCP");
-+MODULE_VERSION("0.88");
-diff --git a/net/mptcp/mptcp_ofo_queue.c b/net/mptcp/mptcp_ofo_queue.c
-new file mode 100644
-index 000000000000..ec4e98622637
---- /dev/null
-+++ b/net/mptcp/mptcp_ofo_queue.c
-@@ -0,0 +1,295 @@
-+/*
-+ * MPTCP implementation - Fast algorithm for MPTCP meta-reordering
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <linux/slab.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+void mptcp_remove_shortcuts(const struct mptcp_cb *mpcb,
-+ const struct sk_buff *skb)
-+{
-+ struct tcp_sock *tp;
-+
-+ mptcp_for_each_tp(mpcb, tp) {
-+ if (tp->mptcp->shortcut_ofoqueue == skb) {
-+ tp->mptcp->shortcut_ofoqueue = NULL;
-+ return;
-+ }
-+ }
-+}
-+
-+/* Does 'skb' fit after 'here' in the queue 'head'?
-+ * If yes, we queue it and return 1
-+ */
-+static int mptcp_ofo_queue_after(struct sk_buff_head *head,
-+ struct sk_buff *skb, struct sk_buff *here,
-+ const struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We want to queue skb after here, thus seq >= end_seq */
-+ if (before(seq, TCP_SKB_CB(here)->end_seq))
-+ return 0;
-+
-+ if (seq == TCP_SKB_CB(here)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, here, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, here, skb);
-+ return 1;
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ return -1;
-+ }
-+ }
-+
-+ /* If here is the last one, we can always queue it */
-+ if (skb_queue_is_last(head, here)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ } else {
-+ struct sk_buff *skb1 = skb_queue_next(head, here);
-+ /* It's not the last one, but does it fit between 'here' and
-+ * the one after 'here'? Thus, does end_seq <= after_here->seq
-+ */
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq)) {
-+ __skb_queue_after(head, here, skb);
-+ return 1;
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+static void try_shortcut(struct sk_buff *shortcut, struct sk_buff *skb,
-+ struct sk_buff_head *head, struct tcp_sock *tp)
-+{
-+ struct sock *meta_sk = tp->meta_sk;
-+ struct tcp_sock *tp_it, *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb1, *best_shortcut = NULL;
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+ u32 distance = 0xffffffff;
-+
-+ /* First, check the tp's shortcut */
-+ if (!shortcut) {
-+ if (skb_queue_empty(head)) {
-+ __skb_queue_head(head, skb);
-+ goto end;
-+ }
-+ } else {
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+ /* Is the tp's shortcut a hit? If yes, we insert. */
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Check the shortcuts of the other subsockets. */
-+ mptcp_for_each_tp(mpcb, tp_it) {
-+ shortcut = tp_it->mptcp->shortcut_ofoqueue;
-+ /* Can we queue it here? If yes, do so! */
-+ if (shortcut) {
-+ int ret = mptcp_ofo_queue_after(head, skb, shortcut, tp);
-+
-+ if (ret) {
-+ skb = (ret > 0) ? skb : NULL;
-+ goto end;
-+ }
-+ }
-+
-+ /* Could not queue it, check if we are close.
-+ * We are looking for a shortcut, close enough to seq to
-+ * set skb1 prematurely and thus improve the subsequent lookup,
-+ * which tries to find a skb1 so that skb1->seq <= seq.
-+ *
-+ * So, here we only take shortcuts, whose shortcut->seq > seq,
-+ * and minimize the distance between shortcut->seq and seq and
-+ * set best_shortcut to this one with the minimal distance.
-+ *
-+ * That way, the subsequent while-loop is shortest.
-+ */
-+ if (shortcut && after(TCP_SKB_CB(shortcut)->seq, seq)) {
-+ /* Are we closer than the current best shortcut? */
-+ if ((u32)(TCP_SKB_CB(shortcut)->seq - seq) < distance) {
-+ distance = (u32)(TCP_SKB_CB(shortcut)->seq - seq);
-+ best_shortcut = shortcut;
-+ }
-+ }
-+ }
-+
-+ if (best_shortcut)
-+ skb1 = best_shortcut;
-+ else
-+ skb1 = skb_peek_tail(head);
-+
-+ if (seq == TCP_SKB_CB(skb1)->end_seq) {
-+ bool fragstolen = false;
-+
-+ if (!tcp_try_coalesce(meta_sk, skb1, skb, &fragstolen)) {
-+ __skb_queue_after(&meta_tp->out_of_order_queue, skb1, skb);
-+ } else {
-+ kfree_skb_partial(skb, fragstolen);
-+ skb = NULL;
-+ }
-+
-+ goto end;
-+ }
-+
-+ /* Find the insertion point, starting from best_shortcut if available.
-+ *
-+ * Inspired from tcp_data_queue_ofo.
-+ */
-+ while (1) {
-+ /* skb1->seq <= seq */
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(head, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+
-+ /* Does skb overlap the previous one? */
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. */
-+ __kfree_skb(skb);
-+ skb = NULL;
-+ goto end;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(head, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(head, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(head, skb);
-+ else
-+ __skb_queue_after(head, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(head, skb)) {
-+ skb1 = skb_queue_next(head, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, head);
-+ mptcp_remove_shortcuts(mpcb, skb1);
-+ __kfree_skb(skb1);
-+ }
-+
-+end:
-+ if (skb) {
-+ skb_set_owner_r(skb, meta_sk);
-+ tp->mptcp->shortcut_ofoqueue = skb;
-+ }
-+
-+ return;
-+}
-+
-+/**
-+ * @sk: the subflow that received this skb.
-+ */
-+void mptcp_add_meta_ofo_queue(const struct sock *meta_sk, struct sk_buff *skb,
-+ struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ try_shortcut(tp->mptcp->shortcut_ofoqueue, skb,
-+ &tcp_sk(meta_sk)->out_of_order_queue, tp);
-+}
-+
-+bool mptcp_prune_ofo_queue(struct sock *sk)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ bool res = false;
-+
-+ if (!skb_queue_empty(&tp->out_of_order_queue)) {
-+ NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-+ mptcp_purge_ofo_queue(tp);
-+
-+ /* No sack at the mptcp-level */
-+ sk_mem_reclaim(sk);
-+ res = true;
-+ }
-+
-+ return res;
-+}
-+
-+void mptcp_ofo_queue(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+
-+ while ((skb = skb_peek(&meta_tp->out_of_order_queue)) != NULL) {
-+ u32 old_rcv_nxt = meta_tp->rcv_nxt;
-+ if (after(TCP_SKB_CB(skb)->seq, meta_tp->rcv_nxt))
-+ break;
-+
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->rcv_nxt)) {
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+
-+ __skb_unlink(skb, &meta_tp->out_of_order_queue);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+
-+ __skb_queue_tail(&meta_sk->sk_receive_queue, skb);
-+ meta_tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
-+ mptcp_check_rcvseq_wrap(meta_tp, old_rcv_nxt);
-+
-+ if (tcp_hdr(skb)->fin)
-+ mptcp_fin(meta_sk);
-+ }
-+}
-+
-+void mptcp_purge_ofo_queue(struct tcp_sock *meta_tp)
-+{
-+ struct sk_buff_head *head = &meta_tp->out_of_order_queue;
-+ struct sk_buff *skb, *tmp;
-+
-+ skb_queue_walk_safe(head, skb, tmp) {
-+ __skb_unlink(skb, head);
-+ mptcp_remove_shortcuts(meta_tp->mpcb, skb);
-+ kfree_skb(skb);
-+ }
-+}
-diff --git a/net/mptcp/mptcp_olia.c b/net/mptcp/mptcp_olia.c
-new file mode 100644
-index 000000000000..53f5c43bb488
---- /dev/null
-+++ b/net/mptcp/mptcp_olia.c
-@@ -0,0 +1,311 @@
-+/*
-+ * MPTCP implementation - OPPORTUNISTIC LINKED INCREASES CONGESTION CONTROL:
-+ *
-+ * Algorithm design:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ * Nicolas Gast <nicolas.gast@epfl.ch>
-+ * Jean-Yves Le Boudec <jean-yves.leboudec@epfl.ch>
-+ *
-+ * Implementation:
-+ * Ramin Khalili <ramin.khalili@epfl.ch>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+
-+#include <linux/module.h>
-+
-+static int scale = 10;
-+
-+struct mptcp_olia {
-+ u32 mptcp_loss1;
-+ u32 mptcp_loss2;
-+ u32 mptcp_loss3;
-+ int epsilon_num;
-+ u32 epsilon_den;
-+ int mptcp_snd_cwnd_cnt;
-+};
-+
-+static inline int mptcp_olia_sk_can_send(const struct sock *sk)
-+{
-+ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
-+}
-+
-+static inline u64 mptcp_olia_scale(u64 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+/* Take care of the artificial inflation of cwnd (see RFC 5681)
-+ * during the fast-retransmit phase
-+ */
-+static u32 mptcp_get_crt_cwnd(struct sock *sk)
-+{
-+ const struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (icsk->icsk_ca_state == TCP_CA_Recovery)
-+ return tcp_sk(sk)->snd_ssthresh;
-+ else
-+ return tcp_sk(sk)->snd_cwnd;
-+}
-+
-+/* return the denominator of the first term of the increase term */
-+static u64 mptcp_get_rate(const struct mptcp_cb *mpcb , u32 path_rtt)
-+{
-+ struct sock *sk;
-+ u64 rate = 1; /* We have to avoid a zero-rate because it is used as a divisor */
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ u64 scaled_num;
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ scaled_num = mptcp_olia_scale(tmp_cwnd, scale) * path_rtt;
-+ rate += div_u64(scaled_num , tp->srtt_us);
-+ }
-+ rate *= rate;
-+ return rate;
-+}
-+
-+/* find the maximum cwnd, used to find set M */
-+static u32 mptcp_get_max_cwnd(const struct mptcp_cb *mpcb)
-+{
-+ struct sock *sk;
-+ u32 best_cwnd = 0;
-+
-+ mptcp_for_each_sk(mpcb, sk) {
-+ u32 tmp_cwnd;
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd > best_cwnd)
-+ best_cwnd = tmp_cwnd;
-+ }
-+ return best_cwnd;
-+}
-+
-+static void mptcp_get_epsilon(const struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_olia *ca;
-+ struct tcp_sock *tp;
-+ struct sock *sk;
-+ u64 tmp_int, tmp_rtt, best_int = 0, best_rtt = 1;
-+ u32 max_cwnd = 1, best_cwnd = 1, tmp_cwnd;
-+ u8 M = 0, B_not_M = 0;
-+
-+ /* TODO - integrate this in the following loop - we just want to iterate once */
-+
-+ max_cwnd = mptcp_get_max_cwnd(mpcb);
-+
-+ /* find the best path */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ /* TODO - check here and rename variables */
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if ((u64)tmp_int * best_rtt >= (u64)best_int * tmp_rtt) {
-+ best_rtt = tmp_rtt;
-+ best_int = tmp_int;
-+ best_cwnd = tmp_cwnd;
-+ }
-+ }
-+
-+ /* TODO - integrate this here in mptcp_get_max_cwnd and in the previous loop */
-+ /* find the size of M and B_not_M */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+ if (tmp_cwnd == max_cwnd) {
-+ M++;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+
-+ if ((u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt)
-+ B_not_M++;
-+ }
-+ }
-+
-+ /* check if the path is in M or B_not_M and set the value of epsilon accordingly */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ tp = tcp_sk(sk);
-+ ca = inet_csk_ca(sk);
-+
-+ if (!mptcp_olia_sk_can_send(sk))
-+ continue;
-+
-+ if (B_not_M == 0) {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ } else {
-+ tmp_rtt = (u64)tp->srtt_us * tp->srtt_us;
-+ tmp_int = max(ca->mptcp_loss3 - ca->mptcp_loss2,
-+ ca->mptcp_loss2 - ca->mptcp_loss1);
-+ tmp_cwnd = mptcp_get_crt_cwnd(sk);
-+
-+ if (tmp_cwnd < max_cwnd &&
-+ (u64)tmp_int * best_rtt == (u64)best_int * tmp_rtt) {
-+ ca->epsilon_num = 1;
-+ ca->epsilon_den = mpcb->cnt_established * B_not_M;
-+ } else if (tmp_cwnd == max_cwnd) {
-+ ca->epsilon_num = -1;
-+ ca->epsilon_den = mpcb->cnt_established * M;
-+ } else {
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+ }
-+ }
-+}
-+
-+/* setting the initial values */
-+static void mptcp_olia_init(struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (mptcp(tp)) {
-+ ca->mptcp_loss1 = tp->snd_una;
-+ ca->mptcp_loss2 = tp->snd_una;
-+ ca->mptcp_loss3 = tp->snd_una;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ ca->epsilon_num = 0;
-+ ca->epsilon_den = 1;
-+ }
-+}
-+
-+/* updating inter-loss distance and ssthresh */
-+static void mptcp_olia_set_state(struct sock *sk, u8 new_state)
-+{
-+ if (!mptcp(tcp_sk(sk)))
-+ return;
-+
-+ if (new_state == TCP_CA_Loss ||
-+ new_state == TCP_CA_Recovery || new_state == TCP_CA_CWR) {
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+
-+ if (ca->mptcp_loss3 != ca->mptcp_loss2 &&
-+ !inet_csk(sk)->icsk_retransmits) {
-+ ca->mptcp_loss1 = ca->mptcp_loss2;
-+ ca->mptcp_loss2 = ca->mptcp_loss3;
-+ }
-+ }
-+}
-+
-+/* main algorithm */
-+static void mptcp_olia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_olia *ca = inet_csk_ca(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ u64 inc_num, inc_den, rate, cwnd_scaled;
-+
-+ if (!mptcp(tp)) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ ca->mptcp_loss3 = tp->snd_una;
-+
-+ if (!tcp_is_cwnd_limited(sk))
-+ return;
-+
-+ /* slow start if it is in the safe area */
-+ if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ return;
-+ }
-+
-+ mptcp_get_epsilon(mpcb);
-+ rate = mptcp_get_rate(mpcb, tp->srtt_us);
-+ cwnd_scaled = mptcp_olia_scale(tp->snd_cwnd, scale);
-+ inc_den = ca->epsilon_den * tp->snd_cwnd * rate ? : 1;
-+
-+ /* calculate the increasing term, scaling is used to reduce the rounding effect */
-+ if (ca->epsilon_num == -1) {
-+ if (ca->epsilon_den * cwnd_scaled * cwnd_scaled < rate) {
-+ inc_num = rate - ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt -= div64_u64(
-+ mptcp_olia_scale(inc_num , scale) , inc_den);
-+ } else {
-+ inc_num = ca->epsilon_den *
-+ cwnd_scaled * cwnd_scaled - rate;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num , scale) , inc_den);
-+ }
-+ } else {
-+ inc_num = ca->epsilon_num * rate +
-+ ca->epsilon_den * cwnd_scaled * cwnd_scaled;
-+ ca->mptcp_snd_cwnd_cnt += div64_u64(
-+ mptcp_olia_scale(inc_num , scale) , inc_den);
-+ }
-+
-+
-+ if (ca->mptcp_snd_cwnd_cnt >= (1 << scale) - 1) {
-+ if (tp->snd_cwnd < tp->snd_cwnd_clamp)
-+ tp->snd_cwnd++;
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ } else if (ca->mptcp_snd_cwnd_cnt <= 0 - (1 << scale) + 1) {
-+ tp->snd_cwnd = max((int) 1 , (int) tp->snd_cwnd - 1);
-+ ca->mptcp_snd_cwnd_cnt = 0;
-+ }
-+}
-+
-+static struct tcp_congestion_ops mptcp_olia = {
-+ .init = mptcp_olia_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_olia_cong_avoid,
-+ .set_state = mptcp_olia_set_state,
-+ .owner = THIS_MODULE,
-+ .name = "olia",
-+};
-+
-+static int __init mptcp_olia_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct mptcp_olia) > ICSK_CA_PRIV_SIZE);
-+ return tcp_register_congestion_control(&mptcp_olia);
-+}
-+
-+static void __exit mptcp_olia_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_olia);
-+}
-+
-+module_init(mptcp_olia_register);
-+module_exit(mptcp_olia_unregister);
-+
-+MODULE_AUTHOR("Ramin Khalili, Nicolas Gast, Jean-Yves Le Boudec");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP COUPLED CONGESTION CONTROL");
-+MODULE_VERSION("0.1");
-diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
-new file mode 100644
-index 000000000000..400ea254c078
---- /dev/null
-+++ b/net/mptcp/mptcp_output.c
-@@ -0,0 +1,1743 @@
-+/*
-+ * MPTCP implementation - Sending side
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/kconfig.h>
-+#include <linux/skbuff.h>
-+#include <linux/tcp.h>
-+
-+#include <net/mptcp.h>
-+#include <net/mptcp_v4.h>
-+#include <net/mptcp_v6.h>
-+#include <net/sock.h>
-+
-+static const int mptcp_dss_len = MPTCP_SUB_LEN_DSS_ALIGN +
-+ MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+
-+static inline int mptcp_sub_len_remove_addr(u16 bitfield)
-+{
-+ unsigned int c;
-+ for (c = 0; bitfield; c++)
-+ bitfield &= bitfield - 1;
-+ return MPTCP_SUB_LEN_REMOVE_ADDR + c - 1;
-+}
-+
-+int mptcp_sub_len_remove_addr_align(u16 bitfield)
-+{
-+ return ALIGN(mptcp_sub_len_remove_addr(bitfield), 4);
-+}
-+EXPORT_SYMBOL(mptcp_sub_len_remove_addr_align);
-+
-+/* get the data-seq and end-data-seq and store them again in the
-+ * tcp_skb_cb
-+ */
-+static int mptcp_reconstruct_mapping(struct sk_buff *skb)
-+{
-+ const struct mp_dss *mpdss = (struct mp_dss *)TCP_SKB_CB(skb)->dss;
-+ u32 *p32;
-+ u16 *p16;
-+
-+ if (!mpdss->M)
-+ return 1;
-+
-+ /* Move the pointer to the data-seq */
-+ p32 = (u32 *)mpdss;
-+ p32++;
-+ if (mpdss->A) {
-+ p32++;
-+ if (mpdss->a)
-+ p32++;
-+ }
-+
-+ TCP_SKB_CB(skb)->seq = ntohl(*p32);
-+
-+ /* Get the data_len to calculate the end_data_seq */
-+ p32++;
-+ p32++;
-+ p16 = (u16 *)p32;
-+ TCP_SKB_CB(skb)->end_seq = ntohs(*p16) + TCP_SKB_CB(skb)->seq;
-+
-+ return 0;
-+}
-+
-+static void mptcp_find_and_set_pathmask(const struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct sk_buff *skb_it;
-+
-+ skb_it = tcp_write_queue_head(meta_sk);
-+
-+ tcp_for_write_queue_from(skb_it, meta_sk) {
-+ if (skb_it == tcp_send_head(meta_sk))
-+ break;
-+
-+ if (TCP_SKB_CB(skb_it)->seq == TCP_SKB_CB(skb)->seq) {
-+ TCP_SKB_CB(skb)->path_mask = TCP_SKB_CB(skb_it)->path_mask;
-+ break;
-+ }
-+ }
-+}
-+
-+/* Reinject data from one TCP subflow to the meta_sk. If sk == NULL, we are
-+ * coming from the meta-retransmit-timer
-+ */
-+static void __mptcp_reinject_data(struct sk_buff *orig_skb, struct sock *meta_sk,
-+ struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb, *skb1;
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ u32 seq, end_seq;
-+
-+ if (clone_it) {
-+ /* pskb_copy is necessary here, because the TCP/IP-headers
-+ * will be changed when it's going to be reinjected on another
-+ * subflow.
-+ */
-+ skb = pskb_copy_for_clone(orig_skb, GFP_ATOMIC);
-+ } else {
-+ __skb_unlink(orig_skb, &sk->sk_write_queue);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+ sk->sk_wmem_queued -= orig_skb->truesize;
-+ sk_mem_uncharge(sk, orig_skb->truesize);
-+ skb = orig_skb;
-+ }
-+ if (unlikely(!skb))
-+ return;
-+
-+ if (sk && mptcp_reconstruct_mapping(skb)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ skb->sk = meta_sk;
-+
-+ /* If it reached already the destination, we don't have to reinject it */
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ __kfree_skb(skb);
-+ return;
-+ }
-+
-+ /* Only reinject segments that are fully covered by the mapping */
-+ if (skb->len + (mptcp_is_data_fin(skb) ? 1 : 0) !=
-+ TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq) {
-+ u32 seq = TCP_SKB_CB(skb)->seq;
-+ u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ __kfree_skb(skb);
-+
-+ /* Ok, now we have to look for the full mapping in the meta
-+ * send-queue :S
-+ */
-+ tcp_for_write_queue(skb, meta_sk) {
-+ /* Not yet at the mapping? */
-+ if (before(TCP_SKB_CB(skb)->seq, seq))
-+ continue;
-+ /* We have passed by the mapping */
-+ if (after(TCP_SKB_CB(skb)->end_seq, end_seq))
-+ return;
-+
-+ __mptcp_reinject_data(skb, meta_sk, NULL, 1);
-+ }
-+ return;
-+ }
-+
-+ /* Segment goes back to the MPTCP-layer. So, we need to zero the
-+ * path_mask/dss.
-+ */
-+ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
-+
-+ /* We need to find out the path-mask from the meta-write-queue
-+ * to properly select a subflow.
-+ */
-+ mptcp_find_and_set_pathmask(meta_sk, skb);
-+
-+ /* If it's empty, just add */
-+ if (skb_queue_empty(&mpcb->reinject_queue)) {
-+ skb_queue_head(&mpcb->reinject_queue, skb);
-+ return;
-+ }
-+
-+ /* Find place to insert skb - or even we can 'drop' it, as the
-+ * data is already covered by other skb's in the reinject-queue.
-+ *
-+ * This is inspired by code from tcp_data_queue.
-+ */
-+
-+ skb1 = skb_peek_tail(&mpcb->reinject_queue);
-+ seq = TCP_SKB_CB(skb)->seq;
-+ while (1) {
-+ if (!after(TCP_SKB_CB(skb1)->seq, seq))
-+ break;
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1)) {
-+ skb1 = NULL;
-+ break;
-+ }
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+
-+ /* Does skb overlap the previous one? */
-+ end_seq = TCP_SKB_CB(skb)->end_seq;
-+ if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-+ /* All the bits are present. Don't reinject */
-+ __kfree_skb(skb);
-+ return;
-+ }
-+ if (seq == TCP_SKB_CB(skb1)->seq) {
-+ if (skb_queue_is_first(&mpcb->reinject_queue, skb1))
-+ skb1 = NULL;
-+ else
-+ skb1 = skb_queue_prev(&mpcb->reinject_queue, skb1);
-+ }
-+ }
-+ if (!skb1)
-+ __skb_queue_head(&mpcb->reinject_queue, skb);
-+ else
-+ __skb_queue_after(&mpcb->reinject_queue, skb1, skb);
-+
-+ /* And clean segments covered by new one as whole. */
-+ while (!skb_queue_is_last(&mpcb->reinject_queue, skb)) {
-+ skb1 = skb_queue_next(&mpcb->reinject_queue, skb);
-+
-+ if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-+ break;
-+
-+ __skb_unlink(skb1, &mpcb->reinject_queue);
-+ __kfree_skb(skb1);
-+ }
-+ return;
-+}
-+
-+/* Inserts data into the reinject queue */
-+void mptcp_reinject_data(struct sock *sk, int clone_it)
-+{
-+ struct sk_buff *skb_it, *tmp;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct sock *meta_sk = tp->meta_sk;
-+
-+ /* It has already been closed - there is really no point in reinjecting */
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return;
-+
-+ skb_queue_walk_safe(&sk->sk_write_queue, skb_it, tmp) {
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb_it);
-+ /* Subflow syn's and fin's are not reinjected.
-+ *
-+ * As well as empty subflow-fins with a data-fin.
-+ * They are reinjected below (without the subflow-fin-flag)
-+ */
-+ if (tcb->tcp_flags & TCPHDR_SYN ||
-+ (tcb->tcp_flags & TCPHDR_FIN && !mptcp_is_data_fin(skb_it)) ||
-+ (tcb->tcp_flags & TCPHDR_FIN && mptcp_is_data_fin(skb_it) && !skb_it->len))
-+ continue;
-+
-+ __mptcp_reinject_data(skb_it, meta_sk, sk, clone_it);
-+ }
-+
-+ skb_it = tcp_write_queue_tail(meta_sk);
-+ /* If sk has sent the empty data-fin, we have to reinject it too. */
-+ if (skb_it && mptcp_is_data_fin(skb_it) && skb_it->len == 0 &&
-+ TCP_SKB_CB(skb_it)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index)) {
-+ __mptcp_reinject_data(skb_it, meta_sk, NULL, 1);
-+ }
-+
-+ mptcp_push_pending_frames(meta_sk);
-+
-+ tp->pf = 1;
-+}
-+EXPORT_SYMBOL(mptcp_reinject_data);
-+
-+static void mptcp_combine_dfin(const struct sk_buff *skb, const struct sock *meta_sk,
-+ struct sock *subsk)
-+{
-+ const struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk_it;
-+ int all_empty = 1, all_acked;
-+
-+ /* In infinite mapping we always try to combine */
-+ if (mpcb->infinite_mapping_snd && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ return;
-+ }
-+
-+ /* Don't combine, if they didn't combine - otherwise we end up in
-+ * TIME_WAIT, even if our app is smart enough to avoid it
-+ */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN) {
-+ if (!mpcb->dfin_combined)
-+ return;
-+ }
-+
-+ /* If no other subflow has data to send, we can combine */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ if (!mptcp_sk_can_send(sk_it))
-+ continue;
-+
-+ if (!tcp_write_queue_empty(sk_it))
-+ all_empty = 0;
-+ }
-+
-+ /* If all data has been DATA_ACKed, we can combine.
-+ * -1, because the data_fin consumed one byte
-+ */
-+ all_acked = (meta_tp->snd_una == (meta_tp->write_seq - 1));
-+
-+ if ((all_empty || all_acked) && tcp_close_state(subsk)) {
-+ subsk->sk_shutdown |= SEND_SHUTDOWN;
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-+ }
-+}
-+
-+static int mptcp_write_dss_mapping(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ const struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *start = ptr;
-+ __u16 data_len;
-+
-+ *ptr++ = htonl(tcb->seq); /* data_seq */
-+
-+ /* If it's a non-data DATA_FIN, we set subseq to 0 (draft v7) */
-+ if (mptcp_is_data_fin(skb) && skb->len == 0)
-+ *ptr++ = 0; /* subseq */
-+ else
-+ *ptr++ = htonl(tp->write_seq - tp->mptcp->snt_isn); /* subseq */
-+
-+ if (tcb->mptcp_flags & MPTCPHDR_INF)
-+ data_len = 0;
-+ else
-+ data_len = tcb->end_seq - tcb->seq;
-+
-+ if (tp->mpcb->dss_csum && data_len) {
-+ __be16 *p16 = (__be16 *)ptr;
-+ __be32 hdseq = mptcp_get_highorder_sndbits(skb, tp->mpcb);
-+ __wsum csum;
-+
-+ *ptr = htonl(((data_len) << 16) |
-+ (TCPOPT_EOL << 8) |
-+ (TCPOPT_EOL));
-+ csum = csum_partial(ptr - 2, 12, skb->csum);
-+ p16++;
-+ *p16++ = csum_fold(csum_partial(&hdseq, sizeof(hdseq), csum));
-+ } else {
-+ *ptr++ = htonl(((data_len) << 16) |
-+ (TCPOPT_NOP << 8) |
-+ (TCPOPT_NOP));
-+ }
-+
-+ return ptr - start;
-+}
-+
-+static int mptcp_write_dss_data_ack(const struct tcp_sock *tp, const struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ struct mp_dss *mdss = (struct mp_dss *)ptr;
-+ __be32 *start = ptr;
-+
-+ mdss->kind = TCPOPT_MPTCP;
-+ mdss->sub = MPTCP_SUB_DSS;
-+ mdss->rsv1 = 0;
-+ mdss->rsv2 = 0;
-+ mdss->F = mptcp_is_data_fin(skb) ? 1 : 0;
-+ mdss->m = 0;
-+ mdss->M = mptcp_is_data_seq(skb) ? 1 : 0;
-+ mdss->a = 0;
-+ mdss->A = 1;
-+ mdss->len = mptcp_sub_len_dss(mdss, tp->mpcb->dss_csum);
-+ ptr++;
-+
-+ *ptr++ = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ return ptr - start;
-+}
-+
-+/* RFC6824 states that once a particular subflow mapping has been sent
-+ * out it must never be changed. However, packets may be split while
-+ * they are in the retransmission queue (due to SACK or ACKs) and that
-+ * arguably means that we would change the mapping (e.g. it splits it,
-+ * or sends out a subset of the initial mapping).
-+ *
-+ * Furthermore, the skb checksum is not always preserved across splits
-+ * (e.g. mptcp_fragment) which would mean that we need to recompute
-+ * the DSS checksum in this case.
-+ *
-+ * To avoid this we save the initial DSS mapping which allows us to
-+ * send the same DSS mapping even for fragmented retransmits.
-+ */
-+static void mptcp_save_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb)
-+{
-+ struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
-+ __be32 *ptr = (__be32 *)tcb->dss;
-+
-+ tcb->mptcp_flags |= MPTCPHDR_SEQ;
-+
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ ptr += mptcp_write_dss_mapping(tp, skb, ptr);
-+}
-+
-+/* Write the saved DSS mapping to the header */
-+static int mptcp_write_dss_data_seq(const struct tcp_sock *tp, struct sk_buff *skb,
-+ __be32 *ptr)
-+{
-+ __be32 *start = ptr;
-+
-+ memcpy(ptr, TCP_SKB_CB(skb)->dss, mptcp_dss_len);
-+
-+ /* update the data_ack */
-+ start[1] = htonl(mptcp_meta_tp(tp)->rcv_nxt);
-+
-+ /* dss is in a union with inet_skb_parm and
-+ * the IP layer expects zeroed IPCB fields.
-+ */
-+ memset(TCP_SKB_CB(skb)->dss, 0 , mptcp_dss_len);
-+
-+ return mptcp_dss_len/sizeof(*ptr);
-+}
-+
-+static bool mptcp_skb_entail(struct sock *sk, struct sk_buff *skb, int reinject)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ const struct sock *meta_sk = mptcp_meta_sk(sk);
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+ struct tcp_skb_cb *tcb;
-+ struct sk_buff *subskb = NULL;
-+
-+ if (!reinject)
-+ TCP_SKB_CB(skb)->mptcp_flags |= (mpcb->snd_hiseq_index ?
-+ MPTCPHDR_SEQ64_INDEX : 0);
-+
-+ subskb = pskb_copy_for_clone(skb, GFP_ATOMIC);
-+ if (!subskb)
-+ return false;
-+
-+ /* At the subflow-level we need to call again tcp_init_tso_segs. We
-+ * force this, by setting gso_segs to 0. It has been set to 1 prior to
-+ * the call to mptcp_skb_entail.
-+ */
-+ skb_shinfo(subskb)->gso_segs = 0;
-+
-+ TCP_SKB_CB(skb)->path_mask |= mptcp_pi_to_flag(tp->mptcp->path_index);
-+
-+ if (!(sk->sk_route_caps & NETIF_F_ALL_CSUM) &&
-+ skb->ip_summed == CHECKSUM_PARTIAL) {
-+ subskb->csum = skb->csum = skb_checksum(skb, 0, skb->len, 0);
-+ subskb->ip_summed = skb->ip_summed = CHECKSUM_NONE;
-+ }
-+
-+ tcb = TCP_SKB_CB(subskb);
-+
-+ if (tp->mpcb->send_infinite_mapping &&
-+ !tp->mpcb->infinite_mapping_snd &&
-+ !before(tcb->seq, mptcp_meta_tp(tp)->snd_nxt)) {
-+ tp->mptcp->fully_established = 1;
-+ tp->mpcb->infinite_mapping_snd = 1;
-+ tp->mptcp->infinite_cutoff_seq = tp->write_seq;
-+ tcb->mptcp_flags |= MPTCPHDR_INF;
-+ }
-+
-+ if (mptcp_is_data_fin(subskb))
-+ mptcp_combine_dfin(subskb, meta_sk, sk);
-+
-+ mptcp_save_dss_data_seq(tp, subskb);
-+
-+ tcb->seq = tp->write_seq;
-+ tcb->sacked = 0; /* reset the sacked field: from the point of view
-+ * of this subflow, we are sending a brand new
-+ * segment
-+ */
-+ /* Take into account seg len */
-+ tp->write_seq += subskb->len + ((tcb->tcp_flags & TCPHDR_FIN) ? 1 : 0);
-+ tcb->end_seq = tp->write_seq;
-+
-+ /* If it's a non-payload DATA_FIN (also no subflow-fin), the
-+ * segment is not part of the subflow but on a meta-only-level.
-+ */
-+ if (!mptcp_is_data_fin(subskb) || tcb->end_seq != tcb->seq) {
-+ tcp_add_write_queue_tail(sk, subskb);
-+ sk->sk_wmem_queued += subskb->truesize;
-+ sk_mem_charge(sk, subskb->truesize);
-+ } else {
-+ int err;
-+
-+ /* Necessary to initialize for tcp_transmit_skb. mss of 1, as
-+ * skb->len = 0 will force tso_segs to 1.
-+ */
-+ tcp_init_tso_segs(sk, subskb, 1);
-+ /* Empty data-fins are sent immediately on the subflow */
-+ TCP_SKB_CB(subskb)->when = tcp_time_stamp;
-+ err = tcp_transmit_skb(sk, subskb, 1, GFP_ATOMIC);
-+
-+ /* It has not been queued, we can free it now. */
-+ kfree_skb(subskb);
-+
-+ if (err)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ tp->mptcp->second_packet = 1;
-+ tp->mptcp->last_end_data_seq = TCP_SKB_CB(skb)->end_seq;
-+ }
-+
-+ return true;
-+}
-+
-+/* Fragment an skb and update the mptcp meta-data. Due to reinject, we
-+ * might need to undo some operations done by tcp_fragment.
-+ */
-+static int mptcp_fragment(struct sock *meta_sk, struct sk_buff *skb, u32 len,
-+ gfp_t gfp, int reinject)
-+{
-+ int ret, diff, old_factor;
-+ struct sk_buff *buff;
-+ u8 flags;
-+
-+ if (skb_headlen(skb) < len)
-+ diff = skb->len - len;
-+ else
-+ diff = skb->data_len;
-+ old_factor = tcp_skb_pcount(skb);
-+
-+ /* The mss_now in tcp_fragment is used to set the tso_segs of the skb.
-+ * At the MPTCP-level we do not care about the absolute value. All we
-+ * care about is that it is set to 1 for accurate packets_out
-+ * accounting.
-+ */
-+ ret = tcp_fragment(meta_sk, skb, len, UINT_MAX, gfp);
-+ if (ret)
-+ return ret;
-+
-+ buff = skb->next;
-+
-+ flags = TCP_SKB_CB(skb)->mptcp_flags;
-+ TCP_SKB_CB(skb)->mptcp_flags = flags & ~(MPTCPHDR_FIN);
-+ TCP_SKB_CB(buff)->mptcp_flags = flags;
-+ TCP_SKB_CB(buff)->path_mask = TCP_SKB_CB(skb)->path_mask;
-+
-+ /* If reinject == 1, the buff will be added to the reinject
-+ * queue, which is currently not part of memory accounting. So
-+ * undo the changes done by tcp_fragment and update the
-+ * reinject queue. Also, undo changes to the packet counters.
-+ */
-+ if (reinject == 1) {
-+ int undo = buff->truesize - diff;
-+ meta_sk->sk_wmem_queued -= undo;
-+ sk_mem_uncharge(meta_sk, undo);
-+
-+ tcp_sk(meta_sk)->mpcb->reinject_queue.qlen++;
-+ meta_sk->sk_write_queue.qlen--;
-+
-+ if (!before(tcp_sk(meta_sk)->snd_nxt, TCP_SKB_CB(buff)->end_seq)) {
-+ undo = old_factor - tcp_skb_pcount(skb) -
-+ tcp_skb_pcount(buff);
-+ if (undo)
-+ tcp_adjust_pcount(meta_sk, skb, -undo);
-+ }
-+ }
-+
-+ return 0;
-+}
-+
-+/* Inspired by tcp_write_wakeup */
-+int mptcp_write_wakeup(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb;
-+ struct sock *sk_it;
-+ int ans = 0;
-+
-+ if (meta_sk->sk_state == TCP_CLOSE)
-+ return -1;
-+
-+ skb = tcp_send_head(meta_sk);
-+ if (skb &&
-+ before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(meta_tp))) {
-+ unsigned int mss;
-+ unsigned int seg_size = tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq;
-+ struct sock *subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, true);
-+ struct tcp_sock *subtp;
-+ if (!subsk)
-+ goto window_probe;
-+ subtp = tcp_sk(subsk);
-+ mss = tcp_current_mss(subsk);
-+
-+ seg_size = min(tcp_wnd_end(meta_tp) - TCP_SKB_CB(skb)->seq,
-+ tcp_wnd_end(subtp) - subtp->write_seq);
-+
-+ if (before(meta_tp->pushed_seq, TCP_SKB_CB(skb)->end_seq))
-+ meta_tp->pushed_seq = TCP_SKB_CB(skb)->end_seq;
-+
-+ /* We are probing the opening of a window
-+ * but the window size is != 0
-+ * must have been a result of SWS avoidance (sender)
-+ */
-+ if (seg_size < TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq ||
-+ skb->len > mss) {
-+ seg_size = min(seg_size, mss);
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (mptcp_fragment(meta_sk, skb, seg_size,
-+ GFP_ATOMIC, 0))
-+ return -1;
-+ } else if (!tcp_skb_pcount(skb)) {
-+ /* see mptcp_write_xmit on why we use UINT_MAX */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+ }
-+
-+ TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
-+ if (!mptcp_skb_entail(subsk, skb, 0))
-+ return -1;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ mptcp_check_sndseq_wrap(meta_tp, TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+
-+ __tcp_push_pending_frames(subsk, mss, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+ } else {
-+window_probe:
-+ if (between(meta_tp->snd_up, meta_tp->snd_una + 1,
-+ meta_tp->snd_una + 0xFFFF)) {
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ if (mptcp_sk_can_send_ack(sk_it))
-+ tcp_xmit_probe_skb(sk_it, 1);
-+ }
-+ }
-+
-+ /* At least one of the tcp_xmit_probe_skb calls has to succeed */
-+ mptcp_for_each_sk(meta_tp->mpcb, sk_it) {
-+ int ret;
-+
-+ if (!mptcp_sk_can_send_ack(sk_it))
-+ continue;
-+
-+ ret = tcp_xmit_probe_skb(sk_it, 0);
-+ if (unlikely(ret > 0))
-+ ans = ret;
-+ }
-+ return ans;
-+ }
-+}
-+
-+bool mptcp_write_xmit(struct sock *meta_sk, unsigned int mss_now, int nonagle,
-+ int push_one, gfp_t gfp)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk), *subtp;
-+ struct sock *subsk = NULL;
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sk_buff *skb;
-+ unsigned int sent_pkts;
-+ int reinject = 0;
-+ unsigned int sublimit;
-+
-+ sent_pkts = 0;
-+
-+ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
-+ &sublimit))) {
-+ unsigned int limit;
-+
-+ subtp = tcp_sk(subsk);
-+ mss_now = tcp_current_mss(subsk);
-+
-+ if (reinject == 1) {
-+ if (!after(TCP_SKB_CB(skb)->end_seq, meta_tp->snd_una)) {
-+ /* Segment already reached the peer, take the next one */
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ __kfree_skb(skb);
-+ continue;
-+ }
-+ }
-+
-+ /* If the segment was cloned (e.g. a meta retransmission),
-+ * the header must be expanded/copied so that there is no
-+ * corruption of TSO information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC))
-+ break;
-+
-+ if (unlikely(!tcp_snd_wnd_test(meta_tp, skb, mss_now)))
-+ break;
-+
-+ /* Force tso_segs to 1 by using UINT_MAX.
-+ * We actually don't care about the exact number of segments
-+ * emitted on the subflow. We need just to set tso_segs, because
-+ * we still need an accurate packets_out count in
-+ * tcp_event_new_data_sent.
-+ */
-+ tcp_set_skb_tso_segs(meta_sk, skb, UINT_MAX);
-+
-+ /* Check for nagle, regardless of tso_segs. If the segment is
-+ * actually larger than mss_now (TSO segment), then
-+ * tcp_nagle_check will have partial == false and always trigger
-+ * the transmission.
-+ * tcp_write_xmit has a TSO-level nagle check which is not
-+ * subject to the MPTCP-level. It is based on the properties of
-+ * the subflow, not the MPTCP-level.
-+ */
-+ if (unlikely(!tcp_nagle_test(meta_tp, skb, mss_now,
-+ (tcp_skb_is_last(meta_sk, skb) ?
-+ nonagle : TCP_NAGLE_PUSH))))
-+ break;
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ /* We limit the size of the skb so that it fits into the
-+ * window. Call tcp_mss_split_point to avoid duplicating
-+ * code.
-+ * We really only care about fitting the skb into the
-+ * window. That's why we use UINT_MAX. If the skb does
-+ * not fit into the cwnd_quota or the NIC's max-segs
-+ * limitation, it will be split by the subflow's
-+ * tcp_write_xmit which does the appropriate call to
-+ * tcp_mss_split_point.
-+ */
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ nonagle);
-+
-+ if (sublimit)
-+ limit = min(limit, sublimit);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit, gfp, reinject)))
-+ break;
-+
-+ if (!mptcp_skb_entail(subsk, skb, reinject))
-+ break;
-+ /* Nagle is handled at the MPTCP-layer, so
-+ * always push on the subflow
-+ */
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ if (!reinject) {
-+ mptcp_check_sndseq_wrap(meta_tp,
-+ TCP_SKB_CB(skb)->end_seq -
-+ TCP_SKB_CB(skb)->seq);
-+ tcp_event_new_data_sent(meta_sk, skb);
-+ }
-+
-+ tcp_minshall_update(meta_tp, mss_now, skb);
-+ sent_pkts += tcp_skb_pcount(skb);
-+
-+ if (reinject > 0) {
-+ __skb_unlink(skb, &mpcb->reinject_queue);
-+ kfree_skb(skb);
-+ }
-+
-+ if (push_one)
-+ break;
-+ }
-+
-+ return !meta_tp->packets_out && tcp_send_head(meta_sk);
-+}
-+
-+void mptcp_write_space(struct sock *sk)
-+{
-+ mptcp_push_pending_frames(mptcp_meta_sk(sk));
-+}
-+
-+u32 __mptcp_select_window(struct sock *sk)
-+{
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
-+ int mss, free_space, full_space, window;
-+
-+ /* MSS for the peer's data. Previous versions used mss_clamp
-+ * here. I don't know if the value based on our guesses
-+ * of peer's MSS is better for the performance. It's more correct
-+ * but may be worse for the performance because of rcv_mss
-+ * fluctuations. --SAW 1998/11/1
-+ */
-+ mss = icsk->icsk_ack.rcv_mss;
-+ free_space = tcp_space(sk);
-+ full_space = min_t(int, meta_tp->window_clamp,
-+ tcp_full_space(sk));
-+
-+ if (mss > full_space)
-+ mss = full_space;
-+
-+ if (free_space < (full_space >> 1)) {
-+ icsk->icsk_ack.quick = 0;
-+
-+ if (tcp_memory_pressure)
-+ /* TODO this has to be adapted when we support different
-+ * MSS's among the subflows.
-+ */
-+ meta_tp->rcv_ssthresh = min(meta_tp->rcv_ssthresh,
-+ 4U * meta_tp->advmss);
-+
-+ if (free_space < mss)
-+ return 0;
-+ }
-+
-+ if (free_space > meta_tp->rcv_ssthresh)
-+ free_space = meta_tp->rcv_ssthresh;
-+
-+ /* Don't do rounding if we are using window scaling, since the
-+ * scaled window will not line up with the MSS boundary anyway.
-+ */
-+ window = meta_tp->rcv_wnd;
-+ if (tp->rx_opt.rcv_wscale) {
-+ window = free_space;
-+
-+ /* Advertise enough space so that it won't get scaled away.
-+ * Important case: prevent zero window announcement if
-+ * 1<<rcv_wscale > mss.
-+ */
-+ if (((window >> tp->rx_opt.rcv_wscale) << tp->
-+ rx_opt.rcv_wscale) != window)
-+ window = (((window >> tp->rx_opt.rcv_wscale) + 1)
-+ << tp->rx_opt.rcv_wscale);
-+ } else {
-+ /* Get the largest window that is a nice multiple of mss.
-+ * Window clamp already applied above.
-+ * If our current window offering is within 1 mss of the
-+ * free space we just keep it. This prevents the divide
-+ * and multiply from happening most of the time.
-+ * We also don't do any window rounding when the free space
-+ * is too small.
-+ */
-+ if (window <= free_space - mss || window > free_space)
-+ window = (free_space / mss) * mss;
-+ else if (mss == full_space &&
-+ free_space > window + (full_space >> 1))
-+ window = free_space;
-+ }
-+
-+ return window;
-+}
-+
-+void mptcp_syn_options(const struct sock *sk, struct tcp_out_options *opts,
-+ unsigned *remaining)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+
-+ opts->options |= OPTION_MPTCP;
-+ if (is_master_tp(tp)) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ opts->mp_capable.sender_key = tp->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum;
-+ } else {
-+ const struct mptcp_cb *mpcb = tp->mpcb;
-+
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYN;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYN_ALIGN;
-+ opts->mp_join_syns.token = mpcb->mptcp_rem_token;
-+ opts->mp_join_syns.low_prio = tp->mptcp->low_prio;
-+ opts->addr_id = tp->mptcp->loc_id;
-+ opts->mp_join_syns.sender_nonce = tp->mptcp->mptcp_loc_nonce;
-+ }
-+}
-+
-+void mptcp_synack_options(struct request_sock *req,
-+ struct tcp_out_options *opts, unsigned *remaining)
-+{
-+ struct mptcp_request_sock *mtreq;
-+ mtreq = mptcp_rsk(req);
-+
-+ opts->options |= OPTION_MPTCP;
-+ /* MPCB not yet set - thus it's a new MPTCP-session */
-+ if (!mtreq->is_sub) {
-+ opts->mptcp_options |= OPTION_MP_CAPABLE | OPTION_TYPE_SYNACK;
-+ opts->mp_capable.sender_key = mtreq->mptcp_loc_key;
-+ opts->dss_csum = !!sysctl_mptcp_checksum || mtreq->dss_csum;
-+ *remaining -= MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN;
-+ } else {
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_SYNACK;
-+ opts->mp_join_syns.sender_truncated_mac =
-+ mtreq->mptcp_hash_tmac;
-+ opts->mp_join_syns.sender_nonce = mtreq->mptcp_loc_nonce;
-+ opts->mp_join_syns.low_prio = mtreq->low_prio;
-+ opts->addr_id = mtreq->loc_id;
-+ *remaining -= MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN;
-+ }
-+}
-+
-+void mptcp_established_options(struct sock *sk, struct sk_buff *skb,
-+ struct tcp_out_options *opts, unsigned *size)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct mptcp_cb *mpcb = tp->mpcb;
-+ const struct tcp_skb_cb *tcb = skb ? TCP_SKB_CB(skb) : NULL;
-+
-+ /* We are coming from tcp_current_mss with the meta_sk as an argument.
-+ * It does not make sense to check for the options, because when the
-+ * segment gets sent, another subflow will be chosen.
-+ */
-+ if (!skb && is_meta_sk(sk))
-+ return;
-+
-+ /* In fallback mp_fail-mode, we have to repeat it until the fallback
-+ * has been done by the sender
-+ */
-+ if (unlikely(tp->mptcp->send_mp_fail)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FAIL;
-+ *size += MPTCP_SUB_LEN_FAIL;
-+ return;
-+ }
-+
-+ if (unlikely(tp->send_mp_fclose)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_FCLOSE;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ *size += MPTCP_SUB_LEN_FCLOSE_ALIGN;
-+ return;
-+ }
-+
-+ /* 1. If we are the sender of the infinite-mapping, we need the
-+ * MPTCPHDR_INF-flag, because a retransmission of the
-+ * infinite-announcement still needs the mptcp-option.
-+ *
-+ * We need infinite_cutoff_seq, because retransmissions from before
-+ * the infinite-cutoff-moment still need the MPTCP-signalling to stay
-+ * consistent.
-+ *
-+ * 2. If we are the receiver of the infinite-mapping, we always skip
-+ * mptcp-options, because acknowledgments from before the
-+ * infinite-mapping point have already been sent out.
-+ *
-+ * I know, the whole infinite-mapping stuff is ugly...
-+ *
-+ * TODO: Handle wrapped data-sequence numbers
-+ * (even if it's very unlikely)
-+ */
-+ if (unlikely(mpcb->infinite_mapping_snd) &&
-+ ((mpcb->send_infinite_mapping && tcb &&
-+ mptcp_is_data_seq(skb) &&
-+ !(tcb->mptcp_flags & MPTCPHDR_INF) &&
-+ !before(tcb->seq, tp->mptcp->infinite_cutoff_seq)) ||
-+ !mpcb->send_infinite_mapping))
-+ return;
-+
-+ if (unlikely(tp->mptcp->include_mpc)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_CAPABLE |
-+ OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN;
-+ opts->mp_capable.sender_key = mpcb->mptcp_loc_key;
-+ opts->mp_capable.receiver_key = mpcb->mptcp_rem_key;
-+ opts->dss_csum = mpcb->dss_csum;
-+
-+ if (skb)
-+ tp->mptcp->include_mpc = 0;
-+ }
-+ if (unlikely(tp->mptcp->pre_established)) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_JOIN | OPTION_TYPE_ACK;
-+ *size += MPTCP_SUB_LEN_JOIN_ACK_ALIGN;
-+ }
-+
-+ if (!tp->mptcp->include_mpc && !tp->mptcp->pre_established) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_DATA_ACK;
-+ /* If !skb, we come from tcp_current_mss and thus we always
-+ * assume that the DSS-option will be set for the data-packet.
-+ */
-+ if (skb && !mptcp_is_data_seq(skb)) {
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN;
-+ } else {
-+ /* It doesn't matter whether the csum is included or not. The
-+ * length will be either 10 or 12, and thus aligned = 12
-+ */
-+ *size += MPTCP_SUB_LEN_ACK_ALIGN +
-+ MPTCP_SUB_LEN_SEQ_ALIGN;
-+ }
-+
-+ *size += MPTCP_SUB_LEN_DSS_ALIGN;
-+ }
-+
-+ if (unlikely(mpcb->addr_signal) && mpcb->pm_ops->addr_signal)
-+ mpcb->pm_ops->addr_signal(sk, size, opts, skb);
-+
-+ if (unlikely(tp->mptcp->send_mp_prio) &&
-+ MAX_TCP_OPTION_SPACE - *size >= MPTCP_SUB_LEN_PRIO_ALIGN) {
-+ opts->options |= OPTION_MPTCP;
-+ opts->mptcp_options |= OPTION_MP_PRIO;
-+ if (skb)
-+ tp->mptcp->send_mp_prio = 0;
-+ *size += MPTCP_SUB_LEN_PRIO_ALIGN;
-+ }
-+
-+ return;
-+}
-+
-+u16 mptcp_select_window(struct sock *sk)
-+{
-+ u16 new_win = tcp_select_window(sk);
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *meta_tp = mptcp_meta_tp(tp);
-+
-+ meta_tp->rcv_wnd = tp->rcv_wnd;
-+ meta_tp->rcv_wup = meta_tp->rcv_nxt;
-+
-+ return new_win;
-+}
-+
-+void mptcp_options_write(__be32 *ptr, struct tcp_sock *tp,
-+ const struct tcp_out_options *opts,
-+ struct sk_buff *skb)
-+{
-+ if (unlikely(OPTION_MP_CAPABLE & opts->mptcp_options)) {
-+ struct mp_capable *mpc = (struct mp_capable *)ptr;
-+
-+ mpc->kind = TCPOPT_MPTCP;
-+
-+ if ((OPTION_TYPE_SYN & opts->mptcp_options) ||
-+ (OPTION_TYPE_SYNACK & opts->mptcp_options)) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_SYN;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpc->sender_key = opts->mp_capable.sender_key;
-+ mpc->receiver_key = opts->mp_capable.receiver_key;
-+ mpc->len = MPTCP_SUB_LEN_CAPABLE_ACK;
-+ ptr += MPTCP_SUB_LEN_CAPABLE_ACK_ALIGN >> 2;
-+ }
-+
-+ mpc->sub = MPTCP_SUB_CAPABLE;
-+ mpc->ver = 0;
-+ mpc->a = opts->dss_csum;
-+ mpc->b = 0;
-+ mpc->rsv = 0;
-+ mpc->h = 1;
-+ }
-+
-+ if (unlikely(OPTION_MP_JOIN & opts->mptcp_options)) {
-+ struct mp_join *mpj = (struct mp_join *)ptr;
-+
-+ mpj->kind = TCPOPT_MPTCP;
-+ mpj->sub = MPTCP_SUB_JOIN;
-+ mpj->rsv = 0;
-+
-+ if (OPTION_TYPE_SYN & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYN;
-+ mpj->u.syn.token = opts->mp_join_syns.token;
-+ mpj->u.syn.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYN_ALIGN >> 2;
-+ } else if (OPTION_TYPE_SYNACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_SYNACK;
-+ mpj->u.synack.mac =
-+ opts->mp_join_syns.sender_truncated_mac;
-+ mpj->u.synack.nonce = opts->mp_join_syns.sender_nonce;
-+ mpj->b = opts->mp_join_syns.low_prio;
-+ mpj->addr_id = opts->addr_id;
-+ ptr += MPTCP_SUB_LEN_JOIN_SYNACK_ALIGN >> 2;
-+ } else if (OPTION_TYPE_ACK & opts->mptcp_options) {
-+ mpj->len = MPTCP_SUB_LEN_JOIN_ACK;
-+ mpj->addr_id = 0; /* addr_id is rsv (RFC 6824, p. 21) */
-+ memcpy(mpj->u.ack.mac, &tp->mptcp->sender_mac[0], 20);
-+ ptr += MPTCP_SUB_LEN_JOIN_ACK_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_ADD_ADDR & opts->mptcp_options)) {
-+ struct mp_add_addr *mpadd = (struct mp_add_addr *)ptr;
-+
-+ mpadd->kind = TCPOPT_MPTCP;
-+ if (opts->add_addr_v4) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR4;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 4;
-+ mpadd->addr_id = opts->add_addr4.addr_id;
-+ mpadd->u.v4.addr = opts->add_addr4.addr;
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR4_ALIGN >> 2;
-+ } else if (opts->add_addr_v6) {
-+ mpadd->len = MPTCP_SUB_LEN_ADD_ADDR6;
-+ mpadd->sub = MPTCP_SUB_ADD_ADDR;
-+ mpadd->ipver = 6;
-+ mpadd->addr_id = opts->add_addr6.addr_id;
-+ memcpy(&mpadd->u.v6.addr, &opts->add_addr6.addr,
-+ sizeof(mpadd->u.v6.addr));
-+ ptr += MPTCP_SUB_LEN_ADD_ADDR6_ALIGN >> 2;
-+ }
-+ }
-+ if (unlikely(OPTION_REMOVE_ADDR & opts->mptcp_options)) {
-+ struct mp_remove_addr *mprem = (struct mp_remove_addr *)ptr;
-+ u8 *addrs_id;
-+ int id, len, len_align;
-+
-+ len = mptcp_sub_len_remove_addr(opts->remove_addrs);
-+ len_align = mptcp_sub_len_remove_addr_align(opts->remove_addrs);
-+
-+ mprem->kind = TCPOPT_MPTCP;
-+ mprem->len = len;
-+ mprem->sub = MPTCP_SUB_REMOVE_ADDR;
-+ mprem->rsv = 0;
-+ addrs_id = &mprem->addrs_id;
-+
-+ mptcp_for_each_bit_set(opts->remove_addrs, id)
-+ *(addrs_id++) = id;
-+
-+ /* Fill the rest with NOP's */
-+ if (len_align > len) {
-+ int i;
-+ for (i = 0; i < len_align - len; i++)
-+ *(addrs_id++) = TCPOPT_NOP;
-+ }
-+
-+ ptr += len_align >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FAIL & opts->mptcp_options)) {
-+ struct mp_fail *mpfail = (struct mp_fail *)ptr;
-+
-+ mpfail->kind = TCPOPT_MPTCP;
-+ mpfail->len = MPTCP_SUB_LEN_FAIL;
-+ mpfail->sub = MPTCP_SUB_FAIL;
-+ mpfail->rsv1 = 0;
-+ mpfail->rsv2 = 0;
-+ mpfail->data_seq = htonll(tp->mpcb->csum_cutoff_seq);
-+
-+ ptr += MPTCP_SUB_LEN_FAIL_ALIGN >> 2;
-+ }
-+ if (unlikely(OPTION_MP_FCLOSE & opts->mptcp_options)) {
-+ struct mp_fclose *mpfclose = (struct mp_fclose *)ptr;
-+
-+ mpfclose->kind = TCPOPT_MPTCP;
-+ mpfclose->len = MPTCP_SUB_LEN_FCLOSE;
-+ mpfclose->sub = MPTCP_SUB_FCLOSE;
-+ mpfclose->rsv1 = 0;
-+ mpfclose->rsv2 = 0;
-+ mpfclose->key = opts->mp_capable.receiver_key;
-+
-+ ptr += MPTCP_SUB_LEN_FCLOSE_ALIGN >> 2;
-+ }
-+
-+ if (OPTION_DATA_ACK & opts->mptcp_options) {
-+ if (!mptcp_is_data_seq(skb))
-+ ptr += mptcp_write_dss_data_ack(tp, skb, ptr);
-+ else
-+ ptr += mptcp_write_dss_data_seq(tp, skb, ptr);
-+ }
-+ if (unlikely(OPTION_MP_PRIO & opts->mptcp_options)) {
-+ struct mp_prio *mpprio = (struct mp_prio *)ptr;
-+
-+ mpprio->kind = TCPOPT_MPTCP;
-+ mpprio->len = MPTCP_SUB_LEN_PRIO;
-+ mpprio->sub = MPTCP_SUB_PRIO;
-+ mpprio->rsv = 0;
-+ mpprio->b = tp->mptcp->low_prio;
-+ mpprio->addr_id = TCPOPT_NOP;
-+
-+ ptr += MPTCP_SUB_LEN_PRIO_ALIGN >> 2;
-+ }
-+}
-+
-+/* Sends the datafin */
-+void mptcp_send_fin(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sk_buff *skb = tcp_write_queue_tail(meta_sk);
-+ int mss_now;
-+
-+ if ((1 << meta_sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK))
-+ meta_tp->mpcb->passive_close = 1;
-+
-+ /* Optimization, tack on the FIN if we have a queue of
-+ * unsent frames. But be careful about outgoing SACKS
-+ * and IP options.
-+ */
-+ mss_now = mptcp_current_mss(meta_sk);
-+
-+ if (tcp_send_head(meta_sk) != NULL) {
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ TCP_SKB_CB(skb)->end_seq++;
-+ meta_tp->write_seq++;
-+ } else {
-+ /* Socket is locked, keep trying until memory is available. */
-+ for (;;) {
-+ skb = alloc_skb_fclone(MAX_TCP_HEADER,
-+ meta_sk->sk_allocation);
-+ if (skb)
-+ break;
-+ yield();
-+ }
-+ /* Reserve space for headers and prepare control bits. */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+
-+ tcp_init_nondata_skb(skb, meta_tp->write_seq, TCPHDR_ACK);
-+ TCP_SKB_CB(skb)->end_seq++;
-+ TCP_SKB_CB(skb)->mptcp_flags |= MPTCPHDR_FIN;
-+ tcp_queue_skb(meta_sk, skb);
-+ }
-+ __tcp_push_pending_frames(meta_sk, mss_now, TCP_NAGLE_OFF);
-+}
-+
-+void mptcp_send_active_reset(struct sock *meta_sk, gfp_t priority)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct sock *sk = NULL, *sk_it = NULL, *tmpsk;
-+
-+ if (!mpcb->cnt_subflows)
-+ return;
-+
-+ WARN_ON(meta_tp->send_mp_fclose);
-+
-+ /* First - select a socket */
-+ sk = mptcp_select_ack_sock(meta_sk);
-+
-+ /* May happen if no subflow is in an appropriate state */
-+ if (!sk)
-+ return;
-+
-+ /* We are in infinite mode - just send a reset */
-+ if (mpcb->infinite_mapping_snd || mpcb->infinite_mapping_rcv) {
-+ sk->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk->sk_state))
-+ tcp_send_active_reset(sk, priority);
-+ mptcp_sub_force_close(sk);
-+ return;
-+ }
-+
-+
-+ tcp_sk(sk)->send_mp_fclose = 1;
-+ /* Reset all other subflows */
-+
-+ /* tcp_done must be handled with bh disabled */
-+ if (!in_serving_softirq())
-+ local_bh_disable();
-+
-+ mptcp_for_each_sk_safe(mpcb, sk_it, tmpsk) {
-+ if (tcp_sk(sk_it)->send_mp_fclose)
-+ continue;
-+
-+ sk_it->sk_err = ECONNRESET;
-+ if (tcp_need_reset(sk_it->sk_state))
-+ tcp_send_active_reset(sk_it, GFP_ATOMIC);
-+ mptcp_sub_force_close(sk_it);
-+ }
-+
-+ if (!in_serving_softirq())
-+ local_bh_enable();
-+
-+ tcp_send_ack(sk);
-+ inet_csk_reset_keepalive_timer(sk, inet_csk(sk)->icsk_rto);
-+
-+ meta_tp->send_mp_fclose = 1;
-+}
-+
-+static void mptcp_ack_retransmit_timer(struct sock *sk)
-+{
-+ struct sk_buff *skb;
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct inet_connection_sock *icsk = inet_csk(sk);
-+
-+ if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
-+ goto out; /* Routing failure or similar */
-+
-+ if (!tp->retrans_stamp)
-+ tp->retrans_stamp = tcp_time_stamp ? : 1;
-+
-+ if (tcp_write_timeout(sk)) {
-+ tp->mptcp->pre_established = 0;
-+ sk_stop_timer(sk, &tp->mptcp->mptcp_ack_timer);
-+ tp->ops->send_active_reset(sk, GFP_ATOMIC);
-+ goto out;
-+ }
-+
-+ skb = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
-+ if (skb == NULL) {
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+ /* Reserve space for headers and prepare control bits */
-+ skb_reserve(skb, MAX_TCP_HEADER);
-+ tcp_init_nondata_skb(skb, tp->snd_una, TCPHDR_ACK);
-+
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+ if (tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC) > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!icsk->icsk_retransmits)
-+ icsk->icsk_retransmits = 1;
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ return;
-+ }
-+
-+
-+ icsk->icsk_retransmits++;
-+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ sk_reset_timer(sk, &tp->mptcp->mptcp_ack_timer,
-+ jiffies + icsk->icsk_rto);
-+ if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1, 0, 0))
-+ __sk_dst_reset(sk);
-+
-+out:;
-+}
-+
-+void mptcp_ack_handler(unsigned long data)
-+{
-+ struct sock *sk = (struct sock *)data;
-+ struct sock *meta_sk = mptcp_meta_sk(sk);
-+
-+ bh_lock_sock(meta_sk);
-+ if (sock_owned_by_user(meta_sk)) {
-+ /* Try again later */
-+ sk_reset_timer(sk, &tcp_sk(sk)->mptcp->mptcp_ack_timer,
-+ jiffies + (HZ / 20));
-+ goto out_unlock;
-+ }
-+
-+ if (sk->sk_state == TCP_CLOSE)
-+ goto out_unlock;
-+ if (!tcp_sk(sk)->mptcp->pre_established)
-+ goto out_unlock;
-+
-+ mptcp_ack_retransmit_timer(sk);
-+
-+ sk_mem_reclaim(sk);
-+
-+out_unlock:
-+ bh_unlock_sock(meta_sk);
-+ sock_put(sk);
-+}
-+
-+/* Similar to tcp_retransmit_skb
-+ *
-+ * The diff is that we handle the retransmission-stats (retrans_stamp) at the
-+ * meta-level.
-+ */
-+int mptcp_retransmit_skb(struct sock *meta_sk, struct sk_buff *skb)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct sock *subsk;
-+ unsigned int limit, mss_now;
-+ int err = -1;
-+
-+ /* Do not send more than we queued. 1/4 is reserved for possible
-+ * copying overhead: fragmentation, tunneling, mangling etc.
-+ *
-+ * This is a meta-retransmission thus we check on the meta-socket.
-+ */
-+ if (atomic_read(&meta_sk->sk_wmem_alloc) >
-+ min(meta_sk->sk_wmem_queued + (meta_sk->sk_wmem_queued >> 2), meta_sk->sk_sndbuf)) {
-+ return -EAGAIN;
-+ }
-+
-+ /* We need to make sure that the retransmitted segment can be sent on a
-+ * subflow right now. If it is too big, it needs to be fragmented.
-+ */
-+ subsk = meta_tp->mpcb->sched_ops->get_subflow(meta_sk, skb, false);
-+ if (!subsk) {
-+ /* We want to increase icsk_retransmits, thus return 0, so that
-+ * mptcp_retransmit_timer enters the desired branch.
-+ */
-+ err = 0;
-+ goto failed;
-+ }
-+ mss_now = tcp_current_mss(subsk);
-+
-+ /* If the segment was cloned (e.g. a meta retransmission), the header
-+ * must be expanded/copied so that there is no corruption of TSO
-+ * information.
-+ */
-+ if (skb_unclone(skb, GFP_ATOMIC)) {
-+ err = -ENOMEM;
-+ goto failed;
-+ }
-+
-+ /* Must have been set by mptcp_write_xmit before */
-+ BUG_ON(!tcp_skb_pcount(skb));
-+
-+ limit = mss_now;
-+ /* skb->len > mss_now is the equivalent of tso_segs > 1 in
-+ * tcp_write_xmit. Otherwise split-point would return 0.
-+ */
-+ if (skb->len > mss_now && !tcp_urg_mode(meta_tp))
-+ limit = tcp_mss_split_point(meta_sk, skb, mss_now,
-+ UINT_MAX / mss_now,
-+ TCP_NAGLE_OFF);
-+
-+ if (skb->len > limit &&
-+ unlikely(mptcp_fragment(meta_sk, skb, limit,
-+ GFP_ATOMIC, 0)))
-+ goto failed;
-+
-+ if (!mptcp_skb_entail(subsk, skb, -1))
-+ goto failed;
-+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
-+
-+ /* Update global TCP statistics. */
-+ TCP_INC_STATS(sock_net(meta_sk), TCP_MIB_RETRANSSEGS);
-+
-+ /* Diff to tcp_retransmit_skb */
-+
-+ /* Save stamp of the first retransmit. */
-+ if (!meta_tp->retrans_stamp)
-+ meta_tp->retrans_stamp = TCP_SKB_CB(skb)->when;
-+
-+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
-+
-+ return 0;
-+
-+failed:
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPRETRANSFAIL);
-+ return err;
-+}
-+
-+/* Similar to tcp_retransmit_timer
-+ *
-+ * The diff is that we have to handle retransmissions of the FAST_CLOSE-message
-+ * and that we don't have an srtt estimation at the meta-level.
-+ */
-+void mptcp_retransmit_timer(struct sock *meta_sk)
-+{
-+ struct tcp_sock *meta_tp = tcp_sk(meta_sk);
-+ struct mptcp_cb *mpcb = meta_tp->mpcb;
-+ struct inet_connection_sock *meta_icsk = inet_csk(meta_sk);
-+ int err;
-+
-+ /* In fallback, retransmission is handled at the subflow-level */
-+ if (!meta_tp->packets_out || mpcb->infinite_mapping_snd ||
-+ mpcb->send_infinite_mapping)
-+ return;
-+
-+ WARN_ON(tcp_write_queue_empty(meta_sk));
-+
-+ if (!meta_tp->snd_wnd && !sock_flag(meta_sk, SOCK_DEAD) &&
-+ !((1 << meta_sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))) {
-+ /* Receiver dastardly shrinks window. Our retransmits
-+ * become zero probes, but we should not timeout this
-+ * connection. If the socket is an orphan, time it out,
-+ * we cannot allow such beasts to hang infinitely.
-+ */
-+ struct inet_sock *meta_inet = inet_sk(meta_sk);
-+ if (meta_sk->sk_family == AF_INET) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI4:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_inet->inet_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#if IS_ENABLED(CONFIG_IPV6)
-+ else if (meta_sk->sk_family == AF_INET6) {
-+ LIMIT_NETDEBUG(KERN_DEBUG "MPTCP: Peer %pI6:%u/%u unexpectedly shrunk window %u:%u (repaired)\n",
-+ &meta_sk->sk_v6_daddr,
-+ ntohs(meta_inet->inet_dport),
-+ meta_inet->inet_num, meta_tp->snd_una,
-+ meta_tp->snd_nxt);
-+ }
-+#endif
-+ if (tcp_time_stamp - meta_tp->rcv_tstamp > TCP_RTO_MAX) {
-+ tcp_write_err(meta_sk);
-+ return;
-+ }
-+
-+ mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ goto out_reset_timer;
-+ }
-+
-+ if (tcp_write_timeout(meta_sk))
-+ return;
-+
-+ if (meta_icsk->icsk_retransmits == 0)
-+ NET_INC_STATS_BH(sock_net(meta_sk), LINUX_MIB_TCPTIMEOUTS);
-+
-+ meta_icsk->icsk_ca_state = TCP_CA_Loss;
-+
-+ err = mptcp_retransmit_skb(meta_sk, tcp_write_queue_head(meta_sk));
-+ if (err > 0) {
-+ /* Retransmission failed because of local congestion,
-+ * do not backoff.
-+ */
-+ if (!meta_icsk->icsk_retransmits)
-+ meta_icsk->icsk_retransmits = 1;
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS,
-+ min(meta_icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
-+ TCP_RTO_MAX);
-+ return;
-+ }
-+
-+ /* Increase the timeout each time we retransmit. Note that
-+ * we do not increase the rtt estimate. rto is initialized
-+ * from rtt, but increases here. Jacobson (SIGCOMM 88) suggests
-+ * that doubling rto each time is the least we can get away with.
-+ * In KA9Q, Karn uses this for the first few times, and then
-+ * goes to quadratic. netBSD doubles, but only goes up to *64,
-+ * and clamps at 1 to 64 sec afterwards. Note that 120 sec is
-+ * defined in the protocol as the maximum possible RTT. I guess
-+ * we'll have to use something other than TCP to talk to the
-+ * University of Mars.
-+ *
-+ * PAWS allows us longer timeouts and large windows, so once
-+ * implemented ftp to mars will work nicely. We will have to fix
-+ * the 120 second clamps though!
-+ */
-+ meta_icsk->icsk_backoff++;
-+ meta_icsk->icsk_retransmits++;
-+
-+out_reset_timer:
-+ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
-+ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
-+ * might be increased if the stream oscillates between thin and thick,
-+ * thus the old value might already be too high compared to the value
-+ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
-+ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
-+ * exponential backoff behaviour, to avoid hammering
-+ * linear-timeout retransmissions into a black hole.
-+ */
-+ if (meta_sk->sk_state == TCP_ESTABLISHED &&
-+ (meta_tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
-+ tcp_stream_is_thin(meta_tp) &&
-+ meta_icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
-+ meta_icsk->icsk_backoff = 0;
-+ /* We cannot do the same as in tcp_write_timer because the
-+ * srtt is not set here.
-+ */
-+ mptcp_set_rto(meta_sk);
-+ } else {
-+ /* Use normal (exponential) backoff */
-+ meta_icsk->icsk_rto = min(meta_icsk->icsk_rto << 1, TCP_RTO_MAX);
-+ }
-+ inet_csk_reset_xmit_timer(meta_sk, ICSK_TIME_RETRANS, meta_icsk->icsk_rto, TCP_RTO_MAX);
-+
-+ return;
-+}
-+
-+/* Modify values to an mptcp-level for the initial window of new subflows */
-+void mptcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
-+ __u32 *window_clamp, int wscale_ok,
-+ __u8 *rcv_wscale, __u32 init_rcv_wnd,
-+ const struct sock *sk)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(sk)->mpcb;
-+
-+ *window_clamp = mpcb->orig_window_clamp;
-+ __space = tcp_win_from_space(mpcb->orig_sk_rcvbuf);
-+
-+ tcp_select_initial_window(__space, mss, rcv_wnd, window_clamp,
-+ wscale_ok, rcv_wscale, init_rcv_wnd, sk);
-+}
-+
-+static inline u64 mptcp_calc_rate(const struct sock *meta_sk, unsigned int mss,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ struct sock *sk;
-+ u64 rate = 0;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ /* Do not consider subflows without an RTT estimate yet,
-+ * otherwise this_rate >>> rate.
-+ */
-+ if (unlikely(!tp->srtt_us))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* If this_mss is smaller than mss, it means that a segment will
-+ * be split in two (or more) when pushed on this subflow. If
-+ * you consider that mss = 1428 and this_mss = 1420 then two
-+ * segments will be generated: a 1420-byte and 8-byte segment.
-+ * The latter will introduce a large overhead as for a single
-+ * data segment 2 slots will be used in the congestion window.
-+ * Therefore reducing by ~2 the potential throughput of this
-+ * subflow. Indeed, 1428 bytes will be sent while 2840 could have been
-+ * sent if mss == 1420 reducing the throughput by 2840 / 1428.
-+ *
-+ * The following algorithm takes this overhead into account
-+ * when computing the potential throughput that MPTCP can
-+ * achieve when generating mss-byte segments.
-+ *
-+ * The formula is the following:
-+ * \sum_{\forall sub} ratio * \frac{mss * cwnd_sub}{rtt_sub}
-+ * Where ratio is computed as follows:
-+ * \frac{mss}{\ceil{mss / mss_sub} * mss_sub}
-+ *
-+ * ratio gives the reduction factor of the theoretical
-+ * throughput a subflow can achieve if MPTCP uses a specific
-+ * MSS value.
-+ */
-+ this_rate = div64_u64((u64)mss * mss * (USEC_PER_SEC << 3) *
-+ max(tp->snd_cwnd, tp->packets_out),
-+ (u64)tp->srtt_us *
-+ DIV_ROUND_UP(mss, this_mss) * this_mss);
-+ rate += this_rate;
-+ }
-+
-+ return rate;
-+}
-+
-+static unsigned int __mptcp_current_mss(const struct sock *meta_sk,
-+ unsigned int (*mss_cb)(struct sock *sk))
-+{
-+ unsigned int mss = 0;
-+ u64 rate = 0;
-+ struct sock *sk;
-+
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_mss;
-+ u64 this_rate;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_mss = mss_cb(sk);
-+
-+ /* Same mss values will produce the same throughput. */
-+ if (this_mss == mss)
-+ continue;
-+
-+ /* See whether using this mss value can theoretically improve
-+ * performance.
-+ */
-+ this_rate = mptcp_calc_rate(meta_sk, this_mss, mss_cb);
-+ if (this_rate >= rate) {
-+ mss = this_mss;
-+ rate = this_rate;
-+ }
-+ }
-+
-+ return mss;
-+}
-+
-+unsigned int mptcp_current_mss(struct sock *meta_sk)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, tcp_current_mss);
-+
-+ /* If no subflow is available, we take a default-mss from the
-+ * meta-socket.
-+ */
-+ return !mss ? tcp_current_mss(meta_sk) : mss;
-+}
-+
-+static unsigned int mptcp_select_size_mss(struct sock *sk)
-+{
-+ return tcp_sk(sk)->mss_cache;
-+}
-+
-+int mptcp_select_size(const struct sock *meta_sk, bool sg)
-+{
-+ unsigned int mss = __mptcp_current_mss(meta_sk, mptcp_select_size_mss);
-+
-+ if (sg) {
-+ if (mptcp_sk_can_gso(meta_sk)) {
-+ mss = SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
-+ } else {
-+ int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
-+
-+ if (mss >= pgbreak &&
-+ mss <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
-+ mss = pgbreak;
-+ }
-+ }
-+
-+ return !mss ? tcp_sk(meta_sk)->mss_cache : mss;
-+}
-+
-+int mptcp_check_snd_buf(const struct tcp_sock *tp)
-+{
-+ const struct sock *sk;
-+ u32 rtt_max = tp->srtt_us;
-+ u64 bw_est;
-+
-+ if (!tp->srtt_us)
-+ return tp->reordering + 1;
-+
-+ mptcp_for_each_sk(tp->mpcb, sk) {
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ if (rtt_max < tcp_sk(sk)->srtt_us)
-+ rtt_max = tcp_sk(sk)->srtt_us;
-+ }
-+
-+ bw_est = div64_u64(((u64)tp->snd_cwnd * rtt_max) << 16,
-+ (u64)tp->srtt_us);
-+
-+ return max_t(unsigned int, (u32)(bw_est >> 16),
-+ tp->reordering + 1);
-+}
-+
-+unsigned int mptcp_xmit_size_goal(const struct sock *meta_sk, u32 mss_now,
-+ int large_allowed)
-+{
-+ struct sock *sk;
-+ u32 xmit_size_goal = 0;
-+
-+ if (large_allowed && mptcp_sk_can_gso(meta_sk)) {
-+ mptcp_for_each_sk(tcp_sk(meta_sk)->mpcb, sk) {
-+ int this_size_goal;
-+
-+ if (!mptcp_sk_can_send(sk))
-+ continue;
-+
-+ this_size_goal = tcp_xmit_size_goal(sk, mss_now, 1);
-+ if (this_size_goal > xmit_size_goal)
-+ xmit_size_goal = this_size_goal;
-+ }
-+ }
-+
-+ return max(xmit_size_goal, mss_now);
-+}
-+
-+/* Similar to tcp_trim_head - but we correctly copy the DSS-option */
-+int mptcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
-+{
-+ if (skb_cloned(skb)) {
-+ if (pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
-+ return -ENOMEM;
-+ }
-+
-+ __pskb_trim_head(skb, len);
-+
-+ TCP_SKB_CB(skb)->seq += len;
-+ skb->ip_summed = CHECKSUM_PARTIAL;
-+
-+ skb->truesize -= len;
-+ sk->sk_wmem_queued -= len;
-+ sk_mem_uncharge(sk, len);
-+ sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
-+
-+ /* Any change of skb->len requires recalculation of tso factor. */
-+ if (tcp_skb_pcount(skb) > 1)
-+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
-+
-+ return 0;
-+}
-diff --git a/net/mptcp/mptcp_pm.c b/net/mptcp/mptcp_pm.c
-new file mode 100644
-index 000000000000..9542f950729f
---- /dev/null
-+++ b/net/mptcp/mptcp_pm.c
-@@ -0,0 +1,169 @@
-+/*
-+ * MPTCP implementation - MPTCP-subflow-management
-+ *
-+ * Initial Design & Implementation:
-+ * Sébastien Barré <sebastien.barre@uclouvain.be>
-+ *
-+ * Current Maintainer & Author:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * Additional authors:
-+ * Jaakko Korkeaniemi <jaakko.korkeaniemi@aalto.fi>
-+ * Gregory Detal <gregory.detal@uclouvain.be>
-+ * Fabien Duchêne <fabien.duchene@uclouvain.be>
-+ * Andreas Seelinger <Andreas.Seelinger@rwth-aachen.de>
-+ * Lavkesh Lahngir <lavkesh51@gmail.com>
-+ * Andreas Ripke <ripke@neclab.eu>
-+ * Vlad Dogaru <vlad.dogaru@intel.com>
-+ * Octavian Purdila <octavian.purdila@intel.com>
-+ * John Ronan <jronan@tssg.org>
-+ * Catalin Nicutar <catalin.nicutar@gmail.com>
-+ * Brandon Heller <brandonh@stanford.edu>
-+ *
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_pm_list_lock);
-+static LIST_HEAD(mptcp_pm_list);
-+
-+static int mptcp_default_id(sa_family_t family, union inet_addr *addr,
-+ struct net *net, bool *low_prio)
-+{
-+ return 0;
-+}
-+
-+struct mptcp_pm_ops mptcp_pm_default = {
-+ .get_local_id = mptcp_default_id, /* We do not care */
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_pm_ops *mptcp_pm_find(const char *name)
-+{
-+ struct mptcp_pm_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_pm_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ int ret = 0;
-+
-+ if (!pm->get_local_id)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ if (mptcp_pm_find(pm->name)) {
-+ pr_notice("%s already registered\n", pm->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&pm->list, &mptcp_pm_list);
-+ pr_info("%s registered\n", pm->name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_path_manager);
-+
-+void mptcp_unregister_path_manager(struct mptcp_pm_ops *pm)
-+{
-+ spin_lock(&mptcp_pm_list_lock);
-+ list_del_rcu(&pm->list);
-+ spin_unlock(&mptcp_pm_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_path_manager);
-+
-+void mptcp_get_default_path_manager(char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ BUG_ON(list_empty(&mptcp_pm_list));
-+
-+ rcu_read_lock();
-+ pm = list_entry(mptcp_pm_list.next, struct mptcp_pm_ops, list);
-+ strncpy(name, pm->name, MPTCP_PM_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_path_manager(const char *name)
-+{
-+ struct mptcp_pm_ops *pm;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!pm && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_pm_list_lock);
-+ pm = mptcp_pm_find(name);
-+ }
-+#endif
-+
-+ if (pm) {
-+ list_move(&pm->list, &mptcp_pm_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_pm_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_path_manager(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(pm, &mptcp_pm_list, list) {
-+ if (try_module_get(pm->owner)) {
-+ mpcb->pm_ops = pm;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_path_manager(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->pm_ops->owner);
-+}
-+
-+/* Fallback to the default path-manager. */
-+void mptcp_fallback_default(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_pm_ops *pm;
-+
-+ mptcp_cleanup_path_manager(mpcb);
-+ pm = mptcp_pm_find("default");
-+
-+ /* Cannot fail - it's the default module */
-+ try_module_get(pm->owner);
-+ mpcb->pm_ops = pm;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_fallback_default);
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_path_manager_default(void)
-+{
-+ return mptcp_set_default_path_manager(CONFIG_DEFAULT_MPTCP_PM);
-+}
-+late_initcall(mptcp_path_manager_default);
-diff --git a/net/mptcp/mptcp_rr.c b/net/mptcp/mptcp_rr.c
-new file mode 100644
-index 000000000000..93278f684069
---- /dev/null
-+++ b/net/mptcp/mptcp_rr.c
-@@ -0,0 +1,301 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static unsigned char num_segments __read_mostly = 1;
-+module_param(num_segments, byte, 0644);
-+MODULE_PARM_DESC(num_segments, "The number of consecutive segments that are part of a burst");
-+
-+static bool cwnd_limited __read_mostly = 1;
-+module_param(cwnd_limited, bool, 0644);
-+MODULE_PARM_DESC(cwnd_limited, "if set to 1, the scheduler tries to fill the congestion-window on all subflows");
-+
-+struct rrsched_priv {
-+ unsigned char quota;
-+};
-+
-+static struct rrsched_priv *rrsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct rrsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_rr_is_available(const struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test, bool cwnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ if (!cwnd_test)
-+ goto zero_wnd_test;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+zero_wnd_test:
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_rr_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* We just look for any subflow that is available */
-+static struct sock *rr_get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *backupsk = NULL;
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (!mptcp_rr_is_available(sk, skb, zero_wnd_test, true))
-+ continue;
-+
-+ if (mptcp_rr_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ bestsk = sk;
-+ }
-+
-+ if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_rr_next_segment(const struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb)
-+ *reinject = 1;
-+ else
-+ skb = tcp_send_head(meta_sk);
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_rr_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk_it, *choose_sk = NULL;
-+ struct sk_buff *skb = __mptcp_rr_next_segment(meta_sk, reinject);
-+ unsigned char split = num_segments;
-+ unsigned char iter = 0, full_subs = 0;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ if (*reinject) {
-+ *subsk = rr_get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ return skb;
-+ }
-+
-+retry:
-+
-+ /* First, we look for a subflow that is currently being used */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ iter++;
-+
-+ /* Is this subflow currently being used? */
-+ if (rsp->quota > 0 && rsp->quota < num_segments) {
-+ split = num_segments - rsp->quota;
-+ choose_sk = sk_it;
-+ goto found;
-+ }
-+
-+ /* Or, it's totally unused */
-+ if (!rsp->quota) {
-+ split = num_segments;
-+ choose_sk = sk_it;
-+ }
-+
-+ /* Or, it must then be fully used */
-+ if (rsp->quota == num_segments)
-+ full_subs++;
-+ }
-+
-+ /* All considered subflows have a full quota, and we considered at
-+ * least one.
-+ */
-+ if (iter && iter == full_subs) {
-+ /* So, we restart this round by setting quota to 0 and retry
-+ * to find a subflow.
-+ */
-+ mptcp_for_each_sk(mpcb, sk_it) {
-+ struct tcp_sock *tp_it = tcp_sk(sk_it);
-+ struct rrsched_priv *rsp = rrsched_get_priv(tp_it);
-+
-+ if (!mptcp_rr_is_available(sk_it, skb, false, cwnd_limited))
-+ continue;
-+
-+ rsp->quota = 0;
-+ }
-+
-+ goto retry;
-+ }
-+
-+found:
-+ if (choose_sk) {
-+ unsigned int mss_now;
-+ struct tcp_sock *choose_tp = tcp_sk(choose_sk);
-+ struct rrsched_priv *rsp = rrsched_get_priv(choose_tp);
-+
-+ if (!mptcp_rr_is_available(choose_sk, skb, false, true))
-+ return NULL;
-+
-+ *subsk = choose_sk;
-+ mss_now = tcp_current_mss(*subsk);
-+ *limit = split * mss_now;
-+
-+ if (skb->len > mss_now)
-+ rsp->quota += DIV_ROUND_UP(skb->len, mss_now);
-+ else
-+ rsp->quota++;
-+
-+ return skb;
-+ }
-+
-+ return NULL;
-+}
-+
-+static struct mptcp_sched_ops mptcp_sched_rr = {
-+ .get_subflow = rr_get_available_subflow,
-+ .next_segment = mptcp_rr_next_segment,
-+ .name = "roundrobin",
-+ .owner = THIS_MODULE,
-+};
-+
-+static int __init rr_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct rrsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ if (mptcp_register_scheduler(&mptcp_sched_rr))
-+ return -1;
-+
-+ return 0;
-+}
-+
-+static void rr_unregister(void)
-+{
-+ mptcp_unregister_scheduler(&mptcp_sched_rr);
-+}
-+
-+module_init(rr_register);
-+module_exit(rr_unregister);
-+
-+MODULE_AUTHOR("Christoph Paasch");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("ROUNDROBIN MPTCP");
-+MODULE_VERSION("0.89");
-diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
-new file mode 100644
-index 000000000000..6c7ff4eceac1
---- /dev/null
-+++ b/net/mptcp/mptcp_sched.c
-@@ -0,0 +1,493 @@
-+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
-+
-+#include <linux/module.h>
-+#include <net/mptcp.h>
-+
-+static DEFINE_SPINLOCK(mptcp_sched_list_lock);
-+static LIST_HEAD(mptcp_sched_list);
-+
-+struct defsched_priv {
-+ u32 last_rbuf_opti;
-+};
-+
-+static struct defsched_priv *defsched_get_priv(const struct tcp_sock *tp)
-+{
-+ return (struct defsched_priv *)&tp->mptcp->mptcp_sched[0];
-+}
-+
-+/* Is the sub-socket sk available to send the skb? */
-+static bool mptcp_is_available(struct sock *sk, const struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ unsigned int mss_now, space, in_flight;
-+
-+ /* Set of states for which we are allowed to send data */
-+ if (!mptcp_sk_can_send(sk))
-+ return false;
-+
-+ /* We do not send data on this subflow unless it is
-+ * fully established, i.e. the 4th ack has been received.
-+ */
-+ if (tp->mptcp->pre_established)
-+ return false;
-+
-+ if (tp->pf)
-+ return false;
-+
-+ if (inet_csk(sk)->icsk_ca_state == TCP_CA_Loss) {
-+ /* If SACK is disabled, and we got a loss, TCP does not exit
-+ * the loss-state until something above high_seq has been acked.
-+ * (see tcp_try_undo_recovery)
-+ *
-+ * high_seq is the snd_nxt at the moment of the RTO. As soon
-+ * as we have an RTO, we won't push data on the subflow.
-+ * Thus, snd_una can never go beyond high_seq.
-+ */
-+ if (!tcp_is_reno(tp))
-+ return false;
-+ else if (tp->snd_una != tp->high_seq)
-+ return false;
-+ }
-+
-+ if (!tp->mptcp->fully_established) {
-+ /* Make sure that we send in-order data */
-+ if (skb && tp->mptcp->second_packet &&
-+ tp->mptcp->last_end_data_seq != TCP_SKB_CB(skb)->seq)
-+ return false;
-+ }
-+
-+ /* If TSQ is already throttling us, do not send on this subflow. When
-+ * TSQ gets cleared the subflow becomes eligible again.
-+ */
-+ if (test_bit(TSQ_THROTTLED, &tp->tsq_flags))
-+ return false;
-+
-+ in_flight = tcp_packets_in_flight(tp);
-+ /* Not even a single spot in the cwnd */
-+ if (in_flight >= tp->snd_cwnd)
-+ return false;
-+
-+ /* Now, check if what is queued in the subflow's send-queue
-+ * already fills the cwnd.
-+ */
-+ space = (tp->snd_cwnd - in_flight) * tp->mss_cache;
-+
-+ if (tp->write_seq - tp->snd_nxt > space)
-+ return false;
-+
-+ if (zero_wnd_test && !before(tp->write_seq, tcp_wnd_end(tp)))
-+ return false;
-+
-+ mss_now = tcp_current_mss(sk);
-+
-+ /* Don't send on this subflow if we bypass the allowed send-window at
-+ * the per-subflow level. Similar to tcp_snd_wnd_test, but manually
-+ * calculated end_seq (because here at this point end_seq is still at
-+ * the meta-level).
-+ */
-+ if (skb && !zero_wnd_test &&
-+ after(tp->write_seq + min(skb->len, mss_now), tcp_wnd_end(tp)))
-+ return false;
-+
-+ return true;
-+}
-+
-+/* Are we not allowed to reinject this skb on tp? */
-+static int mptcp_dont_reinject_skb(const struct tcp_sock *tp, const struct sk_buff *skb)
-+{
-+ /* If the skb has already been enqueued in this sk, try to find
-+ * another one.
-+ */
-+ return skb &&
-+ /* Has the skb already been enqueued into this subsocket? */
-+ mptcp_pi_to_flag(tp->mptcp->path_index) & TCP_SKB_CB(skb)->path_mask;
-+}
-+
-+/* This is the scheduler. This function decides on which flow to send
-+ * a given MSS. If all subflows are found to be busy, NULL is returned
-+ * The flow is selected based on the shortest RTT.
-+ * If all paths have full cong windows, we simply return NULL.
-+ *
-+ * Additionally, this function is aware of the backup-subflows.
-+ */
-+static struct sock *get_available_subflow(struct sock *meta_sk,
-+ struct sk_buff *skb,
-+ bool zero_wnd_test)
-+{
-+ struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sock *sk, *bestsk = NULL, *lowpriosk = NULL, *backupsk = NULL;
-+ u32 min_time_to_peer = 0xffffffff, lowprio_min_time_to_peer = 0xffffffff;
-+ int cnt_backups = 0;
-+
-+ /* if there is only one subflow, bypass the scheduling function */
-+ if (mpcb->cnt_subflows == 1) {
-+ bestsk = (struct sock *)mpcb->connection_list;
-+ if (!mptcp_is_available(bestsk, skb, zero_wnd_test))
-+ bestsk = NULL;
-+ return bestsk;
-+ }
-+
-+ /* Answer data_fin on same subflow!!! */
-+ if (meta_sk->sk_shutdown & RCV_SHUTDOWN &&
-+ skb && mptcp_is_data_fin(skb)) {
-+ mptcp_for_each_sk(mpcb, sk) {
-+ if (tcp_sk(sk)->mptcp->path_index == mpcb->dfin_path_index &&
-+ mptcp_is_available(sk, skb, zero_wnd_test))
-+ return sk;
-+ }
-+ }
-+
-+ /* First, find the best subflow */
-+ mptcp_for_each_sk(mpcb, sk) {
-+ struct tcp_sock *tp = tcp_sk(sk);
-+
-+ if (tp->mptcp->rcv_low_prio || tp->mptcp->low_prio)
-+ cnt_backups++;
-+
-+ if ((tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < lowprio_min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ lowprio_min_time_to_peer = tp->srtt_us;
-+ lowpriosk = sk;
-+ } else if (!(tp->mptcp->rcv_low_prio || tp->mptcp->low_prio) &&
-+ tp->srtt_us < min_time_to_peer) {
-+ if (!mptcp_is_available(sk, skb, zero_wnd_test))
-+ continue;
-+
-+ if (mptcp_dont_reinject_skb(tp, skb)) {
-+ backupsk = sk;
-+ continue;
-+ }
-+
-+ min_time_to_peer = tp->srtt_us;
-+ bestsk = sk;
-+ }
-+ }
-+
-+ if (mpcb->cnt_established == cnt_backups && lowpriosk) {
-+ sk = lowpriosk;
-+ } else if (bestsk) {
-+ sk = bestsk;
-+ } else if (backupsk) {
-+ /* It has been sent on all subflows once - let's give it a
-+ * chance again by restarting its pathmask.
-+ */
-+ if (skb)
-+ TCP_SKB_CB(skb)->path_mask = 0;
-+ sk = backupsk;
-+ }
-+
-+ return sk;
-+}
-+
-+static struct sk_buff *mptcp_rcv_buf_optimization(struct sock *sk, int penal)
-+{
-+ struct sock *meta_sk;
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct tcp_sock *tp_it;
-+ struct sk_buff *skb_head;
-+ struct defsched_priv *dsp = defsched_get_priv(tp);
-+
-+ if (tp->mpcb->cnt_subflows == 1)
-+ return NULL;
-+
-+ meta_sk = mptcp_meta_sk(sk);
-+ skb_head = tcp_write_queue_head(meta_sk);
-+
-+ if (!skb_head || skb_head == tcp_send_head(meta_sk))
-+ return NULL;
-+
-+ /* If penalization is optional (coming from mptcp_next_segment()) and
-+ * we are not send-buffer-limited, we do not penalize. The retransmission
-+ * is just an optimization to fix the idle-time due to the delay before
-+ * we wake up the application.
-+ */
-+ if (!penal && sk_stream_memory_free(meta_sk))
-+ goto retrans;
-+
-+ /* Only penalize again after an RTT has elapsed */
-+ if (tcp_time_stamp - dsp->last_rbuf_opti < usecs_to_jiffies(tp->srtt_us >> 3))
-+ goto retrans;
-+
-+ /* Half the cwnd of the slow flow */
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
-+ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
-+ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
-+ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+ }
-+ break;
-+ }
-+ }
-+
-+retrans:
-+
-+ /* Segment not yet injected into this path? Take it!!! */
-+ if (!(TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp->mptcp->path_index))) {
-+ bool do_retrans = false;
-+ mptcp_for_each_tp(tp->mpcb, tp_it) {
-+ if (tp_it != tp &&
-+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
-+ if (tp_it->snd_cwnd <= 4) {
-+ do_retrans = true;
-+ break;
-+ }
-+
-+ if (4 * tp->srtt_us >= tp_it->srtt_us) {
-+ do_retrans = false;
-+ break;
-+ } else {
-+ do_retrans = true;
-+ }
-+ }
-+ }
-+
-+ if (do_retrans && mptcp_is_available(sk, skb_head, false))
-+ return skb_head;
-+ }
-+ return NULL;
-+}
-+
-+/* Returns the next segment to be sent from the mptcp meta-queue.
-+ * (chooses the reinject queue if any segment is waiting in it, otherwise,
-+ * chooses the normal write queue).
-+ * Sets *@reinject to 1 if the returned segment comes from the
-+ * reinject queue. Sets it to 0 if it is the regular send-head of the meta-sk,
-+ * and sets it to -1 if it is a meta-level retransmission to optimize the
-+ * receive-buffer.
-+ */
-+static struct sk_buff *__mptcp_next_segment(struct sock *meta_sk, int *reinject)
-+{
-+ const struct mptcp_cb *mpcb = tcp_sk(meta_sk)->mpcb;
-+ struct sk_buff *skb = NULL;
-+
-+ *reinject = 0;
-+
-+ /* If we are in fallback-mode, just take from the meta-send-queue */
-+ if (mpcb->infinite_mapping_snd || mpcb->send_infinite_mapping)
-+ return tcp_send_head(meta_sk);
-+
-+ skb = skb_peek(&mpcb->reinject_queue);
-+
-+ if (skb) {
-+ *reinject = 1;
-+ } else {
-+ skb = tcp_send_head(meta_sk);
-+
-+ if (!skb && meta_sk->sk_socket &&
-+ test_bit(SOCK_NOSPACE, &meta_sk->sk_socket->flags) &&
-+ sk_stream_wspace(meta_sk) < sk_stream_min_wspace(meta_sk)) {
-+ struct sock *subsk = get_available_subflow(meta_sk, NULL,
-+ false);
-+ if (!subsk)
-+ return NULL;
-+
-+ skb = mptcp_rcv_buf_optimization(subsk, 0);
-+ if (skb)
-+ *reinject = -1;
-+ }
-+ }
-+ return skb;
-+}
-+
-+static struct sk_buff *mptcp_next_segment(struct sock *meta_sk,
-+ int *reinject,
-+ struct sock **subsk,
-+ unsigned int *limit)
-+{
-+ struct sk_buff *skb = __mptcp_next_segment(meta_sk, reinject);
-+ unsigned int mss_now;
-+ struct tcp_sock *subtp;
-+ u16 gso_max_segs;
-+ u32 max_len, max_segs, window, needed;
-+
-+ /* As we set it, we have to reset it as well. */
-+ *limit = 0;
-+
-+ if (!skb)
-+ return NULL;
-+
-+ *subsk = get_available_subflow(meta_sk, skb, false);
-+ if (!*subsk)
-+ return NULL;
-+
-+ subtp = tcp_sk(*subsk);
-+ mss_now = tcp_current_mss(*subsk);
-+
-+ if (!*reinject && unlikely(!tcp_snd_wnd_test(tcp_sk(meta_sk), skb, mss_now))) {
-+ skb = mptcp_rcv_buf_optimization(*subsk, 1);
-+ if (skb)
-+ *reinject = -1;
-+ else
-+ return NULL;
-+ }
-+
-+ /* No splitting required, as we will only send one single segment */
-+ if (skb->len <= mss_now)
-+ return skb;
-+
-+ /* The following is similar to tcp_mss_split_point, but
-+ * we do not care about nagle, because we will use
-+ * TCP_NAGLE_PUSH anyway, which overrides this.
-+ *
-+ * So, we first limit according to the cwnd/gso-size and then according
-+ * to the subflow's window.
-+ */
-+
-+ gso_max_segs = (*subsk)->sk_gso_max_segs;
-+ if (!gso_max_segs) /* No gso supported on the subflow's NIC */
-+ gso_max_segs = 1;
-+ max_segs = min_t(unsigned int, tcp_cwnd_test(subtp, skb), gso_max_segs);
-+ if (!max_segs)
-+ return NULL;
-+
-+ max_len = mss_now * max_segs;
-+ window = tcp_wnd_end(subtp) - subtp->write_seq;
-+
-+ needed = min(skb->len, window);
-+ if (max_len <= skb->len)
-+ /* Take max_win, which is actually the cwnd/gso-size */
-+ *limit = max_len;
-+ else
-+ /* Or, take the window */
-+ *limit = needed;
-+
-+ return skb;
-+}
-+
-+static void defsched_init(struct sock *sk)
-+{
-+ struct defsched_priv *dsp = defsched_get_priv(tcp_sk(sk));
-+
-+ dsp->last_rbuf_opti = tcp_time_stamp;
-+}
-+
-+struct mptcp_sched_ops mptcp_sched_default = {
-+ .get_subflow = get_available_subflow,
-+ .next_segment = mptcp_next_segment,
-+ .init = defsched_init,
-+ .name = "default",
-+ .owner = THIS_MODULE,
-+};
-+
-+static struct mptcp_sched_ops *mptcp_sched_find(const char *name)
-+{
-+ struct mptcp_sched_ops *e;
-+
-+ list_for_each_entry_rcu(e, &mptcp_sched_list, list) {
-+ if (strcmp(e->name, name) == 0)
-+ return e;
-+ }
-+
-+ return NULL;
-+}
-+
-+int mptcp_register_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ int ret = 0;
-+
-+ if (!sched->get_subflow || !sched->next_segment)
-+ return -EINVAL;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ if (mptcp_sched_find(sched->name)) {
-+ pr_notice("%s already registered\n", sched->name);
-+ ret = -EEXIST;
-+ } else {
-+ list_add_tail_rcu(&sched->list, &mptcp_sched_list);
-+ pr_info("%s registered\n", sched->name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+EXPORT_SYMBOL_GPL(mptcp_register_scheduler);
-+
-+void mptcp_unregister_scheduler(struct mptcp_sched_ops *sched)
-+{
-+ spin_lock(&mptcp_sched_list_lock);
-+ list_del_rcu(&sched->list);
-+ spin_unlock(&mptcp_sched_list_lock);
-+}
-+EXPORT_SYMBOL_GPL(mptcp_unregister_scheduler);
-+
-+void mptcp_get_default_scheduler(char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ BUG_ON(list_empty(&mptcp_sched_list));
-+
-+ rcu_read_lock();
-+ sched = list_entry(mptcp_sched_list.next, struct mptcp_sched_ops, list);
-+ strncpy(name, sched->name, MPTCP_SCHED_NAME_MAX);
-+ rcu_read_unlock();
-+}
-+
-+int mptcp_set_default_scheduler(const char *name)
-+{
-+ struct mptcp_sched_ops *sched;
-+ int ret = -ENOENT;
-+
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+#ifdef CONFIG_MODULES
-+ if (!sched && capable(CAP_NET_ADMIN)) {
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ request_module("mptcp_%s", name);
-+ spin_lock(&mptcp_sched_list_lock);
-+ sched = mptcp_sched_find(name);
-+ }
-+#endif
-+
-+ if (sched) {
-+ list_move(&sched->list, &mptcp_sched_list);
-+ ret = 0;
-+ } else {
-+ pr_info("%s is not available\n", name);
-+ }
-+ spin_unlock(&mptcp_sched_list_lock);
-+
-+ return ret;
-+}
-+
-+void mptcp_init_scheduler(struct mptcp_cb *mpcb)
-+{
-+ struct mptcp_sched_ops *sched;
-+
-+ rcu_read_lock();
-+ list_for_each_entry_rcu(sched, &mptcp_sched_list, list) {
-+ if (try_module_get(sched->owner)) {
-+ mpcb->sched_ops = sched;
-+ break;
-+ }
-+ }
-+ rcu_read_unlock();
-+}
-+
-+/* Manage refcounts on socket close. */
-+void mptcp_cleanup_scheduler(struct mptcp_cb *mpcb)
-+{
-+ module_put(mpcb->sched_ops->owner);
-+}
-+
-+/* Set default value from kernel configuration at bootup */
-+static int __init mptcp_scheduler_default(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct defsched_priv) > MPTCP_SCHED_SIZE);
-+
-+ return mptcp_set_default_scheduler(CONFIG_DEFAULT_MPTCP_SCHED);
-+}
-+late_initcall(mptcp_scheduler_default);
-diff --git a/net/mptcp/mptcp_wvegas.c b/net/mptcp/mptcp_wvegas.c
-new file mode 100644
-index 000000000000..29ca1d868d17
---- /dev/null
-+++ b/net/mptcp/mptcp_wvegas.c
-@@ -0,0 +1,268 @@
-+/*
-+ * MPTCP implementation - WEIGHTED VEGAS
-+ *
-+ * Algorithm design:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Mingwei Xu <xmw@csnet1.cs.tsinghua.edu.cn>
-+ * Xiaoming Fu <fu@cs.uni-goettinggen.de>
-+ *
-+ * Implementation:
-+ * Yu Cao <cyAnalyst@126.com>
-+ * Enhuan Dong <deh13@mails.tsinghua.edu.cn>
-+ *
-+ * Ported to the official MPTCP-kernel:
-+ * Christoph Paasch <christoph.paasch@uclouvain.be>
-+ *
-+ * This program is free software; you can redistribute it and/or
-+ * modify it under the terms of the GNU General Public License
-+ * as published by the Free Software Foundation; either version
-+ * 2 of the License, or (at your option) any later version.
-+ */
-+
-+#include <linux/skbuff.h>
-+#include <net/tcp.h>
-+#include <net/mptcp.h>
-+#include <linux/module.h>
-+#include <linux/tcp.h>
-+
-+static int initial_alpha = 2;
-+static int total_alpha = 10;
-+static int gamma = 1;
-+
-+module_param(initial_alpha, int, 0644);
-+MODULE_PARM_DESC(initial_alpha, "initial alpha for all subflows");
-+module_param(total_alpha, int, 0644);
-+MODULE_PARM_DESC(total_alpha, "total alpha for all subflows");
-+module_param(gamma, int, 0644);
-+MODULE_PARM_DESC(gamma, "limit on increase (scale by 2)");
-+
-+#define MPTCP_WVEGAS_SCALE 16
-+
-+/* wVegas variables */
-+struct wvegas {
-+ u32 beg_snd_nxt; /* right edge during last RTT */
-+ u8 doing_wvegas_now;/* if true, do wvegas for this RTT */
-+
-+ u16 cnt_rtt; /* # of RTTs measured within last RTT */
-+ u32 sampled_rtt; /* cumulative RTTs measured within last RTT (in usec) */
-+ u32 base_rtt; /* the min of all wVegas RTT measurements seen (in usec) */
-+
-+ u64 instant_rate; /* cwnd / srtt_us, unit: pkts/us * 2^16 */
-+ u64 weight; /* the ratio of subflow's rate to the total rate, * 2^16 */
-+ int alpha; /* alpha for each subflows */
-+
-+ u32 queue_delay; /* queue delay*/
-+};
-+
-+
-+static inline u64 mptcp_wvegas_scale(u32 val, int scale)
-+{
-+ return (u64) val << scale;
-+}
-+
-+static void wvegas_enable(const struct sock *sk)
-+{
-+ const struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 1;
-+
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+
-+ wvegas->instant_rate = 0;
-+ wvegas->alpha = initial_alpha;
-+ wvegas->weight = mptcp_wvegas_scale(1, MPTCP_WVEGAS_SCALE);
-+
-+ wvegas->queue_delay = 0;
-+}
-+
-+static inline void wvegas_disable(const struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->doing_wvegas_now = 0;
-+}
-+
-+static void mptcp_wvegas_init(struct sock *sk)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ wvegas->base_rtt = 0x7fffffff;
-+ wvegas_enable(sk);
-+}
-+
-+static inline u64 mptcp_wvegas_rate(u32 cwnd, u32 rtt_us)
-+{
-+ return div_u64(mptcp_wvegas_scale(cwnd, MPTCP_WVEGAS_SCALE), rtt_us);
-+}
-+
-+static void mptcp_wvegas_pkts_acked(struct sock *sk, u32 cnt, s32 rtt_us)
-+{
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ u32 vrtt;
-+
-+ if (rtt_us < 0)
-+ return;
-+
-+ vrtt = rtt_us + 1;
-+
-+ if (vrtt < wvegas->base_rtt)
-+ wvegas->base_rtt = vrtt;
-+
-+ wvegas->sampled_rtt += vrtt;
-+ wvegas->cnt_rtt++;
-+}
-+
-+static void mptcp_wvegas_state(struct sock *sk, u8 ca_state)
-+{
-+ if (ca_state == TCP_CA_Open)
-+ wvegas_enable(sk);
-+ else
-+ wvegas_disable(sk);
-+}
-+
-+static void mptcp_wvegas_cwnd_event(struct sock *sk, enum tcp_ca_event event)
-+{
-+ if (event == CA_EVENT_CWND_RESTART) {
-+ mptcp_wvegas_init(sk);
-+ } else if (event == CA_EVENT_LOSS) {
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+ wvegas->instant_rate = 0;
-+ }
-+}
-+
-+static inline u32 mptcp_wvegas_ssthresh(const struct tcp_sock *tp)
-+{
-+ return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
-+}
-+
-+static u64 mptcp_wvegas_weight(const struct mptcp_cb *mpcb, const struct sock *sk)
-+{
-+ u64 total_rate = 0;
-+ struct sock *sub_sk;
-+ const struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!mpcb)
-+ return wvegas->weight;
-+
-+
-+ mptcp_for_each_sk(mpcb, sub_sk) {
-+ struct wvegas *sub_wvegas = inet_csk_ca(sub_sk);
-+
-+ /* sampled_rtt is initialized by 0 */
-+ if (mptcp_sk_can_send(sub_sk) && (sub_wvegas->sampled_rtt > 0))
-+ total_rate += sub_wvegas->instant_rate;
-+ }
-+
-+ if (total_rate && wvegas->instant_rate)
-+ return div64_u64(mptcp_wvegas_scale(wvegas->instant_rate, MPTCP_WVEGAS_SCALE), total_rate);
-+ else
-+ return wvegas->weight;
-+}
-+
-+static void mptcp_wvegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-+{
-+ struct tcp_sock *tp = tcp_sk(sk);
-+ struct wvegas *wvegas = inet_csk_ca(sk);
-+
-+ if (!wvegas->doing_wvegas_now) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ return;
-+ }
-+
-+ if (after(ack, wvegas->beg_snd_nxt)) {
-+ wvegas->beg_snd_nxt = tp->snd_nxt;
-+
-+ if (wvegas->cnt_rtt <= 2) {
-+ tcp_reno_cong_avoid(sk, ack, acked);
-+ } else {
-+ u32 rtt, diff, q_delay;
-+ u64 target_cwnd;
-+
-+ rtt = wvegas->sampled_rtt / wvegas->cnt_rtt;
-+ target_cwnd = div_u64(((u64)tp->snd_cwnd * wvegas->base_rtt), rtt);
-+
-+ diff = div_u64((u64)tp->snd_cwnd * (rtt - wvegas->base_rtt), rtt);
-+
-+ if (diff > gamma && tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tp->snd_cwnd = min(tp->snd_cwnd, (u32)target_cwnd+1);
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+
-+ } else if (tp->snd_cwnd <= tp->snd_ssthresh) {
-+ tcp_slow_start(tp, acked);
-+ } else {
-+ if (diff >= wvegas->alpha) {
-+ wvegas->instant_rate = mptcp_wvegas_rate(tp->snd_cwnd, rtt);
-+ wvegas->weight = mptcp_wvegas_weight(tp->mpcb, sk);
-+ wvegas->alpha = max(2U, (u32)((wvegas->weight * total_alpha) >> MPTCP_WVEGAS_SCALE));
-+ }
-+ if (diff > wvegas->alpha) {
-+ tp->snd_cwnd--;
-+ tp->snd_ssthresh = mptcp_wvegas_ssthresh(tp);
-+ } else if (diff < wvegas->alpha) {
-+ tp->snd_cwnd++;
-+ }
-+
-+ /* Try to drain link queue if needed*/
-+ q_delay = rtt - wvegas->base_rtt;
-+ if ((wvegas->queue_delay == 0) || (wvegas->queue_delay > q_delay))
-+ wvegas->queue_delay = q_delay;
-+
-+ if (q_delay >= 2 * wvegas->queue_delay) {
-+ u32 backoff_factor = div_u64(mptcp_wvegas_scale(wvegas->base_rtt, MPTCP_WVEGAS_SCALE), 2 * rtt);
-+ tp->snd_cwnd = ((u64)tp->snd_cwnd * backoff_factor) >> MPTCP_WVEGAS_SCALE;
-+ wvegas->queue_delay = 0;
-+ }
-+ }
-+
-+ if (tp->snd_cwnd < 2)
-+ tp->snd_cwnd = 2;
-+ else if (tp->snd_cwnd > tp->snd_cwnd_clamp)
-+ tp->snd_cwnd = tp->snd_cwnd_clamp;
-+
-+ tp->snd_ssthresh = tcp_current_ssthresh(sk);
-+ }
-+
-+ wvegas->cnt_rtt = 0;
-+ wvegas->sampled_rtt = 0;
-+ }
-+ /* Use normal slow start */
-+ else if (tp->snd_cwnd <= tp->snd_ssthresh)
-+ tcp_slow_start(tp, acked);
-+}
-+
-+
-+static struct tcp_congestion_ops mptcp_wvegas __read_mostly = {
-+ .init = mptcp_wvegas_init,
-+ .ssthresh = tcp_reno_ssthresh,
-+ .cong_avoid = mptcp_wvegas_cong_avoid,
-+ .pkts_acked = mptcp_wvegas_pkts_acked,
-+ .set_state = mptcp_wvegas_state,
-+ .cwnd_event = mptcp_wvegas_cwnd_event,
-+
-+ .owner = THIS_MODULE,
-+ .name = "wvegas",
-+};
-+
-+static int __init mptcp_wvegas_register(void)
-+{
-+ BUILD_BUG_ON(sizeof(struct wvegas) > ICSK_CA_PRIV_SIZE);
-+ tcp_register_congestion_control(&mptcp_wvegas);
-+ return 0;
-+}
-+
-+static void __exit mptcp_wvegas_unregister(void)
-+{
-+ tcp_unregister_congestion_control(&mptcp_wvegas);
-+}
-+
-+module_init(mptcp_wvegas_register);
-+module_exit(mptcp_wvegas_unregister);
-+
-+MODULE_AUTHOR("Yu Cao, Enhuan Dong");
-+MODULE_LICENSE("GPL");
-+MODULE_DESCRIPTION("MPTCP wVegas");
-+MODULE_VERSION("0.1");
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-07 1:28 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-10-07 1:28 UTC (permalink / raw
To: gentoo-commits
commit: f0e24d581e380ceb5a563a6bc0a9e66ad077fe31
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Tue Oct 7 01:28:33 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Tue Oct 7 01:28:33 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=f0e24d58
Add patch to support namespace user.pax.* on tmpfs, bug #470644
---
0000_README | 4 ++++
1500_XATTR_USER_PREFIX.patch | 54 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/0000_README b/0000_README
index 3cc9441..25ca364 100644
--- a/0000_README
+++ b/0000_README
@@ -54,6 +54,10 @@ Patch: 1002_linux-3.16.3.patch
From: http://www.kernel.org
Desc: Linux 3.16.3
+Patch: 1500_XATTR_USER_PREFIX.patch
+From: https://bugs.gentoo.org/show_bug.cgi?id=470644
+Desc: Support for namespace user.pax.* on tmpfs.
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1500_XATTR_USER_PREFIX.patch b/1500_XATTR_USER_PREFIX.patch
new file mode 100644
index 0000000..cc15cd5
--- /dev/null
+++ b/1500_XATTR_USER_PREFIX.patch
@@ -0,0 +1,54 @@
+From: Anthony G. Basile <blueness@gentoo.org>
+
+This patch adds support for a restricted user-controlled namespace on
+tmpfs filesystem used to house PaX flags. The namespace must be of the
+form user.pax.* and its value cannot exceed a size of 8 bytes.
+
+This is needed even on all Gentoo systems so that XATTR_PAX flags
+are preserved for users who might build packages using portage on
+a tmpfs system with a non-hardened kernel and then switch to a
+hardened kernel with XATTR_PAX enabled.
+
+The namespace is added to any user with Extended Attribute support
+enabled for tmpfs. Users who do not enable xattrs will not have
+the XATTR_PAX flags preserved.
+
+diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
+index e4629b9..6958086 100644
+--- a/include/uapi/linux/xattr.h
++++ b/include/uapi/linux/xattr.h
+@@ -63,5 +63,9 @@
+ #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
+ #define XATTR_NAME_POSIX_ACL_DEFAULT XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_DEFAULT
+
++/* User namespace */
++#define XATTR_PAX_PREFIX XATTR_USER_PREFIX "pax."
++#define XATTR_PAX_FLAGS_SUFFIX "flags"
++#define XATTR_NAME_PAX_FLAGS XATTR_PAX_PREFIX XATTR_PAX_FLAGS_SUFFIX
+
+ #endif /* _UAPI_LINUX_XATTR_H */
+diff --git a/mm/shmem.c b/mm/shmem.c
+index 1c44af7..f23bb1b 100644
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -2201,6 +2201,7 @@ static const struct xattr_handler *shmem_xattr_handlers[] = {
+ static int shmem_xattr_validate(const char *name)
+ {
+ struct { const char *prefix; size_t len; } arr[] = {
++ { XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN},
+ { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN },
+ { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN }
+ };
+@@ -2256,6 +2257,12 @@ static int shmem_setxattr(struct dentry *dentry, const char *name,
+ if (err)
+ return err;
+
++ if (!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) {
++ if (strcmp(name, XATTR_NAME_PAX_FLAGS))
++ return -EOPNOTSUPP;
++ if (size > 8)
++ return -EINVAL;
++ }
+ return simple_xattr_set(&info->xattrs, name, value, size, flags);
+ }
+
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-07 1:34 Anthony G. Basile
0 siblings, 0 replies; 26+ messages in thread
From: Anthony G. Basile @ 2014-10-07 1:34 UTC (permalink / raw
To: gentoo-commits
commit: 469245b0b190204e29f395ab73a0c3b5b2ab988f
Author: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
AuthorDate: Tue Oct 7 01:28:33 2014 +0000
Commit: Anthony G. Basile <blueness <AT> gentoo <DOT> org>
CommitDate: Tue Oct 7 01:34:53 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=469245b0
Add patch to support namespace user.pax.* on tmpfs, bug #470644
---
0000_README | 4 ++++
1500_XATTR_USER_PREFIX.patch | 54 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/0000_README b/0000_README
index 3cc9441..25ca364 100644
--- a/0000_README
+++ b/0000_README
@@ -54,6 +54,10 @@ Patch: 1002_linux-3.16.3.patch
From: http://www.kernel.org
Desc: Linux 3.16.3
+Patch: 1500_XATTR_USER_PREFIX.patch
+From: https://bugs.gentoo.org/show_bug.cgi?id=470644
+Desc: Support for namespace user.pax.* on tmpfs.
+
Patch: 2400_kcopy-patch-for-infiniband-driver.patch
From: Alexey Shvetsov <alexxy@gentoo.org>
Desc: Zero copy for infiniband psm userspace driver
diff --git a/1500_XATTR_USER_PREFIX.patch b/1500_XATTR_USER_PREFIX.patch
new file mode 100644
index 0000000..cc15cd5
--- /dev/null
+++ b/1500_XATTR_USER_PREFIX.patch
@@ -0,0 +1,54 @@
+From: Anthony G. Basile <blueness@gentoo.org>
+
+This patch adds support for a restricted user-controlled namespace on
+tmpfs filesystem used to house PaX flags. The namespace must be of the
+form user.pax.* and its value cannot exceed a size of 8 bytes.
+
+This is needed even on all Gentoo systems so that XATTR_PAX flags
+are preserved for users who might build packages using portage on
+a tmpfs system with a non-hardened kernel and then switch to a
+hardened kernel with XATTR_PAX enabled.
+
+The namespace is added to any user with Extended Attribute support
+enabled for tmpfs. Users who do not enable xattrs will not have
+the XATTR_PAX flags preserved.
+
+diff --git a/include/uapi/linux/xattr.h b/include/uapi/linux/xattr.h
+index e4629b9..6958086 100644
+--- a/include/uapi/linux/xattr.h
++++ b/include/uapi/linux/xattr.h
+@@ -63,5 +63,9 @@
+ #define XATTR_POSIX_ACL_DEFAULT "posix_acl_default"
+ #define XATTR_NAME_POSIX_ACL_DEFAULT XATTR_SYSTEM_PREFIX XATTR_POSIX_ACL_DEFAULT
+
++/* User namespace */
++#define XATTR_PAX_PREFIX XATTR_USER_PREFIX "pax."
++#define XATTR_PAX_FLAGS_SUFFIX "flags"
++#define XATTR_NAME_PAX_FLAGS XATTR_PAX_PREFIX XATTR_PAX_FLAGS_SUFFIX
+
+ #endif /* _UAPI_LINUX_XATTR_H */
+diff --git a/mm/shmem.c b/mm/shmem.c
+index 1c44af7..f23bb1b 100644
+--- a/mm/shmem.c
++++ b/mm/shmem.c
+@@ -2201,6 +2201,7 @@ static const struct xattr_handler *shmem_xattr_handlers[] = {
+ static int shmem_xattr_validate(const char *name)
+ {
+ struct { const char *prefix; size_t len; } arr[] = {
++ { XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN},
+ { XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN },
+ { XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN }
+ };
+@@ -2256,6 +2257,12 @@ static int shmem_setxattr(struct dentry *dentry, const char *name,
+ if (err)
+ return err;
+
++ if (!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) {
++ if (strcmp(name, XATTR_NAME_PAX_FLAGS))
++ return -EOPNOTSUPP;
++ if (size > 8)
++ return -EINVAL;
++ }
+ return simple_xattr_set(&info->xattrs, name, value, size, flags);
+ }
+
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-09 19:54 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-09 19:54 UTC (permalink / raw
To: gentoo-commits
commit: 5a7ae131b7b69198d892277ab46031299237a9a6
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Thu Oct 9 19:54:07 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Thu Oct 9 19:54:07 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=5a7ae131
Linux patch 3.16.5
---
0000_README | 8 +
1004_linux-3.16.5.patch | 987 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 995 insertions(+)
diff --git a/0000_README b/0000_README
index 25ca364..ede03f9 100644
--- a/0000_README
+++ b/0000_README
@@ -54,6 +54,14 @@ Patch: 1002_linux-3.16.3.patch
From: http://www.kernel.org
Desc: Linux 3.16.3
+Patch: 1003_linux-3.16.4.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.4
+
+Patch: 1004_linux-3.16.5.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.5
+
Patch: 1500_XATTR_USER_PREFIX.patch
From: https://bugs.gentoo.org/show_bug.cgi?id=470644
Desc: Support for namespace user.pax.* on tmpfs.
diff --git a/1004_linux-3.16.5.patch b/1004_linux-3.16.5.patch
new file mode 100644
index 0000000..248afad
--- /dev/null
+++ b/1004_linux-3.16.5.patch
@@ -0,0 +1,987 @@
+diff --git a/Makefile b/Makefile
+index e75c75f0ec35..41efc3d9f2e0 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 4
++SUBLEVEL = 5
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/ia64/pci/fixup.c b/arch/ia64/pci/fixup.c
+index 1fe9aa5068ea..fc505d58f078 100644
+--- a/arch/ia64/pci/fixup.c
++++ b/arch/ia64/pci/fixup.c
+@@ -6,6 +6,7 @@
+ #include <linux/pci.h>
+ #include <linux/init.h>
+ #include <linux/vgaarb.h>
++#include <linux/screen_info.h>
+
+ #include <asm/machvec.h>
+
+@@ -61,8 +62,7 @@ static void pci_fixup_video(struct pci_dev *pdev)
+ pci_read_config_word(pdev, PCI_COMMAND, &config);
+ if (config & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) {
+ pdev->resource[PCI_ROM_RESOURCE].flags |= IORESOURCE_ROM_SHADOW;
+- dev_printk(KERN_DEBUG, &pdev->dev, "Boot video device\n");
+- vga_set_default_device(pdev);
++ dev_printk(KERN_DEBUG, &pdev->dev, "Video device with shadowed ROM\n");
+ }
+ }
+ }
+diff --git a/arch/x86/include/asm/vga.h b/arch/x86/include/asm/vga.h
+index 44282fbf7bf9..c4b9dc2f67c5 100644
+--- a/arch/x86/include/asm/vga.h
++++ b/arch/x86/include/asm/vga.h
+@@ -17,10 +17,4 @@
+ #define vga_readb(x) (*(x))
+ #define vga_writeb(x, y) (*(y) = (x))
+
+-#ifdef CONFIG_FB_EFI
+-#define __ARCH_HAS_VGA_DEFAULT_DEVICE
+-extern struct pci_dev *vga_default_device(void);
+-extern void vga_set_default_device(struct pci_dev *pdev);
+-#endif
+-
+ #endif /* _ASM_X86_VGA_H */
+diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
+index b5e60268d93f..9a2b7101ae8a 100644
+--- a/arch/x86/pci/fixup.c
++++ b/arch/x86/pci/fixup.c
+@@ -350,8 +350,7 @@ static void pci_fixup_video(struct pci_dev *pdev)
+ pci_read_config_word(pdev, PCI_COMMAND, &config);
+ if (config & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) {
+ pdev->resource[PCI_ROM_RESOURCE].flags |= IORESOURCE_ROM_SHADOW;
+- dev_printk(KERN_DEBUG, &pdev->dev, "Boot video device\n");
+- vga_set_default_device(pdev);
++ dev_printk(KERN_DEBUG, &pdev->dev, "Video device with shadowed ROM\n");
+ }
+ }
+ }
+diff --git a/drivers/cpufreq/integrator-cpufreq.c b/drivers/cpufreq/integrator-cpufreq.c
+index e5122f1bfe78..302eb5c55d01 100644
+--- a/drivers/cpufreq/integrator-cpufreq.c
++++ b/drivers/cpufreq/integrator-cpufreq.c
+@@ -213,9 +213,9 @@ static int __init integrator_cpufreq_probe(struct platform_device *pdev)
+ return cpufreq_register_driver(&integrator_driver);
+ }
+
+-static void __exit integrator_cpufreq_remove(struct platform_device *pdev)
++static int __exit integrator_cpufreq_remove(struct platform_device *pdev)
+ {
+- cpufreq_unregister_driver(&integrator_driver);
++ return cpufreq_unregister_driver(&integrator_driver);
+ }
+
+ static const struct of_device_id integrator_cpufreq_match[] = {
+diff --git a/drivers/cpufreq/pcc-cpufreq.c b/drivers/cpufreq/pcc-cpufreq.c
+index 728a2d879499..4d2c8e861089 100644
+--- a/drivers/cpufreq/pcc-cpufreq.c
++++ b/drivers/cpufreq/pcc-cpufreq.c
+@@ -204,7 +204,6 @@ static int pcc_cpufreq_target(struct cpufreq_policy *policy,
+ u32 input_buffer;
+ int cpu;
+
+- spin_lock(&pcc_lock);
+ cpu = policy->cpu;
+ pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
+
+@@ -216,6 +215,7 @@ static int pcc_cpufreq_target(struct cpufreq_policy *policy,
+ freqs.old = policy->cur;
+ freqs.new = target_freq;
+ cpufreq_freq_transition_begin(policy, &freqs);
++ spin_lock(&pcc_lock);
+
+ input_buffer = 0x1 | (((target_freq * 100)
+ / (ioread32(&pcch_hdr->nominal) * 1000)) << 8);
+diff --git a/drivers/gpu/drm/i915/i915_gem_gtt.c b/drivers/gpu/drm/i915/i915_gem_gtt.c
+index 8b3cde703364..8faabb95cd65 100644
+--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
++++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
+@@ -1297,6 +1297,16 @@ void i915_check_and_clear_faults(struct drm_device *dev)
+ POSTING_READ(RING_FAULT_REG(&dev_priv->ring[RCS]));
+ }
+
++static void i915_ggtt_flush(struct drm_i915_private *dev_priv)
++{
++ if (INTEL_INFO(dev_priv->dev)->gen < 6) {
++ intel_gtt_chipset_flush();
++ } else {
++ I915_WRITE(GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN);
++ POSTING_READ(GFX_FLSH_CNTL_GEN6);
++ }
++}
++
+ void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
+ {
+ struct drm_i915_private *dev_priv = dev->dev_private;
+@@ -1313,6 +1323,8 @@ void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
+ dev_priv->gtt.base.start,
+ dev_priv->gtt.base.total,
+ true);
++
++ i915_ggtt_flush(dev_priv);
+ }
+
+ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
+@@ -1365,7 +1377,7 @@ void i915_gem_restore_gtt_mappings(struct drm_device *dev)
+ gen6_write_pdes(container_of(vm, struct i915_hw_ppgtt, base));
+ }
+
+- i915_gem_chipset_flush(dev);
++ i915_ggtt_flush(dev_priv);
+ }
+
+ int i915_gem_gtt_prepare_object(struct drm_i915_gem_object *obj)
+diff --git a/drivers/gpu/drm/i915/intel_opregion.c b/drivers/gpu/drm/i915/intel_opregion.c
+index 4f6b53998d79..b9135dc3fe5d 100644
+--- a/drivers/gpu/drm/i915/intel_opregion.c
++++ b/drivers/gpu/drm/i915/intel_opregion.c
+@@ -395,6 +395,16 @@ int intel_opregion_notify_adapter(struct drm_device *dev, pci_power_t state)
+ return -EINVAL;
+ }
+
++/*
++ * If the vendor backlight interface is not in use and ACPI backlight interface
++ * is broken, do not bother processing backlight change requests from firmware.
++ */
++static bool should_ignore_backlight_request(void)
++{
++ return acpi_video_backlight_support() &&
++ !acpi_video_verify_backlight_support();
++}
++
+ static u32 asle_set_backlight(struct drm_device *dev, u32 bclp)
+ {
+ struct drm_i915_private *dev_priv = dev->dev_private;
+@@ -403,11 +413,7 @@ static u32 asle_set_backlight(struct drm_device *dev, u32 bclp)
+
+ DRM_DEBUG_DRIVER("bclp = 0x%08x\n", bclp);
+
+- /*
+- * If the acpi_video interface is not supposed to be used, don't
+- * bother processing backlight level change requests from firmware.
+- */
+- if (!acpi_video_verify_backlight_support()) {
++ if (should_ignore_backlight_request()) {
+ DRM_DEBUG_KMS("opregion backlight request ignored\n");
+ return 0;
+ }
+diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
+index af0259708358..366641d0483f 100644
+--- a/drivers/gpu/vga/vgaarb.c
++++ b/drivers/gpu/vga/vgaarb.c
+@@ -41,6 +41,7 @@
+ #include <linux/poll.h>
+ #include <linux/miscdevice.h>
+ #include <linux/slab.h>
++#include <linux/screen_info.h>
+
+ #include <linux/uaccess.h>
+
+@@ -580,8 +581,11 @@ static bool vga_arbiter_add_pci_device(struct pci_dev *pdev)
+ */
+ #ifndef __ARCH_HAS_VGA_DEFAULT_DEVICE
+ if (vga_default == NULL &&
+- ((vgadev->owns & VGA_RSRC_LEGACY_MASK) == VGA_RSRC_LEGACY_MASK))
++ ((vgadev->owns & VGA_RSRC_LEGACY_MASK) == VGA_RSRC_LEGACY_MASK)) {
++ pr_info("vgaarb: setting as boot device: PCI:%s\n",
++ pci_name(pdev));
+ vga_set_default_device(pdev);
++ }
+ #endif
+
+ vga_arbiter_check_bridge_sharing(vgadev);
+@@ -1316,6 +1320,38 @@ static int __init vga_arb_device_init(void)
+ pr_info("vgaarb: loaded\n");
+
+ list_for_each_entry(vgadev, &vga_list, list) {
++#if defined(CONFIG_X86) || defined(CONFIG_IA64)
++ /* Override I/O based detection done by vga_arbiter_add_pci_device()
++ * as it may take the wrong device (e.g. on Apple system under EFI).
++ *
++ * Select the device owning the boot framebuffer if there is one.
++ */
++ resource_size_t start, end;
++ int i;
++
++ /* Does firmware framebuffer belong to us? */
++ for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
++ if (!(pci_resource_flags(vgadev->pdev, i) & IORESOURCE_MEM))
++ continue;
++
++ start = pci_resource_start(vgadev->pdev, i);
++ end = pci_resource_end(vgadev->pdev, i);
++
++ if (!start || !end)
++ continue;
++
++ if (screen_info.lfb_base < start ||
++ (screen_info.lfb_base + screen_info.lfb_size) >= end)
++ continue;
++ if (!vga_default_device())
++ pr_info("vgaarb: setting as boot device: PCI:%s\n",
++ pci_name(vgadev->pdev));
++ else if (vgadev->pdev != vga_default_device())
++ pr_info("vgaarb: overriding boot device: PCI:%s\n",
++ pci_name(vgadev->pdev));
++ vga_set_default_device(vgadev->pdev);
++ }
++#endif
+ if (vgadev->bridge_has_one_vga)
+ pr_info("vgaarb: bridge control possible %s\n", pci_name(vgadev->pdev));
+ else
+diff --git a/drivers/i2c/busses/i2c-qup.c b/drivers/i2c/busses/i2c-qup.c
+index 2a5efb5b487c..eb47c98131ec 100644
+--- a/drivers/i2c/busses/i2c-qup.c
++++ b/drivers/i2c/busses/i2c-qup.c
+@@ -670,16 +670,20 @@ static int qup_i2c_probe(struct platform_device *pdev)
+ qup->adap.dev.of_node = pdev->dev.of_node;
+ strlcpy(qup->adap.name, "QUP I2C adapter", sizeof(qup->adap.name));
+
+- ret = i2c_add_adapter(&qup->adap);
+- if (ret)
+- goto fail;
+-
+ pm_runtime_set_autosuspend_delay(qup->dev, MSEC_PER_SEC);
+ pm_runtime_use_autosuspend(qup->dev);
+ pm_runtime_set_active(qup->dev);
+ pm_runtime_enable(qup->dev);
++
++ ret = i2c_add_adapter(&qup->adap);
++ if (ret)
++ goto fail_runtime;
++
+ return 0;
+
++fail_runtime:
++ pm_runtime_disable(qup->dev);
++ pm_runtime_set_suspended(qup->dev);
+ fail:
+ qup_i2c_disable_clocks(qup);
+ return ret;
+diff --git a/drivers/i2c/busses/i2c-rk3x.c b/drivers/i2c/busses/i2c-rk3x.c
+index 93cfc837200b..b38b0529946a 100644
+--- a/drivers/i2c/busses/i2c-rk3x.c
++++ b/drivers/i2c/busses/i2c-rk3x.c
+@@ -238,7 +238,7 @@ static void rk3x_i2c_fill_transmit_buf(struct rk3x_i2c *i2c)
+ for (i = 0; i < 8; ++i) {
+ val = 0;
+ for (j = 0; j < 4; ++j) {
+- if (i2c->processed == i2c->msg->len)
++ if ((i2c->processed == i2c->msg->len) && (cnt != 0))
+ break;
+
+ if (i2c->processed == 0 && cnt == 0)
+diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
+index 183588b11fc1..9f0fbecd1eb5 100644
+--- a/drivers/md/raid5.c
++++ b/drivers/md/raid5.c
+@@ -64,6 +64,10 @@
+ #define cpu_to_group(cpu) cpu_to_node(cpu)
+ #define ANY_GROUP NUMA_NO_NODE
+
++static bool devices_handle_discard_safely = false;
++module_param(devices_handle_discard_safely, bool, 0644);
++MODULE_PARM_DESC(devices_handle_discard_safely,
++ "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
+ static struct workqueue_struct *raid5_wq;
+ /*
+ * Stripe cache
+@@ -6208,7 +6212,7 @@ static int run(struct mddev *mddev)
+ mddev->queue->limits.discard_granularity = stripe;
+ /*
+ * unaligned part of discard request will be ignored, so can't
+- * guarantee discard_zerors_data
++ * guarantee discard_zeroes_data
+ */
+ mddev->queue->limits.discard_zeroes_data = 0;
+
+@@ -6233,6 +6237,18 @@ static int run(struct mddev *mddev)
+ !bdev_get_queue(rdev->bdev)->
+ limits.discard_zeroes_data)
+ discard_supported = false;
++ /* Unfortunately, discard_zeroes_data is not currently
++ * a guarantee - just a hint. So we only allow DISCARD
++ * if the sysadmin has confirmed that only safe devices
++ * are in use by setting a module parameter.
++ */
++ if (!devices_handle_discard_safely) {
++ if (discard_supported) {
++ pr_info("md/raid456: discard support disabled due to uncertainty.\n");
++ pr_info("Set raid456.devices_handle_discard_safely=Y to override.\n");
++ }
++ discard_supported = false;
++ }
+ }
+
+ if (discard_supported &&
+diff --git a/drivers/media/v4l2-core/videobuf2-core.c b/drivers/media/v4l2-core/videobuf2-core.c
+index dcdceae30ab0..a946523772d6 100644
+--- a/drivers/media/v4l2-core/videobuf2-core.c
++++ b/drivers/media/v4l2-core/videobuf2-core.c
+@@ -967,6 +967,7 @@ static int __reqbufs(struct vb2_queue *q, struct v4l2_requestbuffers *req)
+ * to the userspace.
+ */
+ req->count = allocated_buffers;
++ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
+
+ return 0;
+ }
+@@ -1014,6 +1015,7 @@ static int __create_bufs(struct vb2_queue *q, struct v4l2_create_buffers *create
+ memset(q->plane_sizes, 0, sizeof(q->plane_sizes));
+ memset(q->alloc_ctx, 0, sizeof(q->alloc_ctx));
+ q->memory = create->memory;
++ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
+ }
+
+ num_buffers = min(create->count, VIDEO_MAX_FRAME - q->num_buffers);
+@@ -1812,6 +1814,7 @@ static int vb2_internal_qbuf(struct vb2_queue *q, struct v4l2_buffer *b)
+ */
+ list_add_tail(&vb->queued_entry, &q->queued_list);
+ q->queued_count++;
++ q->waiting_for_buffers = false;
+ vb->state = VB2_BUF_STATE_QUEUED;
+ if (V4L2_TYPE_IS_OUTPUT(q->type)) {
+ /*
+@@ -2244,6 +2247,7 @@ static int vb2_internal_streamoff(struct vb2_queue *q, enum v4l2_buf_type type)
+ * their normal dequeued state.
+ */
+ __vb2_queue_cancel(q);
++ q->waiting_for_buffers = !V4L2_TYPE_IS_OUTPUT(q->type);
+
+ dprintk(3, "successful\n");
+ return 0;
+@@ -2562,9 +2566,16 @@ unsigned int vb2_poll(struct vb2_queue *q, struct file *file, poll_table *wait)
+ }
+
+ /*
+- * There is nothing to wait for if no buffers have already been queued.
++ * There is nothing to wait for if the queue isn't streaming.
+ */
+- if (list_empty(&q->queued_list))
++ if (!vb2_is_streaming(q))
++ return res | POLLERR;
++ /*
++ * For compatibility with vb1: if QBUF hasn't been called yet, then
++ * return POLLERR as well. This only affects capture queues, output
++ * queues will always initialize waiting_for_buffers to false.
++ */
++ if (q->waiting_for_buffers)
+ return res | POLLERR;
+
+ if (list_empty(&q->done_list))
+diff --git a/drivers/usb/storage/uas-detect.h b/drivers/usb/storage/uas-detect.h
+index bb05b984d5f6..8a6f371ed6e7 100644
+--- a/drivers/usb/storage/uas-detect.h
++++ b/drivers/usb/storage/uas-detect.h
+@@ -9,32 +9,15 @@ static int uas_is_interface(struct usb_host_interface *intf)
+ intf->desc.bInterfaceProtocol == USB_PR_UAS);
+ }
+
+-static int uas_isnt_supported(struct usb_device *udev)
+-{
+- struct usb_hcd *hcd = bus_to_hcd(udev->bus);
+-
+- dev_warn(&udev->dev, "The driver for the USB controller %s does not "
+- "support scatter-gather which is\n",
+- hcd->driver->description);
+- dev_warn(&udev->dev, "required by the UAS driver. Please try an"
+- "alternative USB controller if you wish to use UAS.\n");
+- return -ENODEV;
+-}
+-
+ static int uas_find_uas_alt_setting(struct usb_interface *intf)
+ {
+ int i;
+- struct usb_device *udev = interface_to_usbdev(intf);
+- int sg_supported = udev->bus->sg_tablesize != 0;
+
+ for (i = 0; i < intf->num_altsetting; i++) {
+ struct usb_host_interface *alt = &intf->altsetting[i];
+
+- if (uas_is_interface(alt)) {
+- if (!sg_supported)
+- return uas_isnt_supported(udev);
++ if (uas_is_interface(alt))
+ return alt->desc.bAlternateSetting;
+- }
+ }
+
+ return -ENODEV;
+@@ -76,13 +59,6 @@ static int uas_use_uas_driver(struct usb_interface *intf,
+ unsigned long flags = id->driver_info;
+ int r, alt;
+
+- usb_stor_adjust_quirks(udev, &flags);
+-
+- if (flags & US_FL_IGNORE_UAS)
+- return 0;
+-
+- if (udev->speed >= USB_SPEED_SUPER && !hcd->can_do_streams)
+- return 0;
+
+ alt = uas_find_uas_alt_setting(intf);
+ if (alt < 0)
+@@ -92,5 +68,46 @@ static int uas_use_uas_driver(struct usb_interface *intf,
+ if (r < 0)
+ return 0;
+
++ /*
++ * ASM1051 and older ASM1053 devices have the same usb-id, and UAS is
++ * broken on the ASM1051, use the number of streams to differentiate.
++ * New ASM1053-s also support 32 streams, but have a different prod-id.
++ */
++ if (le16_to_cpu(udev->descriptor.idVendor) == 0x174c &&
++ le16_to_cpu(udev->descriptor.idProduct) == 0x55aa) {
++ if (udev->speed < USB_SPEED_SUPER) {
++ /* No streams info, assume ASM1051 */
++ flags |= US_FL_IGNORE_UAS;
++ } else if (usb_ss_max_streams(&eps[1]->ss_ep_comp) == 32) {
++ flags |= US_FL_IGNORE_UAS;
++ }
++ }
++
++ usb_stor_adjust_quirks(udev, &flags);
++
++ if (flags & US_FL_IGNORE_UAS) {
++ dev_warn(&udev->dev,
++ "UAS is blacklisted for this device, using usb-storage instead\n");
++ return 0;
++ }
++
++ if (udev->bus->sg_tablesize == 0) {
++ dev_warn(&udev->dev,
++ "The driver for the USB controller %s does not support scatter-gather which is\n",
++ hcd->driver->description);
++ dev_warn(&udev->dev,
++ "required by the UAS driver. Please try an other USB controller if you wish to use UAS.\n");
++ return 0;
++ }
++
++ if (udev->speed >= USB_SPEED_SUPER && !hcd->can_do_streams) {
++ dev_warn(&udev->dev,
++ "USB controller %s does not support streams, which are required by the UAS driver.\n",
++ hcd_to_bus(hcd)->bus_name);
++ dev_warn(&udev->dev,
++ "Please try an other USB controller if you wish to use UAS.\n");
++ return 0;
++ }
++
+ return 1;
+ }
+diff --git a/drivers/video/fbdev/efifb.c b/drivers/video/fbdev/efifb.c
+index ae9618ff6735..982f6abe6faf 100644
+--- a/drivers/video/fbdev/efifb.c
++++ b/drivers/video/fbdev/efifb.c
+@@ -19,8 +19,6 @@
+
+ static bool request_mem_succeeded = false;
+
+-static struct pci_dev *default_vga;
+-
+ static struct fb_var_screeninfo efifb_defined = {
+ .activate = FB_ACTIVATE_NOW,
+ .height = -1,
+@@ -84,23 +82,10 @@ static struct fb_ops efifb_ops = {
+ .fb_imageblit = cfb_imageblit,
+ };
+
+-struct pci_dev *vga_default_device(void)
+-{
+- return default_vga;
+-}
+-
+-EXPORT_SYMBOL_GPL(vga_default_device);
+-
+-void vga_set_default_device(struct pci_dev *pdev)
+-{
+- default_vga = pdev;
+-}
+-
+ static int efifb_setup(char *options)
+ {
+ char *this_opt;
+ int i;
+- struct pci_dev *dev = NULL;
+
+ if (options && *options) {
+ while ((this_opt = strsep(&options, ",")) != NULL) {
+@@ -126,30 +111,6 @@ static int efifb_setup(char *options)
+ }
+ }
+
+- for_each_pci_dev(dev) {
+- int i;
+-
+- if ((dev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
+- continue;
+-
+- for (i=0; i < DEVICE_COUNT_RESOURCE; i++) {
+- resource_size_t start, end;
+-
+- if (!(pci_resource_flags(dev, i) & IORESOURCE_MEM))
+- continue;
+-
+- start = pci_resource_start(dev, i);
+- end = pci_resource_end(dev, i);
+-
+- if (!start || !end)
+- continue;
+-
+- if (screen_info.lfb_base >= start &&
+- (screen_info.lfb_base + screen_info.lfb_size) < end)
+- default_vga = dev;
+- }
+- }
+-
+ return 0;
+ }
+
+diff --git a/fs/cifs/smb1ops.c b/fs/cifs/smb1ops.c
+index 84ca0a4caaeb..e9ad8d37bb00 100644
+--- a/fs/cifs/smb1ops.c
++++ b/fs/cifs/smb1ops.c
+@@ -586,7 +586,7 @@ cifs_query_path_info(const unsigned int xid, struct cifs_tcon *tcon,
+ tmprc = CIFS_open(xid, &oparms, &oplock, NULL);
+ if (tmprc == -EOPNOTSUPP)
+ *symlink = true;
+- else
++ else if (tmprc == 0)
+ CIFSSMBClose(xid, tcon, fid.netfid);
+ }
+
+diff --git a/fs/cifs/smb2maperror.c b/fs/cifs/smb2maperror.c
+index a689514e260f..a491814cb2c0 100644
+--- a/fs/cifs/smb2maperror.c
++++ b/fs/cifs/smb2maperror.c
+@@ -256,6 +256,8 @@ static const struct status_to_posix_error smb2_error_map_table[] = {
+ {STATUS_DLL_MIGHT_BE_INCOMPATIBLE, -EIO,
+ "STATUS_DLL_MIGHT_BE_INCOMPATIBLE"},
+ {STATUS_STOPPED_ON_SYMLINK, -EOPNOTSUPP, "STATUS_STOPPED_ON_SYMLINK"},
++ {STATUS_IO_REPARSE_TAG_NOT_HANDLED, -EOPNOTSUPP,
++ "STATUS_REPARSE_NOT_HANDLED"},
+ {STATUS_DEVICE_REQUIRES_CLEANING, -EIO,
+ "STATUS_DEVICE_REQUIRES_CLEANING"},
+ {STATUS_DEVICE_DOOR_OPEN, -EIO, "STATUS_DEVICE_DOOR_OPEN"},
+diff --git a/fs/udf/inode.c b/fs/udf/inode.c
+index 236cd48184c2..a932f7740b51 100644
+--- a/fs/udf/inode.c
++++ b/fs/udf/inode.c
+@@ -1271,13 +1271,22 @@ update_time:
+ return 0;
+ }
+
++/*
++ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
++ * arbitrary - just that we hopefully don't limit any real use of rewritten
++ * inode on write-once media but avoid looping for too long on corrupted media.
++ */
++#define UDF_MAX_ICB_NESTING 1024
++
+ static void __udf_read_inode(struct inode *inode)
+ {
+ struct buffer_head *bh = NULL;
+ struct fileEntry *fe;
+ uint16_t ident;
+ struct udf_inode_info *iinfo = UDF_I(inode);
++ unsigned int indirections = 0;
+
++reread:
+ /*
+ * Set defaults, but the inode is still incomplete!
+ * Note: get_new_inode() sets the following on a new inode:
+@@ -1314,28 +1323,26 @@ static void __udf_read_inode(struct inode *inode)
+ ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
+ &ident);
+ if (ident == TAG_IDENT_IE && ibh) {
+- struct buffer_head *nbh = NULL;
+ struct kernel_lb_addr loc;
+ struct indirectEntry *ie;
+
+ ie = (struct indirectEntry *)ibh->b_data;
+ loc = lelb_to_cpu(ie->indirectICB.extLocation);
+
+- if (ie->indirectICB.extLength &&
+- (nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
+- &ident))) {
+- if (ident == TAG_IDENT_FE ||
+- ident == TAG_IDENT_EFE) {
+- memcpy(&iinfo->i_location,
+- &loc,
+- sizeof(struct kernel_lb_addr));
+- brelse(bh);
+- brelse(ibh);
+- brelse(nbh);
+- __udf_read_inode(inode);
++ if (ie->indirectICB.extLength) {
++ brelse(bh);
++ brelse(ibh);
++ memcpy(&iinfo->i_location, &loc,
++ sizeof(struct kernel_lb_addr));
++ if (++indirections > UDF_MAX_ICB_NESTING) {
++ udf_err(inode->i_sb,
++ "too many ICBs in ICB hierarchy"
++ " (max %d supported)\n",
++ UDF_MAX_ICB_NESTING);
++ make_bad_inode(inode);
+ return;
+ }
+- brelse(nbh);
++ goto reread;
+ }
+ }
+ brelse(ibh);
+diff --git a/include/linux/jiffies.h b/include/linux/jiffies.h
+index 1f44466c1e9d..c367cbdf73ab 100644
+--- a/include/linux/jiffies.h
++++ b/include/linux/jiffies.h
+@@ -258,23 +258,11 @@ extern unsigned long preset_lpj;
+ #define SEC_JIFFIE_SC (32 - SHIFT_HZ)
+ #endif
+ #define NSEC_JIFFIE_SC (SEC_JIFFIE_SC + 29)
+-#define USEC_JIFFIE_SC (SEC_JIFFIE_SC + 19)
+ #define SEC_CONVERSION ((unsigned long)((((u64)NSEC_PER_SEC << SEC_JIFFIE_SC) +\
+ TICK_NSEC -1) / (u64)TICK_NSEC))
+
+ #define NSEC_CONVERSION ((unsigned long)((((u64)1 << NSEC_JIFFIE_SC) +\
+ TICK_NSEC -1) / (u64)TICK_NSEC))
+-#define USEC_CONVERSION \
+- ((unsigned long)((((u64)NSEC_PER_USEC << USEC_JIFFIE_SC) +\
+- TICK_NSEC -1) / (u64)TICK_NSEC))
+-/*
+- * USEC_ROUND is used in the timeval to jiffie conversion. See there
+- * for more details. It is the scaled resolution rounding value. Note
+- * that it is a 64-bit value. Since, when it is applied, we are already
+- * in jiffies (albit scaled), it is nothing but the bits we will shift
+- * off.
+- */
+-#define USEC_ROUND (u64)(((u64)1 << USEC_JIFFIE_SC) - 1)
+ /*
+ * The maximum jiffie value is (MAX_INT >> 1). Here we translate that
+ * into seconds. The 64-bit case will overflow if we are not careful,
+diff --git a/include/media/videobuf2-core.h b/include/media/videobuf2-core.h
+index 8fab6fa0dbfb..d6f010c17f4a 100644
+--- a/include/media/videobuf2-core.h
++++ b/include/media/videobuf2-core.h
+@@ -375,6 +375,9 @@ struct v4l2_fh;
+ * @streaming: current streaming state
+ * @start_streaming_called: start_streaming() was called successfully and we
+ * started streaming.
++ * @waiting_for_buffers: used in poll() to check if vb2 is still waiting for
++ * buffers. Only set for capture queues if qbuf has not yet been
++ * called since poll() needs to return POLLERR in that situation.
+ * @fileio: file io emulator internal data, used only if emulator is active
+ * @threadio: thread io internal data, used only if thread is active
+ */
+@@ -411,6 +414,7 @@ struct vb2_queue {
+
+ unsigned int streaming:1;
+ unsigned int start_streaming_called:1;
++ unsigned int waiting_for_buffers:1;
+
+ struct vb2_fileio_data *fileio;
+ struct vb2_threadio_data *threadio;
+diff --git a/init/Kconfig b/init/Kconfig
+index 9d76b99af1b9..35685a46e4da 100644
+--- a/init/Kconfig
++++ b/init/Kconfig
+@@ -1432,6 +1432,7 @@ config FUTEX
+
+ config HAVE_FUTEX_CMPXCHG
+ bool
++ depends on FUTEX
+ help
+ Architectures should select this if futex_atomic_cmpxchg_inatomic()
+ is implemented and always working. This removes a couple of runtime
+diff --git a/kernel/events/core.c b/kernel/events/core.c
+index f626c9f1f3c0..2065959042ea 100644
+--- a/kernel/events/core.c
++++ b/kernel/events/core.c
+@@ -7921,8 +7921,10 @@ int perf_event_init_task(struct task_struct *child)
+
+ for_each_task_context_nr(ctxn) {
+ ret = perf_event_init_context(child, ctxn);
+- if (ret)
++ if (ret) {
++ perf_event_free_task(child);
+ return ret;
++ }
+ }
+
+ return 0;
+diff --git a/kernel/fork.c b/kernel/fork.c
+index 6a13c46cd87d..b41958b0cb67 100644
+--- a/kernel/fork.c
++++ b/kernel/fork.c
+@@ -1326,7 +1326,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
+ goto bad_fork_cleanup_policy;
+ retval = audit_alloc(p);
+ if (retval)
+- goto bad_fork_cleanup_policy;
++ goto bad_fork_cleanup_perf;
+ /* copy all the process information */
+ retval = copy_semundo(clone_flags, p);
+ if (retval)
+@@ -1525,8 +1525,9 @@ bad_fork_cleanup_semundo:
+ exit_sem(p);
+ bad_fork_cleanup_audit:
+ audit_free(p);
+-bad_fork_cleanup_policy:
++bad_fork_cleanup_perf:
+ perf_event_free_task(p);
++bad_fork_cleanup_policy:
+ #ifdef CONFIG_NUMA
+ mpol_put(p->mempolicy);
+ bad_fork_cleanup_threadgroup_lock:
+diff --git a/kernel/time.c b/kernel/time.c
+index 7c7964c33ae7..3c49ab45f822 100644
+--- a/kernel/time.c
++++ b/kernel/time.c
+@@ -496,17 +496,20 @@ EXPORT_SYMBOL(usecs_to_jiffies);
+ * that a remainder subtract here would not do the right thing as the
+ * resolution values don't fall on second boundries. I.e. the line:
+ * nsec -= nsec % TICK_NSEC; is NOT a correct resolution rounding.
++ * Note that due to the small error in the multiplier here, this
++ * rounding is incorrect for sufficiently large values of tv_nsec, but
++ * well formed timespecs should have tv_nsec < NSEC_PER_SEC, so we're
++ * OK.
+ *
+ * Rather, we just shift the bits off the right.
+ *
+ * The >> (NSEC_JIFFIE_SC - SEC_JIFFIE_SC) converts the scaled nsec
+ * value to a scaled second value.
+ */
+-unsigned long
+-timespec_to_jiffies(const struct timespec *value)
++static unsigned long
++__timespec_to_jiffies(unsigned long sec, long nsec)
+ {
+- unsigned long sec = value->tv_sec;
+- long nsec = value->tv_nsec + TICK_NSEC - 1;
++ nsec = nsec + TICK_NSEC - 1;
+
+ if (sec >= MAX_SEC_IN_JIFFIES){
+ sec = MAX_SEC_IN_JIFFIES;
+@@ -517,6 +520,13 @@ timespec_to_jiffies(const struct timespec *value)
+ (NSEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
+
+ }
++
++unsigned long
++timespec_to_jiffies(const struct timespec *value)
++{
++ return __timespec_to_jiffies(value->tv_sec, value->tv_nsec);
++}
++
+ EXPORT_SYMBOL(timespec_to_jiffies);
+
+ void
+@@ -533,31 +543,27 @@ jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
+ }
+ EXPORT_SYMBOL(jiffies_to_timespec);
+
+-/* Same for "timeval"
++/*
++ * We could use a similar algorithm to timespec_to_jiffies (with a
++ * different multiplier for usec instead of nsec). But this has a
++ * problem with rounding: we can't exactly add TICK_NSEC - 1 to the
++ * usec value, since it's not necessarily integral.
+ *
+- * Well, almost. The problem here is that the real system resolution is
+- * in nanoseconds and the value being converted is in micro seconds.
+- * Also for some machines (those that use HZ = 1024, in-particular),
+- * there is a LARGE error in the tick size in microseconds.
+-
+- * The solution we use is to do the rounding AFTER we convert the
+- * microsecond part. Thus the USEC_ROUND, the bits to be shifted off.
+- * Instruction wise, this should cost only an additional add with carry
+- * instruction above the way it was done above.
++ * We could instead round in the intermediate scaled representation
++ * (i.e. in units of 1/2^(large scale) jiffies) but that's also
++ * perilous: the scaling introduces a small positive error, which
++ * combined with a division-rounding-upward (i.e. adding 2^(scale) - 1
++ * units to the intermediate before shifting) leads to accidental
++ * overflow and overestimates.
++ *
++ * At the cost of one additional multiplication by a constant, just
++ * use the timespec implementation.
+ */
+ unsigned long
+ timeval_to_jiffies(const struct timeval *value)
+ {
+- unsigned long sec = value->tv_sec;
+- long usec = value->tv_usec;
+-
+- if (sec >= MAX_SEC_IN_JIFFIES){
+- sec = MAX_SEC_IN_JIFFIES;
+- usec = 0;
+- }
+- return (((u64)sec * SEC_CONVERSION) +
+- (((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
+- (USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
++ return __timespec_to_jiffies(value->tv_sec,
++ value->tv_usec * NSEC_PER_USEC);
+ }
+ EXPORT_SYMBOL(timeval_to_jiffies);
+
+diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
+index 2ff0580d3dcd..51862982e1e9 100644
+--- a/kernel/trace/ring_buffer.c
++++ b/kernel/trace/ring_buffer.c
+@@ -3375,7 +3375,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter)
+ iter->head = cpu_buffer->reader_page->read;
+
+ iter->cache_reader_page = iter->head_page;
+- iter->cache_read = iter->head;
++ iter->cache_read = cpu_buffer->read;
+
+ if (iter->head)
+ iter->read_stamp = cpu_buffer->read_stamp;
+diff --git a/mm/huge_memory.c b/mm/huge_memory.c
+index 33514d88fef9..c9ef81e08e4a 100644
+--- a/mm/huge_memory.c
++++ b/mm/huge_memory.c
+@@ -1775,21 +1775,24 @@ static int __split_huge_page_map(struct page *page,
+ if (pmd) {
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
++ if (pmd_write(*pmd))
++ BUG_ON(page_mapcount(page) != 1);
+
+ haddr = address;
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ BUG_ON(PageCompound(page+i));
++ /*
++ * Note that pmd_numa is not transferred deliberately
++ * to avoid any possibility that pte_numa leaks to
++ * a PROT_NONE VMA by accident.
++ */
+ entry = mk_pte(page + i, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (!pmd_write(*pmd))
+ entry = pte_wrprotect(entry);
+- else
+- BUG_ON(page_mapcount(page) != 1);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+- if (pmd_numa(*pmd))
+- entry = pte_mknuma(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+diff --git a/mm/memcontrol.c b/mm/memcontrol.c
+index 1f14a430c656..15fe66d83987 100644
+--- a/mm/memcontrol.c
++++ b/mm/memcontrol.c
+@@ -292,6 +292,9 @@ struct mem_cgroup {
+ /* vmpressure notifications */
+ struct vmpressure vmpressure;
+
++ /* css_online() has been completed */
++ int initialized;
++
+ /*
+ * the counter to account for mem+swap usage.
+ */
+@@ -1106,10 +1109,21 @@ skip_node:
+ * skipping css reference should be safe.
+ */
+ if (next_css) {
+- if ((next_css == &root->css) ||
+- ((next_css->flags & CSS_ONLINE) &&
+- css_tryget_online(next_css)))
+- return mem_cgroup_from_css(next_css);
++ struct mem_cgroup *memcg = mem_cgroup_from_css(next_css);
++
++ if (next_css == &root->css)
++ return memcg;
++
++ if (css_tryget_online(next_css)) {
++ /*
++ * Make sure the memcg is initialized:
++ * mem_cgroup_css_online() orders the the
++ * initialization against setting the flag.
++ */
++ if (smp_load_acquire(&memcg->initialized))
++ return memcg;
++ css_put(next_css);
++ }
+
+ prev_css = next_css;
+ goto skip_node;
+@@ -6277,6 +6291,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
+ {
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct mem_cgroup *parent = mem_cgroup_from_css(css->parent);
++ int ret;
+
+ if (css->id > MEM_CGROUP_ID_MAX)
+ return -ENOSPC;
+@@ -6313,7 +6328,18 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
+ }
+ mutex_unlock(&memcg_create_mutex);
+
+- return memcg_init_kmem(memcg, &memory_cgrp_subsys);
++ ret = memcg_init_kmem(memcg, &memory_cgrp_subsys);
++ if (ret)
++ return ret;
++
++ /*
++ * Make sure the memcg is initialized: mem_cgroup_iter()
++ * orders reading memcg->initialized against its callers
++ * reading the memcg members.
++ */
++ smp_store_release(&memcg->initialized, 1);
++
++ return 0;
+ }
+
+ /*
+diff --git a/mm/migrate.c b/mm/migrate.c
+index be6dbf995c0c..0bba97914af0 100644
+--- a/mm/migrate.c
++++ b/mm/migrate.c
+@@ -146,8 +146,11 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+ pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
+ if (pte_swp_soft_dirty(*ptep))
+ pte = pte_mksoft_dirty(pte);
++
++ /* Recheck VMA as permissions can change since migration started */
+ if (is_write_migration_entry(entry))
+- pte = pte_mkwrite(pte);
++ pte = maybe_mkwrite(pte, vma);
++
+ #ifdef CONFIG_HUGETLB_PAGE
+ if (PageHuge(new)) {
+ pte = pte_mkhuge(pte);
+diff --git a/sound/soc/codecs/ssm2602.c b/sound/soc/codecs/ssm2602.c
+index 97b0454eb346..eb1bb7414b8b 100644
+--- a/sound/soc/codecs/ssm2602.c
++++ b/sound/soc/codecs/ssm2602.c
+@@ -647,7 +647,7 @@ int ssm2602_probe(struct device *dev, enum ssm2602_type type,
+ return -ENOMEM;
+
+ dev_set_drvdata(dev, ssm2602);
+- ssm2602->type = SSM2602;
++ ssm2602->type = type;
+ ssm2602->regmap = regmap;
+
+ return snd_soc_register_codec(dev, &soc_codec_dev_ssm2602,
+diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
+index b87d7d882e6d..49acc989e452 100644
+--- a/sound/soc/soc-core.c
++++ b/sound/soc/soc-core.c
+@@ -3181,7 +3181,7 @@ int snd_soc_bytes_put(struct snd_kcontrol *kcontrol,
+ unsigned int val, mask;
+ void *data;
+
+- if (!component->regmap)
++ if (!component->regmap || !params->num_regs)
+ return -EINVAL;
+
+ len = params->num_regs * component->val_bytes;
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-15 12:42 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-15 12:42 UTC (permalink / raw
To: gentoo-commits
commit: 10da52d34f75c039a20e3e60cb9dc3e05bc1cbb7
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Wed Oct 15 12:42:37 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Wed Oct 15 12:42:37 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=10da52d3
Linux patch 3.16.6
---
0000_README | 4 +
1005_linux-3.16.6.patch | 2652 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 2656 insertions(+)
diff --git a/0000_README b/0000_README
index ede03f9..a7526a7 100644
--- a/0000_README
+++ b/0000_README
@@ -62,6 +62,10 @@ Patch: 1004_linux-3.16.5.patch
From: http://www.kernel.org
Desc: Linux 3.16.5
+Patch: 1005_linux-3.16.6.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.6
+
Patch: 1500_XATTR_USER_PREFIX.patch
From: https://bugs.gentoo.org/show_bug.cgi?id=470644
Desc: Support for namespace user.pax.* on tmpfs.
diff --git a/1005_linux-3.16.6.patch b/1005_linux-3.16.6.patch
new file mode 100644
index 0000000..422fde0
--- /dev/null
+++ b/1005_linux-3.16.6.patch
@@ -0,0 +1,2652 @@
+diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
+index f896f68a3ba3..c4da64b525b2 100644
+--- a/Documentation/kernel-parameters.txt
++++ b/Documentation/kernel-parameters.txt
+@@ -3459,6 +3459,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
+ READ_DISC_INFO command);
+ e = NO_READ_CAPACITY_16 (don't use
+ READ_CAPACITY_16 command);
++ f = NO_REPORT_OPCODES (don't use report opcodes
++ command, uas only);
+ h = CAPACITY_HEURISTICS (decrease the
+ reported device capacity by one
+ sector if the number is odd);
+@@ -3478,6 +3480,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
+ bogus residue values);
+ s = SINGLE_LUN (the device has only one
+ Logical Unit);
++ t = NO_ATA_1X (don't allow ATA(12) and ATA(16)
++ commands, uas only);
+ u = IGNORE_UAS (don't bind to the uas driver);
+ w = NO_WP_DETECT (don't test whether the
+ medium is write-protected).
+diff --git a/Makefile b/Makefile
+index 41efc3d9f2e0..5c4bc3fc18c0 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 5
++SUBLEVEL = 6
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/drivers/base/node.c b/drivers/base/node.c
+index 8f7ed9933a7c..40e4585f110a 100644
+--- a/drivers/base/node.c
++++ b/drivers/base/node.c
+@@ -603,7 +603,6 @@ void unregister_one_node(int nid)
+ return;
+
+ unregister_node(node_devices[nid]);
+- kfree(node_devices[nid]);
+ node_devices[nid] = NULL;
+ }
+
+diff --git a/drivers/crypto/caam/caamhash.c b/drivers/crypto/caam/caamhash.c
+index 0d9284ef96a8..42e41f3b5cf1 100644
+--- a/drivers/crypto/caam/caamhash.c
++++ b/drivers/crypto/caam/caamhash.c
+@@ -1338,9 +1338,9 @@ static int ahash_update_first(struct ahash_request *req)
+ struct device *jrdev = ctx->jrdev;
+ gfp_t flags = (req->base.flags & (CRYPTO_TFM_REQ_MAY_BACKLOG |
+ CRYPTO_TFM_REQ_MAY_SLEEP)) ? GFP_KERNEL : GFP_ATOMIC;
+- u8 *next_buf = state->buf_0 + state->current_buf *
+- CAAM_MAX_HASH_BLOCK_SIZE;
+- int *next_buflen = &state->buflen_0 + state->current_buf;
++ u8 *next_buf = state->current_buf ? state->buf_1 : state->buf_0;
++ int *next_buflen = state->current_buf ?
++ &state->buflen_1 : &state->buflen_0;
+ int to_hash;
+ u32 *sh_desc = ctx->sh_desc_update_first, *desc;
+ dma_addr_t ptr = ctx->sh_desc_update_first_dma;
+diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
+index 701f86cd5993..5f29c9a9a316 100644
+--- a/drivers/net/bonding/bond_main.c
++++ b/drivers/net/bonding/bond_main.c
+@@ -3667,8 +3667,14 @@ static int bond_xmit_roundrobin(struct sk_buff *skb, struct net_device *bond_dev
+ else
+ bond_xmit_slave_id(bond, skb, 0);
+ } else {
+- slave_id = bond_rr_gen_slave_id(bond);
+- bond_xmit_slave_id(bond, skb, slave_id % bond->slave_cnt);
++ int slave_cnt = ACCESS_ONCE(bond->slave_cnt);
++
++ if (likely(slave_cnt)) {
++ slave_id = bond_rr_gen_slave_id(bond);
++ bond_xmit_slave_id(bond, skb, slave_id % slave_cnt);
++ } else {
++ dev_kfree_skb_any(skb);
++ }
+ }
+
+ return NETDEV_TX_OK;
+@@ -3699,8 +3705,13 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
+ static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
+ {
+ struct bonding *bond = netdev_priv(bond_dev);
++ int slave_cnt = ACCESS_ONCE(bond->slave_cnt);
+
+- bond_xmit_slave_id(bond, skb, bond_xmit_hash(bond, skb) % bond->slave_cnt);
++ if (likely(slave_cnt))
++ bond_xmit_slave_id(bond, skb,
++ bond_xmit_hash(bond, skb) % slave_cnt);
++ else
++ dev_kfree_skb_any(skb);
+
+ return NETDEV_TX_OK;
+ }
+diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
+index 5776e503e4c5..6e4a6bddf56e 100644
+--- a/drivers/net/ethernet/broadcom/bcmsysport.c
++++ b/drivers/net/ethernet/broadcom/bcmsysport.c
+@@ -757,7 +757,8 @@ static irqreturn_t bcm_sysport_tx_isr(int irq, void *dev_id)
+ return IRQ_HANDLED;
+ }
+
+-static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
++static struct sk_buff *bcm_sysport_insert_tsb(struct sk_buff *skb,
++ struct net_device *dev)
+ {
+ struct sk_buff *nskb;
+ struct bcm_tsb *tsb;
+@@ -773,7 +774,7 @@ static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
+ if (!nskb) {
+ dev->stats.tx_errors++;
+ dev->stats.tx_dropped++;
+- return -ENOMEM;
++ return NULL;
+ }
+ skb = nskb;
+ }
+@@ -792,7 +793,7 @@ static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
+ ip_proto = ipv6_hdr(skb)->nexthdr;
+ break;
+ default:
+- return 0;
++ return skb;
+ }
+
+ /* Get the checksum offset and the L4 (transport) offset */
+@@ -810,7 +811,7 @@ static int bcm_sysport_insert_tsb(struct sk_buff *skb, struct net_device *dev)
+ tsb->l4_ptr_dest_map = csum_info;
+ }
+
+- return 0;
++ return skb;
+ }
+
+ static netdev_tx_t bcm_sysport_xmit(struct sk_buff *skb,
+@@ -844,8 +845,8 @@ static netdev_tx_t bcm_sysport_xmit(struct sk_buff *skb,
+
+ /* Insert TSB and checksum infos */
+ if (priv->tsb_en) {
+- ret = bcm_sysport_insert_tsb(skb, dev);
+- if (ret) {
++ skb = bcm_sysport_insert_tsb(skb, dev);
++ if (!skb) {
+ ret = NETDEV_TX_OK;
+ goto out;
+ }
+diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+index 6a8b1453a1b9..73cfb21899a7 100644
+--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
++++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+@@ -10044,6 +10044,8 @@ static void bnx2x_prev_unload_close_mac(struct bnx2x *bp,
+ }
+
+ #define BNX2X_PREV_UNDI_PROD_ADDR(p) (BAR_TSTRORM_INTMEM + 0x1508 + ((p) << 4))
++#define BNX2X_PREV_UNDI_PROD_ADDR_H(f) (BAR_TSTRORM_INTMEM + \
++ 0x1848 + ((f) << 4))
+ #define BNX2X_PREV_UNDI_RCQ(val) ((val) & 0xffff)
+ #define BNX2X_PREV_UNDI_BD(val) ((val) >> 16 & 0xffff)
+ #define BNX2X_PREV_UNDI_PROD(rcq, bd) ((bd) << 16 | (rcq))
+@@ -10051,8 +10053,6 @@ static void bnx2x_prev_unload_close_mac(struct bnx2x *bp,
+ #define BCM_5710_UNDI_FW_MF_MAJOR (0x07)
+ #define BCM_5710_UNDI_FW_MF_MINOR (0x08)
+ #define BCM_5710_UNDI_FW_MF_VERS (0x05)
+-#define BNX2X_PREV_UNDI_MF_PORT(p) (BAR_TSTRORM_INTMEM + 0x150c + ((p) << 4))
+-#define BNX2X_PREV_UNDI_MF_FUNC(f) (BAR_TSTRORM_INTMEM + 0x184c + ((f) << 4))
+
+ static bool bnx2x_prev_is_after_undi(struct bnx2x *bp)
+ {
+@@ -10071,72 +10071,25 @@ static bool bnx2x_prev_is_after_undi(struct bnx2x *bp)
+ return false;
+ }
+
+-static bool bnx2x_prev_unload_undi_fw_supports_mf(struct bnx2x *bp)
+-{
+- u8 major, minor, version;
+- u32 fw;
+-
+- /* Must check that FW is loaded */
+- if (!(REG_RD(bp, MISC_REG_RESET_REG_1) &
+- MISC_REGISTERS_RESET_REG_1_RST_XSEM)) {
+- BNX2X_DEV_INFO("XSEM is reset - UNDI MF FW is not loaded\n");
+- return false;
+- }
+-
+- /* Read Currently loaded FW version */
+- fw = REG_RD(bp, XSEM_REG_PRAM);
+- major = fw & 0xff;
+- minor = (fw >> 0x8) & 0xff;
+- version = (fw >> 0x10) & 0xff;
+- BNX2X_DEV_INFO("Loaded FW: 0x%08x: Major 0x%02x Minor 0x%02x Version 0x%02x\n",
+- fw, major, minor, version);
+-
+- if (major > BCM_5710_UNDI_FW_MF_MAJOR)
+- return true;
+-
+- if ((major == BCM_5710_UNDI_FW_MF_MAJOR) &&
+- (minor > BCM_5710_UNDI_FW_MF_MINOR))
+- return true;
+-
+- if ((major == BCM_5710_UNDI_FW_MF_MAJOR) &&
+- (minor == BCM_5710_UNDI_FW_MF_MINOR) &&
+- (version >= BCM_5710_UNDI_FW_MF_VERS))
+- return true;
+-
+- return false;
+-}
+-
+-static void bnx2x_prev_unload_undi_mf(struct bnx2x *bp)
+-{
+- int i;
+-
+- /* Due to legacy (FW) code, the first function on each engine has a
+- * different offset macro from the rest of the functions.
+- * Setting this for all 8 functions is harmless regardless of whether
+- * this is actually a multi-function device.
+- */
+- for (i = 0; i < 2; i++)
+- REG_WR(bp, BNX2X_PREV_UNDI_MF_PORT(i), 1);
+-
+- for (i = 2; i < 8; i++)
+- REG_WR(bp, BNX2X_PREV_UNDI_MF_FUNC(i - 2), 1);
+-
+- BNX2X_DEV_INFO("UNDI FW (MF) set to discard\n");
+-}
+-
+-static void bnx2x_prev_unload_undi_inc(struct bnx2x *bp, u8 port, u8 inc)
++static void bnx2x_prev_unload_undi_inc(struct bnx2x *bp, u8 inc)
+ {
+ u16 rcq, bd;
+- u32 tmp_reg = REG_RD(bp, BNX2X_PREV_UNDI_PROD_ADDR(port));
++ u32 addr, tmp_reg;
+
++ if (BP_FUNC(bp) < 2)
++ addr = BNX2X_PREV_UNDI_PROD_ADDR(BP_PORT(bp));
++ else
++ addr = BNX2X_PREV_UNDI_PROD_ADDR_H(BP_FUNC(bp) - 2);
++
++ tmp_reg = REG_RD(bp, addr);
+ rcq = BNX2X_PREV_UNDI_RCQ(tmp_reg) + inc;
+ bd = BNX2X_PREV_UNDI_BD(tmp_reg) + inc;
+
+ tmp_reg = BNX2X_PREV_UNDI_PROD(rcq, bd);
+- REG_WR(bp, BNX2X_PREV_UNDI_PROD_ADDR(port), tmp_reg);
++ REG_WR(bp, addr, tmp_reg);
+
+- BNX2X_DEV_INFO("UNDI producer [%d] rings bd -> 0x%04x, rcq -> 0x%04x\n",
+- port, bd, rcq);
++ BNX2X_DEV_INFO("UNDI producer [%d/%d][%08x] rings bd -> 0x%04x, rcq -> 0x%04x\n",
++ BP_PORT(bp), BP_FUNC(bp), addr, bd, rcq);
+ }
+
+ static int bnx2x_prev_mcp_done(struct bnx2x *bp)
+@@ -10375,7 +10328,6 @@ static int bnx2x_prev_unload_common(struct bnx2x *bp)
+ /* Reset should be performed after BRB is emptied */
+ if (reset_reg & MISC_REGISTERS_RESET_REG_1_RST_BRB1) {
+ u32 timer_count = 1000;
+- bool need_write = true;
+
+ /* Close the MAC Rx to prevent BRB from filling up */
+ bnx2x_prev_unload_close_mac(bp, &mac_vals);
+@@ -10412,20 +10364,10 @@ static int bnx2x_prev_unload_common(struct bnx2x *bp)
+ else
+ timer_count--;
+
+- /* New UNDI FW supports MF and contains better
+- * cleaning methods - might be redundant but harmless.
+- */
+- if (bnx2x_prev_unload_undi_fw_supports_mf(bp)) {
+- if (need_write) {
+- bnx2x_prev_unload_undi_mf(bp);
+- need_write = false;
+- }
+- } else if (prev_undi) {
+- /* If UNDI resides in memory,
+- * manually increment it
+- */
+- bnx2x_prev_unload_undi_inc(bp, BP_PORT(bp), 1);
+- }
++ /* If UNDI resides in memory, manually increment it */
++ if (prev_undi)
++ bnx2x_prev_unload_undi_inc(bp, 1);
++
+ udelay(10);
+ }
+
+diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
+index a3dd5dc64f4c..8345c6523799 100644
+--- a/drivers/net/ethernet/broadcom/tg3.c
++++ b/drivers/net/ethernet/broadcom/tg3.c
+@@ -6918,7 +6918,8 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
+ skb->protocol = eth_type_trans(skb, tp->dev);
+
+ if (len > (tp->dev->mtu + ETH_HLEN) &&
+- skb->protocol != htons(ETH_P_8021Q)) {
++ skb->protocol != htons(ETH_P_8021Q) &&
++ skb->protocol != htons(ETH_P_8021AD)) {
+ dev_kfree_skb_any(skb);
+ goto drop_it_no_recycle;
+ }
+@@ -7914,8 +7915,6 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+
+ entry = tnapi->tx_prod;
+ base_flags = 0;
+- if (skb->ip_summed == CHECKSUM_PARTIAL)
+- base_flags |= TXD_FLAG_TCPUDP_CSUM;
+
+ mss = skb_shinfo(skb)->gso_size;
+ if (mss) {
+@@ -7929,6 +7928,13 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+
+ hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb) - ETH_HLEN;
+
++ /* HW/FW can not correctly segment packets that have been
++ * vlan encapsulated.
++ */
++ if (skb->protocol == htons(ETH_P_8021Q) ||
++ skb->protocol == htons(ETH_P_8021AD))
++ return tg3_tso_bug(tp, tnapi, txq, skb);
++
+ if (!skb_is_gso_v6(skb)) {
+ if (unlikely((ETH_HLEN + hdr_len) > 80) &&
+ tg3_flag(tp, TSO_BUG))
+@@ -7979,6 +7985,17 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+ base_flags |= tsflags << 12;
+ }
+ }
++ } else if (skb->ip_summed == CHECKSUM_PARTIAL) {
++ /* HW/FW can not correctly checksum packets that have been
++ * vlan encapsulated.
++ */
++ if (skb->protocol == htons(ETH_P_8021Q) ||
++ skb->protocol == htons(ETH_P_8021AD)) {
++ if (skb_checksum_help(skb))
++ goto drop;
++ } else {
++ base_flags |= TXD_FLAG_TCPUDP_CSUM;
++ }
+ }
+
+ if (tg3_flag(tp, USE_JUMBO_BDFLAG) &&
+diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
+index e9daa072ebb4..45b13fda6bed 100644
+--- a/drivers/net/ethernet/cadence/macb.c
++++ b/drivers/net/ethernet/cadence/macb.c
+@@ -30,7 +30,6 @@
+ #include <linux/of_device.h>
+ #include <linux/of_mdio.h>
+ #include <linux/of_net.h>
+-#include <linux/pinctrl/consumer.h>
+
+ #include "macb.h"
+
+@@ -1803,7 +1802,6 @@ static int __init macb_probe(struct platform_device *pdev)
+ struct phy_device *phydev;
+ u32 config;
+ int err = -ENXIO;
+- struct pinctrl *pinctrl;
+ const char *mac;
+
+ regs = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+@@ -1812,15 +1810,6 @@ static int __init macb_probe(struct platform_device *pdev)
+ goto err_out;
+ }
+
+- pinctrl = devm_pinctrl_get_select_default(&pdev->dev);
+- if (IS_ERR(pinctrl)) {
+- err = PTR_ERR(pinctrl);
+- if (err == -EPROBE_DEFER)
+- goto err_out;
+-
+- dev_warn(&pdev->dev, "No pinctrl provided\n");
+- }
+-
+ err = -ENOMEM;
+ dev = alloc_etherdev(sizeof(*bp));
+ if (!dev)
+diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
+index 5d940a26055c..c9d2988e364d 100644
+--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
++++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
+@@ -2380,6 +2380,22 @@ struct mlx4_slaves_pport mlx4_phys_to_slaves_pport_actv(
+ }
+ EXPORT_SYMBOL_GPL(mlx4_phys_to_slaves_pport_actv);
+
++static int mlx4_slaves_closest_port(struct mlx4_dev *dev, int slave, int port)
++{
++ struct mlx4_active_ports actv_ports = mlx4_get_active_ports(dev, slave);
++ int min_port = find_first_bit(actv_ports.ports, dev->caps.num_ports)
++ + 1;
++ int max_port = min_port +
++ bitmap_weight(actv_ports.ports, dev->caps.num_ports);
++
++ if (port < min_port)
++ port = min_port;
++ else if (port >= max_port)
++ port = max_port - 1;
++
++ return port;
++}
++
+ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac)
+ {
+ struct mlx4_priv *priv = mlx4_priv(dev);
+@@ -2393,6 +2409,7 @@ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac)
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
+ s_info->mac = mac;
+ mlx4_info(dev, "default mac on vf %d port %d to %llX will take afect only after vf restart\n",
+@@ -2419,6 +2436,7 @@ int mlx4_set_vf_vlan(struct mlx4_dev *dev, int port, int vf, u16 vlan, u8 qos)
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ vf_admin = &priv->mfunc.master.vf_admin[slave].vport[port];
+
+ if ((0 == vlan) && (0 == qos))
+@@ -2446,6 +2464,7 @@ bool mlx4_get_slave_default_vlan(struct mlx4_dev *dev, int port, int slave,
+ struct mlx4_priv *priv;
+
+ priv = mlx4_priv(dev);
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ vp_oper = &priv->mfunc.master.vf_oper[slave].vport[port];
+
+ if (MLX4_VGT != vp_oper->state.default_vlan) {
+@@ -2473,6 +2492,7 @@ int mlx4_set_vf_spoofchk(struct mlx4_dev *dev, int port, int vf, bool setting)
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
+ s_info->spoofchk = setting;
+
+@@ -2526,6 +2546,7 @@ int mlx4_set_vf_link_state(struct mlx4_dev *dev, int port, int vf, int link_stat
+ if (slave < 0)
+ return -EINVAL;
+
++ port = mlx4_slaves_closest_port(dev, slave, port);
+ switch (link_state) {
+ case IFLA_VF_LINK_STATE_AUTO:
+ /* get current link state */
+diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
+index 82ab427290c3..3bdc11e44ec3 100644
+--- a/drivers/net/ethernet/mellanox/mlx4/main.c
++++ b/drivers/net/ethernet/mellanox/mlx4/main.c
+@@ -78,13 +78,13 @@ MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
+ #endif /* CONFIG_PCI_MSI */
+
+ static uint8_t num_vfs[3] = {0, 0, 0};
+-static int num_vfs_argc = 3;
++static int num_vfs_argc;
+ module_param_array(num_vfs, byte , &num_vfs_argc, 0444);
+ MODULE_PARM_DESC(num_vfs, "enable #num_vfs functions if num_vfs > 0\n"
+ "num_vfs=port1,port2,port1+2");
+
+ static uint8_t probe_vf[3] = {0, 0, 0};
+-static int probe_vfs_argc = 3;
++static int probe_vfs_argc;
+ module_param_array(probe_vf, byte, &probe_vfs_argc, 0444);
+ MODULE_PARM_DESC(probe_vf, "number of vfs to probe by pf driver (num_vfs > 0)\n"
+ "probe_vf=port1,port2,port1+2");
+diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+index f3d5d79f1cd1..a173c985aa73 100644
+--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
++++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+@@ -872,6 +872,10 @@ static int myri10ge_dma_test(struct myri10ge_priv *mgp, int test_type)
+ return -ENOMEM;
+ dmatest_bus = pci_map_page(mgp->pdev, dmatest_page, 0, PAGE_SIZE,
+ DMA_BIDIRECTIONAL);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, dmatest_bus))) {
++ __free_page(dmatest_page);
++ return -ENOMEM;
++ }
+
+ /* Run a small DMA test.
+ * The magic multipliers to the length tell the firmware
+@@ -1293,6 +1297,7 @@ myri10ge_alloc_rx_pages(struct myri10ge_priv *mgp, struct myri10ge_rx_buf *rx,
+ int bytes, int watchdog)
+ {
+ struct page *page;
++ dma_addr_t bus;
+ int idx;
+ #if MYRI10GE_ALLOC_SIZE > 4096
+ int end_offset;
+@@ -1317,11 +1322,21 @@ myri10ge_alloc_rx_pages(struct myri10ge_priv *mgp, struct myri10ge_rx_buf *rx,
+ rx->watchdog_needed = 1;
+ return;
+ }
++
++ bus = pci_map_page(mgp->pdev, page, 0,
++ MYRI10GE_ALLOC_SIZE,
++ PCI_DMA_FROMDEVICE);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, bus))) {
++ __free_pages(page, MYRI10GE_ALLOC_ORDER);
++ if (rx->fill_cnt - rx->cnt < 16)
++ rx->watchdog_needed = 1;
++ return;
++ }
++
+ rx->page = page;
+ rx->page_offset = 0;
+- rx->bus = pci_map_page(mgp->pdev, page, 0,
+- MYRI10GE_ALLOC_SIZE,
+- PCI_DMA_FROMDEVICE);
++ rx->bus = bus;
++
+ }
+ rx->info[idx].page = rx->page;
+ rx->info[idx].page_offset = rx->page_offset;
+@@ -2763,6 +2778,35 @@ myri10ge_submit_req(struct myri10ge_tx_buf *tx, struct mcp_kreq_ether_send *src,
+ mb();
+ }
+
++static void myri10ge_unmap_tx_dma(struct myri10ge_priv *mgp,
++ struct myri10ge_tx_buf *tx, int idx)
++{
++ unsigned int len;
++ int last_idx;
++
++ /* Free any DMA resources we've alloced and clear out the skb slot */
++ last_idx = (idx + 1) & tx->mask;
++ idx = tx->req & tx->mask;
++ do {
++ len = dma_unmap_len(&tx->info[idx], len);
++ if (len) {
++ if (tx->info[idx].skb != NULL)
++ pci_unmap_single(mgp->pdev,
++ dma_unmap_addr(&tx->info[idx],
++ bus), len,
++ PCI_DMA_TODEVICE);
++ else
++ pci_unmap_page(mgp->pdev,
++ dma_unmap_addr(&tx->info[idx],
++ bus), len,
++ PCI_DMA_TODEVICE);
++ dma_unmap_len_set(&tx->info[idx], len, 0);
++ tx->info[idx].skb = NULL;
++ }
++ idx = (idx + 1) & tx->mask;
++ } while (idx != last_idx);
++}
++
+ /*
+ * Transmit a packet. We need to split the packet so that a single
+ * segment does not cross myri10ge->tx_boundary, so this makes segment
+@@ -2786,7 +2830,7 @@ static netdev_tx_t myri10ge_xmit(struct sk_buff *skb,
+ u32 low;
+ __be32 high_swapped;
+ unsigned int len;
+- int idx, last_idx, avail, frag_cnt, frag_idx, count, mss, max_segments;
++ int idx, avail, frag_cnt, frag_idx, count, mss, max_segments;
+ u16 pseudo_hdr_offset, cksum_offset, queue;
+ int cum_len, seglen, boundary, rdma_count;
+ u8 flags, odd_flag;
+@@ -2883,9 +2927,12 @@ again:
+
+ /* map the skb for DMA */
+ len = skb_headlen(skb);
++ bus = pci_map_single(mgp->pdev, skb->data, len, PCI_DMA_TODEVICE);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, bus)))
++ goto drop;
++
+ idx = tx->req & tx->mask;
+ tx->info[idx].skb = skb;
+- bus = pci_map_single(mgp->pdev, skb->data, len, PCI_DMA_TODEVICE);
+ dma_unmap_addr_set(&tx->info[idx], bus, bus);
+ dma_unmap_len_set(&tx->info[idx], len, len);
+
+@@ -2984,12 +3031,16 @@ again:
+ break;
+
+ /* map next fragment for DMA */
+- idx = (count + tx->req) & tx->mask;
+ frag = &skb_shinfo(skb)->frags[frag_idx];
+ frag_idx++;
+ len = skb_frag_size(frag);
+ bus = skb_frag_dma_map(&mgp->pdev->dev, frag, 0, len,
+ DMA_TO_DEVICE);
++ if (unlikely(pci_dma_mapping_error(mgp->pdev, bus))) {
++ myri10ge_unmap_tx_dma(mgp, tx, idx);
++ goto drop;
++ }
++ idx = (count + tx->req) & tx->mask;
+ dma_unmap_addr_set(&tx->info[idx], bus, bus);
+ dma_unmap_len_set(&tx->info[idx], len, len);
+ }
+@@ -3020,31 +3071,8 @@ again:
+ return NETDEV_TX_OK;
+
+ abort_linearize:
+- /* Free any DMA resources we've alloced and clear out the skb
+- * slot so as to not trip up assertions, and to avoid a
+- * double-free if linearizing fails */
++ myri10ge_unmap_tx_dma(mgp, tx, idx);
+
+- last_idx = (idx + 1) & tx->mask;
+- idx = tx->req & tx->mask;
+- tx->info[idx].skb = NULL;
+- do {
+- len = dma_unmap_len(&tx->info[idx], len);
+- if (len) {
+- if (tx->info[idx].skb != NULL)
+- pci_unmap_single(mgp->pdev,
+- dma_unmap_addr(&tx->info[idx],
+- bus), len,
+- PCI_DMA_TODEVICE);
+- else
+- pci_unmap_page(mgp->pdev,
+- dma_unmap_addr(&tx->info[idx],
+- bus), len,
+- PCI_DMA_TODEVICE);
+- dma_unmap_len_set(&tx->info[idx], len, 0);
+- tx->info[idx].skb = NULL;
+- }
+- idx = (idx + 1) & tx->mask;
+- } while (idx != last_idx);
+ if (skb_is_gso(skb)) {
+ netdev_err(mgp->dev, "TSO but wanted to linearize?!?!?\n");
+ goto drop;
+diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
+index d97d5f39a04e..7edf976ecfa0 100644
+--- a/drivers/net/hyperv/netvsc.c
++++ b/drivers/net/hyperv/netvsc.c
+@@ -708,6 +708,7 @@ int netvsc_send(struct hv_device *device,
+ unsigned int section_index = NETVSC_INVALID_INDEX;
+ u32 msg_size = 0;
+ struct sk_buff *skb;
++ u16 q_idx = packet->q_idx;
+
+
+ net_device = get_outbound_net_device(device);
+@@ -772,24 +773,24 @@ int netvsc_send(struct hv_device *device,
+
+ if (ret == 0) {
+ atomic_inc(&net_device->num_outstanding_sends);
+- atomic_inc(&net_device->queue_sends[packet->q_idx]);
++ atomic_inc(&net_device->queue_sends[q_idx]);
+
+ if (hv_ringbuf_avail_percent(&out_channel->outbound) <
+ RING_AVAIL_PERCENT_LOWATER) {
+ netif_tx_stop_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
++ ndev, q_idx));
+
+ if (atomic_read(&net_device->
+- queue_sends[packet->q_idx]) < 1)
++ queue_sends[q_idx]) < 1)
+ netif_tx_wake_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
++ ndev, q_idx));
+ }
+ } else if (ret == -EAGAIN) {
+ netif_tx_stop_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
+- if (atomic_read(&net_device->queue_sends[packet->q_idx]) < 1) {
++ ndev, q_idx));
++ if (atomic_read(&net_device->queue_sends[q_idx]) < 1) {
+ netif_tx_wake_queue(netdev_get_tx_queue(
+- ndev, packet->q_idx));
++ ndev, q_idx));
+ ret = -ENOSPC;
+ }
+ } else {
+diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
+index 4fd71b75e666..f15297201777 100644
+--- a/drivers/net/hyperv/netvsc_drv.c
++++ b/drivers/net/hyperv/netvsc_drv.c
+@@ -387,6 +387,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct net_device *net)
+ int hdr_offset;
+ u32 net_trans_info;
+ u32 hash;
++ u32 skb_length = skb->len;
+
+
+ /* We will atmost need two pages to describe the rndis
+@@ -562,7 +563,7 @@ do_send:
+
+ drop:
+ if (ret == 0) {
+- net->stats.tx_bytes += skb->len;
++ net->stats.tx_bytes += skb_length;
+ net->stats.tx_packets++;
+ } else {
+ kfree(packet);
+diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
+index ef8a5c20236a..f3008e3cf118 100644
+--- a/drivers/net/macvlan.c
++++ b/drivers/net/macvlan.c
+@@ -36,6 +36,7 @@
+ #include <linux/netpoll.h>
+
+ #define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE)
++#define MACVLAN_BC_QUEUE_LEN 1000
+
+ struct macvlan_port {
+ struct net_device *dev;
+@@ -45,10 +46,9 @@ struct macvlan_port {
+ struct sk_buff_head bc_queue;
+ struct work_struct bc_work;
+ bool passthru;
++ int count;
+ };
+
+-#define MACVLAN_PORT_IS_EMPTY(port) list_empty(&port->vlans)
+-
+ struct macvlan_skb_cb {
+ const struct macvlan_dev *src;
+ };
+@@ -249,7 +249,7 @@ static void macvlan_broadcast_enqueue(struct macvlan_port *port,
+ goto err;
+
+ spin_lock(&port->bc_queue.lock);
+- if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) {
++ if (skb_queue_len(&port->bc_queue) < MACVLAN_BC_QUEUE_LEN) {
+ __skb_queue_tail(&port->bc_queue, nskb);
+ err = 0;
+ }
+@@ -667,7 +667,8 @@ static void macvlan_uninit(struct net_device *dev)
+
+ free_percpu(vlan->pcpu_stats);
+
+- if (MACVLAN_PORT_IS_EMPTY(port))
++ port->count -= 1;
++ if (!port->count)
+ macvlan_port_destroy(port->dev);
+ }
+
+@@ -800,6 +801,7 @@ static netdev_features_t macvlan_fix_features(struct net_device *dev,
+ features,
+ mask);
+ features |= ALWAYS_ON_FEATURES;
++ features &= ~NETIF_F_NETNS_LOCAL;
+
+ return features;
+ }
+@@ -1020,12 +1022,13 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+ vlan->flags = nla_get_u16(data[IFLA_MACVLAN_FLAGS]);
+
+ if (vlan->mode == MACVLAN_MODE_PASSTHRU) {
+- if (!MACVLAN_PORT_IS_EMPTY(port))
++ if (port->count)
+ return -EINVAL;
+ port->passthru = true;
+ eth_hw_addr_inherit(dev, lowerdev);
+ }
+
++ port->count += 1;
+ err = register_netdevice(dev);
+ if (err < 0)
+ goto destroy_port;
+@@ -1043,7 +1046,8 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+ unregister_netdev:
+ unregister_netdevice(dev);
+ destroy_port:
+- if (MACVLAN_PORT_IS_EMPTY(port))
++ port->count -= 1;
++ if (!port->count)
+ macvlan_port_destroy(lowerdev);
+
+ return err;
+diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
+index 3381c4f91a8c..0c6adaaf898c 100644
+--- a/drivers/net/macvtap.c
++++ b/drivers/net/macvtap.c
+@@ -112,17 +112,15 @@ out:
+ return err;
+ }
+
++/* Requires RTNL */
+ static int macvtap_set_queue(struct net_device *dev, struct file *file,
+ struct macvtap_queue *q)
+ {
+ struct macvlan_dev *vlan = netdev_priv(dev);
+- int err = -EBUSY;
+
+- rtnl_lock();
+ if (vlan->numqueues == MAX_MACVTAP_QUEUES)
+- goto out;
++ return -EBUSY;
+
+- err = 0;
+ rcu_assign_pointer(q->vlan, vlan);
+ rcu_assign_pointer(vlan->taps[vlan->numvtaps], q);
+ sock_hold(&q->sk);
+@@ -136,9 +134,7 @@ static int macvtap_set_queue(struct net_device *dev, struct file *file,
+ vlan->numvtaps++;
+ vlan->numqueues++;
+
+-out:
+- rtnl_unlock();
+- return err;
++ return 0;
+ }
+
+ static int macvtap_disable_queue(struct macvtap_queue *q)
+@@ -454,11 +450,12 @@ static void macvtap_sock_destruct(struct sock *sk)
+ static int macvtap_open(struct inode *inode, struct file *file)
+ {
+ struct net *net = current->nsproxy->net_ns;
+- struct net_device *dev = dev_get_by_macvtap_minor(iminor(inode));
++ struct net_device *dev;
+ struct macvtap_queue *q;
+- int err;
++ int err = -ENODEV;
+
+- err = -ENODEV;
++ rtnl_lock();
++ dev = dev_get_by_macvtap_minor(iminor(inode));
+ if (!dev)
+ goto out;
+
+@@ -498,6 +495,7 @@ out:
+ if (dev)
+ dev_put(dev);
+
++ rtnl_unlock();
+ return err;
+ }
+
+diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
+index 180c49479c42..a4b08198fb9f 100644
+--- a/drivers/net/phy/smsc.c
++++ b/drivers/net/phy/smsc.c
+@@ -43,6 +43,22 @@ static int smsc_phy_ack_interrupt(struct phy_device *phydev)
+
+ static int smsc_phy_config_init(struct phy_device *phydev)
+ {
++ int rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
++
++ if (rc < 0)
++ return rc;
++
++ /* Enable energy detect mode for this SMSC Transceivers */
++ rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS,
++ rc | MII_LAN83C185_EDPWRDOWN);
++ if (rc < 0)
++ return rc;
++
++ return smsc_phy_ack_interrupt(phydev);
++}
++
++static int smsc_phy_reset(struct phy_device *phydev)
++{
+ int rc = phy_read(phydev, MII_LAN83C185_SPECIAL_MODES);
+ if (rc < 0)
+ return rc;
+@@ -66,18 +82,7 @@ static int smsc_phy_config_init(struct phy_device *phydev)
+ rc = phy_read(phydev, MII_BMCR);
+ } while (rc & BMCR_RESET);
+ }
+-
+- rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
+- if (rc < 0)
+- return rc;
+-
+- /* Enable energy detect mode for this SMSC Transceivers */
+- rc = phy_write(phydev, MII_LAN83C185_CTRL_STATUS,
+- rc | MII_LAN83C185_EDPWRDOWN);
+- if (rc < 0)
+- return rc;
+-
+- return smsc_phy_ack_interrupt (phydev);
++ return 0;
+ }
+
+ static int lan911x_config_init(struct phy_device *phydev)
+@@ -142,6 +147,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+@@ -164,6 +170,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+@@ -186,6 +193,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = genphy_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+@@ -230,6 +238,7 @@ static struct phy_driver smsc_phy_driver[] = {
+ .config_aneg = genphy_config_aneg,
+ .read_status = lan87xx_read_status,
+ .config_init = smsc_phy_config_init,
++ .soft_reset = smsc_phy_reset,
+
+ /* IRQ related */
+ .ack_interrupt = smsc_phy_ack_interrupt,
+diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
+index b4958c7ffa84..cb2a00e1d95a 100644
+--- a/drivers/net/team/team.c
++++ b/drivers/net/team/team.c
+@@ -647,7 +647,7 @@ static void team_notify_peers(struct team *team)
+ {
+ if (!team->notify_peers.count || !netif_running(team->dev))
+ return;
+- atomic_set(&team->notify_peers.count_pending, team->notify_peers.count);
++ atomic_add(team->notify_peers.count, &team->notify_peers.count_pending);
+ schedule_delayed_work(&team->notify_peers.dw, 0);
+ }
+
+@@ -687,7 +687,7 @@ static void team_mcast_rejoin(struct team *team)
+ {
+ if (!team->mcast_rejoin.count || !netif_running(team->dev))
+ return;
+- atomic_set(&team->mcast_rejoin.count_pending, team->mcast_rejoin.count);
++ atomic_add(team->mcast_rejoin.count, &team->mcast_rejoin.count_pending);
+ schedule_delayed_work(&team->mcast_rejoin.dw, 0);
+ }
+
+diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
+index 9f79192c9aa0..31a7ad0d7d5f 100644
+--- a/drivers/net/vxlan.c
++++ b/drivers/net/vxlan.c
+@@ -1325,7 +1325,7 @@ static int arp_reduce(struct net_device *dev, struct sk_buff *skb)
+ } else if (vxlan->flags & VXLAN_F_L3MISS) {
+ union vxlan_addr ipa = {
+ .sin.sin_addr.s_addr = tip,
+- .sa.sa_family = AF_INET,
++ .sin.sin_family = AF_INET,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+@@ -1486,7 +1486,7 @@ static int neigh_reduce(struct net_device *dev, struct sk_buff *skb)
+ } else if (vxlan->flags & VXLAN_F_L3MISS) {
+ union vxlan_addr ipa = {
+ .sin6.sin6_addr = msg->target,
+- .sa.sa_family = AF_INET6,
++ .sin6.sin6_family = AF_INET6,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+@@ -1519,7 +1519,7 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
+ if (!n && (vxlan->flags & VXLAN_F_L3MISS)) {
+ union vxlan_addr ipa = {
+ .sin.sin_addr.s_addr = pip->daddr,
+- .sa.sa_family = AF_INET,
++ .sin.sin_family = AF_INET,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+@@ -1540,7 +1540,7 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
+ if (!n && (vxlan->flags & VXLAN_F_L3MISS)) {
+ union vxlan_addr ipa = {
+ .sin6.sin6_addr = pip6->daddr,
+- .sa.sa_family = AF_INET6,
++ .sin6.sin6_family = AF_INET6,
+ };
+
+ vxlan_ip_miss(dev, &ipa);
+diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
+index 33137b3ba94d..370f6e46caf5 100644
+--- a/drivers/tty/serial/8250/8250_pci.c
++++ b/drivers/tty/serial/8250/8250_pci.c
+@@ -1790,6 +1790,7 @@ pci_wch_ch353_setup(struct serial_private *priv,
+ #define PCI_DEVICE_ID_COMMTECH_4222PCIE 0x0022
+ #define PCI_DEVICE_ID_BROADCOM_TRUMANAGE 0x160a
+ #define PCI_DEVICE_ID_AMCC_ADDIDATA_APCI7800 0x818e
++#define PCI_DEVICE_ID_INTEL_QRK_UART 0x0936
+
+ #define PCI_VENDOR_ID_SUNIX 0x1fd4
+ #define PCI_DEVICE_ID_SUNIX_1999 0x1999
+@@ -1900,6 +1901,13 @@ static struct pci_serial_quirk pci_serial_quirks[] __refdata = {
+ .subdevice = PCI_ANY_ID,
+ .setup = byt_serial_setup,
+ },
++ {
++ .vendor = PCI_VENDOR_ID_INTEL,
++ .device = PCI_DEVICE_ID_INTEL_QRK_UART,
++ .subvendor = PCI_ANY_ID,
++ .subdevice = PCI_ANY_ID,
++ .setup = pci_default_setup,
++ },
+ /*
+ * ITE
+ */
+@@ -2742,6 +2750,7 @@ enum pci_board_num_t {
+ pbn_ADDIDATA_PCIe_8_3906250,
+ pbn_ce4100_1_115200,
+ pbn_byt,
++ pbn_qrk,
+ pbn_omegapci,
+ pbn_NETMOS9900_2s_115200,
+ pbn_brcm_trumanage,
+@@ -3492,6 +3501,12 @@ static struct pciserial_board pci_boards[] = {
+ .uart_offset = 0x80,
+ .reg_shift = 2,
+ },
++ [pbn_qrk] = {
++ .flags = FL_BASE0,
++ .num_ports = 1,
++ .base_baud = 2764800,
++ .reg_shift = 2,
++ },
+ [pbn_omegapci] = {
+ .flags = FL_BASE0,
+ .num_ports = 8,
+@@ -5194,6 +5209,12 @@ static struct pci_device_id serial_pci_tbl[] = {
+ pbn_byt },
+
+ /*
++ * Intel Quark x1000
++ */
++ { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_QRK_UART,
++ PCI_ANY_ID, PCI_ANY_ID, 0, 0,
++ pbn_qrk },
++ /*
+ * Cronyx Omega PCI
+ */
+ { PCI_VENDOR_ID_PLX, PCI_DEVICE_ID_PLX_CRONYX_OMEGA,
+diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
+index 50e854509f55..ba2a8f3b8059 100644
+--- a/drivers/usb/core/hub.c
++++ b/drivers/usb/core/hub.c
+@@ -1983,8 +1983,10 @@ void usb_set_device_state(struct usb_device *udev,
+ || new_state == USB_STATE_SUSPENDED)
+ ; /* No change to wakeup settings */
+ else if (new_state == USB_STATE_CONFIGURED)
+- wakeup = udev->actconfig->desc.bmAttributes
+- & USB_CONFIG_ATT_WAKEUP;
++ wakeup = (udev->quirks &
++ USB_QUIRK_IGNORE_REMOTE_WAKEUP) ? 0 :
++ udev->actconfig->desc.bmAttributes &
++ USB_CONFIG_ATT_WAKEUP;
+ else
+ wakeup = 0;
+ }
+diff --git a/drivers/usb/core/quirks.c b/drivers/usb/core/quirks.c
+index 739ee8e8bdfd..5144d11d032c 100644
+--- a/drivers/usb/core/quirks.c
++++ b/drivers/usb/core/quirks.c
+@@ -160,6 +160,10 @@ static const struct usb_device_id usb_interface_quirk_list[] = {
+ { USB_VENDOR_AND_INTERFACE_INFO(0x046d, USB_CLASS_VIDEO, 1, 0),
+ .driver_info = USB_QUIRK_RESET_RESUME },
+
++ /* ASUS Base Station(T100) */
++ { USB_DEVICE(0x0b05, 0x17e0), .driver_info =
++ USB_QUIRK_IGNORE_REMOTE_WAKEUP },
++
+ { } /* terminating entry must be last */
+ };
+
+diff --git a/drivers/usb/musb/musb_dsps.c b/drivers/usb/musb/musb_dsps.c
+index 09529f94e72d..6983e805147b 100644
+--- a/drivers/usb/musb/musb_dsps.c
++++ b/drivers/usb/musb/musb_dsps.c
+@@ -780,6 +780,7 @@ static int dsps_suspend(struct device *dev)
+ struct musb *musb = platform_get_drvdata(glue->musb);
+ void __iomem *mbase = musb->ctrl_base;
+
++ del_timer_sync(&glue->timer);
+ glue->context.control = dsps_readl(mbase, wrp->control);
+ glue->context.epintr = dsps_readl(mbase, wrp->epintr_set);
+ glue->context.coreintr = dsps_readl(mbase, wrp->coreintr_set);
+@@ -805,6 +806,7 @@ static int dsps_resume(struct device *dev)
+ dsps_writel(mbase, wrp->mode, glue->context.mode);
+ dsps_writel(mbase, wrp->tx_mode, glue->context.tx_mode);
+ dsps_writel(mbase, wrp->rx_mode, glue->context.rx_mode);
++ setup_timer(&glue->timer, otg_timer, (unsigned long) musb);
+
+ return 0;
+ }
+diff --git a/drivers/usb/serial/cp210x.c b/drivers/usb/serial/cp210x.c
+index 330df5ce435b..63b2af2a87c0 100644
+--- a/drivers/usb/serial/cp210x.c
++++ b/drivers/usb/serial/cp210x.c
+@@ -122,6 +122,7 @@ static const struct usb_device_id id_table[] = {
+ { USB_DEVICE(0x10C4, 0x8665) }, /* AC-Services OBD-IF */
+ { USB_DEVICE(0x10C4, 0x88A4) }, /* MMB Networks ZigBee USB Device */
+ { USB_DEVICE(0x10C4, 0x88A5) }, /* Planet Innovation Ingeni ZigBee USB Device */
++ { USB_DEVICE(0x10C4, 0x8946) }, /* Ketra N1 Wireless Interface */
+ { USB_DEVICE(0x10C4, 0xEA60) }, /* Silicon Labs factory default */
+ { USB_DEVICE(0x10C4, 0xEA61) }, /* Silicon Labs factory default */
+ { USB_DEVICE(0x10C4, 0xEA70) }, /* Silicon Labs factory default */
+@@ -155,6 +156,7 @@ static const struct usb_device_id id_table[] = {
+ { USB_DEVICE(0x1ADB, 0x0001) }, /* Schweitzer Engineering C662 Cable */
+ { USB_DEVICE(0x1B1C, 0x1C00) }, /* Corsair USB Dongle */
+ { USB_DEVICE(0x1BE3, 0x07A6) }, /* WAGO 750-923 USB Service Cable */
++ { USB_DEVICE(0x1D6F, 0x0010) }, /* Seluxit ApS RF Dongle */
+ { USB_DEVICE(0x1E29, 0x0102) }, /* Festo CPX-USB */
+ { USB_DEVICE(0x1E29, 0x0501) }, /* Festo CMSP */
+ { USB_DEVICE(0x1FB9, 0x0100) }, /* Lake Shore Model 121 Current Source */
+diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
+index 3f42785f653c..27136935fec3 100644
+--- a/drivers/usb/storage/uas.c
++++ b/drivers/usb/storage/uas.c
+@@ -28,6 +28,7 @@
+ #include <scsi/scsi_tcq.h>
+
+ #include "uas-detect.h"
++#include "scsiglue.h"
+
+ /*
+ * The r00-r01c specs define this version of the SENSE IU data structure.
+@@ -49,6 +50,7 @@ struct uas_dev_info {
+ struct usb_anchor cmd_urbs;
+ struct usb_anchor sense_urbs;
+ struct usb_anchor data_urbs;
++ unsigned long flags;
+ int qdepth, resetting;
+ struct response_iu response;
+ unsigned cmd_pipe, status_pipe, data_in_pipe, data_out_pipe;
+@@ -714,6 +716,15 @@ static int uas_queuecommand_lck(struct scsi_cmnd *cmnd,
+
+ BUILD_BUG_ON(sizeof(struct uas_cmd_info) > sizeof(struct scsi_pointer));
+
++ if ((devinfo->flags & US_FL_NO_ATA_1X) &&
++ (cmnd->cmnd[0] == ATA_12 || cmnd->cmnd[0] == ATA_16)) {
++ memcpy(cmnd->sense_buffer, usb_stor_sense_invalidCDB,
++ sizeof(usb_stor_sense_invalidCDB));
++ cmnd->result = SAM_STAT_CHECK_CONDITION;
++ cmnd->scsi_done(cmnd);
++ return 0;
++ }
++
+ spin_lock_irqsave(&devinfo->lock, flags);
+
+ if (devinfo->resetting) {
+@@ -950,6 +961,10 @@ static int uas_slave_alloc(struct scsi_device *sdev)
+ static int uas_slave_configure(struct scsi_device *sdev)
+ {
+ struct uas_dev_info *devinfo = sdev->hostdata;
++
++ if (devinfo->flags & US_FL_NO_REPORT_OPCODES)
++ sdev->no_report_opcodes = 1;
++
+ scsi_set_tag_type(sdev, MSG_ORDERED_TAG);
+ scsi_activate_tcq(sdev, devinfo->qdepth - 2);
+ return 0;
+@@ -1080,6 +1095,8 @@ static int uas_probe(struct usb_interface *intf, const struct usb_device_id *id)
+ devinfo->resetting = 0;
+ devinfo->running_task = 0;
+ devinfo->shutdown = 0;
++ devinfo->flags = id->driver_info;
++ usb_stor_adjust_quirks(udev, &devinfo->flags);
+ init_usb_anchor(&devinfo->cmd_urbs);
+ init_usb_anchor(&devinfo->sense_urbs);
+ init_usb_anchor(&devinfo->data_urbs);
+diff --git a/drivers/usb/storage/unusual_uas.h b/drivers/usb/storage/unusual_uas.h
+index 7244444df8ee..8511b54a65d9 100644
+--- a/drivers/usb/storage/unusual_uas.h
++++ b/drivers/usb/storage/unusual_uas.h
+@@ -40,13 +40,38 @@
+ * and don't forget to CC: the USB development list <linux-usb@vger.kernel.org>
+ */
+
+-/*
+- * This is an example entry for the US_FL_IGNORE_UAS flag. Once we have an
+- * actual entry using US_FL_IGNORE_UAS this entry should be removed.
+- *
+- * UNUSUAL_DEV( 0xabcd, 0x1234, 0x0100, 0x0100,
+- * "Example",
+- * "Storage with broken UAS",
+- * USB_SC_DEVICE, USB_PR_DEVICE, NULL,
+- * US_FL_IGNORE_UAS),
+- */
++/* https://bugzilla.kernel.org/show_bug.cgi?id=79511 */
++UNUSUAL_DEV(0x0bc2, 0x2312, 0x0000, 0x9999,
++ "Seagate",
++ "Expansion Desk",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_ATA_1X),
++
++/* https://bbs.archlinux.org/viewtopic.php?id=183190 */
++UNUSUAL_DEV(0x0bc2, 0x3312, 0x0000, 0x9999,
++ "Seagate",
++ "Expansion Desk",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_ATA_1X),
++
++/* https://bbs.archlinux.org/viewtopic.php?id=183190 */
++UNUSUAL_DEV(0x0bc2, 0xab20, 0x0000, 0x9999,
++ "Seagate",
++ "Backup+ BK",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_ATA_1X),
++
++/* Reported-by: Claudio Bizzarri <claudio.bizzarri@gmail.com> */
++UNUSUAL_DEV(0x152d, 0x0567, 0x0000, 0x9999,
++ "JMicron",
++ "JMS567",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_NO_REPORT_OPCODES),
++
++/* Most ASM1051 based devices have issues with uas, blacklist them all */
++/* Reported-by: Hans de Goede <hdegoede@redhat.com> */
++UNUSUAL_DEV(0x174c, 0x5106, 0x0000, 0x9999,
++ "ASMedia",
++ "ASM1051",
++ USB_SC_DEVICE, USB_PR_DEVICE, NULL,
++ US_FL_IGNORE_UAS),
+diff --git a/drivers/usb/storage/usb.c b/drivers/usb/storage/usb.c
+index f1c96261a501..20c5bcc6d3df 100644
+--- a/drivers/usb/storage/usb.c
++++ b/drivers/usb/storage/usb.c
+@@ -476,7 +476,8 @@ void usb_stor_adjust_quirks(struct usb_device *udev, unsigned long *fflags)
+ US_FL_CAPACITY_OK | US_FL_IGNORE_RESIDUE |
+ US_FL_SINGLE_LUN | US_FL_NO_WP_DETECT |
+ US_FL_NO_READ_DISC_INFO | US_FL_NO_READ_CAPACITY_16 |
+- US_FL_INITIAL_READ10 | US_FL_WRITE_CACHE);
++ US_FL_INITIAL_READ10 | US_FL_WRITE_CACHE |
++ US_FL_NO_ATA_1X | US_FL_NO_REPORT_OPCODES);
+
+ p = quirks;
+ while (*p) {
+@@ -514,6 +515,9 @@ void usb_stor_adjust_quirks(struct usb_device *udev, unsigned long *fflags)
+ case 'e':
+ f |= US_FL_NO_READ_CAPACITY_16;
+ break;
++ case 'f':
++ f |= US_FL_NO_REPORT_OPCODES;
++ break;
+ case 'h':
+ f |= US_FL_CAPACITY_HEURISTICS;
+ break;
+@@ -541,6 +545,9 @@ void usb_stor_adjust_quirks(struct usb_device *udev, unsigned long *fflags)
+ case 's':
+ f |= US_FL_SINGLE_LUN;
+ break;
++ case 't':
++ f |= US_FL_NO_ATA_1X;
++ break;
+ case 'u':
+ f |= US_FL_IGNORE_UAS;
+ break;
+diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
+index 4967916fe4ac..d69f0577a319 100644
+--- a/include/linux/if_vlan.h
++++ b/include/linux/if_vlan.h
+@@ -187,7 +187,6 @@ vlan_dev_get_egress_qos_mask(struct net_device *dev, u32 skprio)
+ }
+
+ extern bool vlan_do_receive(struct sk_buff **skb);
+-extern struct sk_buff *vlan_untag(struct sk_buff *skb);
+
+ extern int vlan_vid_add(struct net_device *dev, __be16 proto, u16 vid);
+ extern void vlan_vid_del(struct net_device *dev, __be16 proto, u16 vid);
+@@ -241,11 +240,6 @@ static inline bool vlan_do_receive(struct sk_buff **skb)
+ return false;
+ }
+
+-static inline struct sk_buff *vlan_untag(struct sk_buff *skb)
+-{
+- return skb;
+-}
+-
+ static inline int vlan_vid_add(struct net_device *dev, __be16 proto, u16 vid)
+ {
+ return 0;
+diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
+index ec89301ada41..6bb6bd86b0dc 100644
+--- a/include/linux/skbuff.h
++++ b/include/linux/skbuff.h
+@@ -2549,6 +2549,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
+ void skb_scrub_packet(struct sk_buff *skb, bool xnet);
+ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
+ struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features);
++struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
+
+ struct skb_checksum_ops {
+ __wsum (*update)(const void *mem, int len, __wsum wsum);
+diff --git a/include/linux/usb/quirks.h b/include/linux/usb/quirks.h
+index 52f944dfe2fd..49587dc22f5d 100644
+--- a/include/linux/usb/quirks.h
++++ b/include/linux/usb/quirks.h
+@@ -30,4 +30,7 @@
+ descriptor */
+ #define USB_QUIRK_DELAY_INIT 0x00000040
+
++/* device generates spurious wakeup, ignore remote wakeup capability */
++#define USB_QUIRK_IGNORE_REMOTE_WAKEUP 0x00000200
++
+ #endif /* __LINUX_USB_QUIRKS_H */
+diff --git a/include/linux/usb_usual.h b/include/linux/usb_usual.h
+index 9b7de1b46437..a7f2604c5f25 100644
+--- a/include/linux/usb_usual.h
++++ b/include/linux/usb_usual.h
+@@ -73,6 +73,10 @@
+ /* Device advertises UAS but it is broken */ \
+ US_FLAG(BROKEN_FUA, 0x01000000) \
+ /* Cannot handle FUA in WRITE or READ CDBs */ \
++ US_FLAG(NO_ATA_1X, 0x02000000) \
++ /* Cannot handle ATA_12 or ATA_16 CDBs */ \
++ US_FLAG(NO_REPORT_OPCODES, 0x04000000) \
++ /* Cannot handle MI_REPORT_SUPPORTED_OPERATION_CODES */ \
+
+ #define US_FLAG(name, value) US_FL_##name = value ,
+ enum { US_DO_ALL_FLAGS };
+diff --git a/include/net/dst.h b/include/net/dst.h
+index 71c60f42be48..a8ae4e760778 100644
+--- a/include/net/dst.h
++++ b/include/net/dst.h
+@@ -480,6 +480,7 @@ void dst_init(void);
+ /* Flags for xfrm_lookup flags argument. */
+ enum {
+ XFRM_LOOKUP_ICMP = 1 << 0,
++ XFRM_LOOKUP_QUEUE = 1 << 1,
+ };
+
+ struct flowi;
+@@ -490,7 +491,16 @@ static inline struct dst_entry *xfrm_lookup(struct net *net,
+ int flags)
+ {
+ return dst_orig;
+-}
++}
++
++static inline struct dst_entry *xfrm_lookup_route(struct net *net,
++ struct dst_entry *dst_orig,
++ const struct flowi *fl,
++ struct sock *sk,
++ int flags)
++{
++ return dst_orig;
++}
+
+ static inline struct xfrm_state *dst_xfrm(const struct dst_entry *dst)
+ {
+@@ -502,6 +512,10 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+ const struct flowi *fl, struct sock *sk,
+ int flags);
+
++struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig,
++ const struct flowi *fl, struct sock *sk,
++ int flags);
++
+ /* skb attached with this dst needs transformation if dst->xfrm is valid */
+ static inline struct xfrm_state *dst_xfrm(const struct dst_entry *dst)
+ {
+diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
+index 7a4313887568..5fbe6568c3cf 100644
+--- a/include/net/inet_connection_sock.h
++++ b/include/net/inet_connection_sock.h
+@@ -62,6 +62,7 @@ struct inet_connection_sock_af_ops {
+ void (*addr2sockaddr)(struct sock *sk, struct sockaddr *);
+ int (*bind_conflict)(const struct sock *sk,
+ const struct inet_bind_bucket *tb, bool relax);
++ void (*mtu_reduced)(struct sock *sk);
+ };
+
+ /** inet_connection_sock - INET connection oriented sock
+diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
+index 9bcb220bd4ad..cf485f9aa563 100644
+--- a/include/net/ip6_fib.h
++++ b/include/net/ip6_fib.h
+@@ -114,16 +114,13 @@ struct rt6_info {
+ u32 rt6i_flags;
+ struct rt6key rt6i_src;
+ struct rt6key rt6i_prefsrc;
+- u32 rt6i_metric;
+
+ struct inet6_dev *rt6i_idev;
+ unsigned long _rt6i_peer;
+
+- u32 rt6i_genid;
+-
++ u32 rt6i_metric;
+ /* more non-fragment space at head required */
+ unsigned short rt6i_nfheader_len;
+-
+ u8 rt6i_protocol;
+ };
+
+diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
+index 361d26077196..e0d64667a4b3 100644
+--- a/include/net/net_namespace.h
++++ b/include/net/net_namespace.h
+@@ -352,26 +352,12 @@ static inline void rt_genid_bump_ipv4(struct net *net)
+ atomic_inc(&net->ipv4.rt_genid);
+ }
+
+-#if IS_ENABLED(CONFIG_IPV6)
+-static inline int rt_genid_ipv6(struct net *net)
+-{
+- return atomic_read(&net->ipv6.rt_genid);
+-}
+-
+-static inline void rt_genid_bump_ipv6(struct net *net)
+-{
+- atomic_inc(&net->ipv6.rt_genid);
+-}
+-#else
+-static inline int rt_genid_ipv6(struct net *net)
+-{
+- return 0;
+-}
+-
++extern void (*__fib6_flush_trees)(struct net *net);
+ static inline void rt_genid_bump_ipv6(struct net *net)
+ {
++ if (__fib6_flush_trees)
++ __fib6_flush_trees(net);
+ }
+-#endif
+
+ #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
+ static inline struct netns_ieee802154_lowpan *
+diff --git a/include/net/sctp/command.h b/include/net/sctp/command.h
+index 4b7cd695e431..cfcbc3f627bd 100644
+--- a/include/net/sctp/command.h
++++ b/include/net/sctp/command.h
+@@ -115,7 +115,7 @@ typedef enum {
+ * analysis of the state functions, but in reality just taken from
+ * thin air in the hopes othat we don't trigger a kernel panic.
+ */
+-#define SCTP_MAX_NUM_COMMANDS 14
++#define SCTP_MAX_NUM_COMMANDS 20
+
+ typedef union {
+ __s32 i32;
+diff --git a/include/net/sock.h b/include/net/sock.h
+index 156350745700..6cc7944d65bf 100644
+--- a/include/net/sock.h
++++ b/include/net/sock.h
+@@ -971,7 +971,6 @@ struct proto {
+ struct sk_buff *skb);
+
+ void (*release_cb)(struct sock *sk);
+- void (*mtu_reduced)(struct sock *sk);
+
+ /* Keeping track of sk's, looking them up, and port selection methods. */
+ void (*hash)(struct sock *sk);
+diff --git a/include/net/tcp.h b/include/net/tcp.h
+index 7286db80e8b8..d587ff0f8828 100644
+--- a/include/net/tcp.h
++++ b/include/net/tcp.h
+@@ -448,6 +448,7 @@ const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
+ */
+
+ void tcp_v4_send_check(struct sock *sk, struct sk_buff *skb);
++void tcp_v4_mtu_reduced(struct sock *sk);
+ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb);
+ struct sock *tcp_create_openreq_child(struct sock *sk,
+ struct request_sock *req,
+@@ -718,8 +719,10 @@ struct tcp_skb_cb {
+ #define TCPCB_SACKED_RETRANS 0x02 /* SKB retransmitted */
+ #define TCPCB_LOST 0x04 /* SKB is lost */
+ #define TCPCB_TAGBITS 0x07 /* All tag bits */
++#define TCPCB_REPAIRED 0x10 /* SKB repaired (no skb_mstamp) */
+ #define TCPCB_EVER_RETRANS 0x80 /* Ever retransmitted frame */
+-#define TCPCB_RETRANS (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS)
++#define TCPCB_RETRANS (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS| \
++ TCPCB_REPAIRED)
+
+ __u8 ip_dsfield; /* IPv4 tos or IPv6 dsfield */
+ /* 1 byte hole */
+diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
+index 75d427763992..90cc2bdd4064 100644
+--- a/net/8021q/vlan_core.c
++++ b/net/8021q/vlan_core.c
+@@ -112,59 +112,6 @@ __be16 vlan_dev_vlan_proto(const struct net_device *dev)
+ }
+ EXPORT_SYMBOL(vlan_dev_vlan_proto);
+
+-static struct sk_buff *vlan_reorder_header(struct sk_buff *skb)
+-{
+- if (skb_cow(skb, skb_headroom(skb)) < 0) {
+- kfree_skb(skb);
+- return NULL;
+- }
+-
+- memmove(skb->data - ETH_HLEN, skb->data - VLAN_ETH_HLEN, 2 * ETH_ALEN);
+- skb->mac_header += VLAN_HLEN;
+- return skb;
+-}
+-
+-struct sk_buff *vlan_untag(struct sk_buff *skb)
+-{
+- struct vlan_hdr *vhdr;
+- u16 vlan_tci;
+-
+- if (unlikely(vlan_tx_tag_present(skb))) {
+- /* vlan_tci is already set-up so leave this for another time */
+- return skb;
+- }
+-
+- skb = skb_share_check(skb, GFP_ATOMIC);
+- if (unlikely(!skb))
+- goto err_free;
+-
+- if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
+- goto err_free;
+-
+- vhdr = (struct vlan_hdr *) skb->data;
+- vlan_tci = ntohs(vhdr->h_vlan_TCI);
+- __vlan_hwaccel_put_tag(skb, skb->protocol, vlan_tci);
+-
+- skb_pull_rcsum(skb, VLAN_HLEN);
+- vlan_set_encap_proto(skb, vhdr);
+-
+- skb = vlan_reorder_header(skb);
+- if (unlikely(!skb))
+- goto err_free;
+-
+- skb_reset_network_header(skb);
+- skb_reset_transport_header(skb);
+- skb_reset_mac_len(skb);
+-
+- return skb;
+-
+-err_free:
+- kfree_skb(skb);
+- return NULL;
+-}
+-EXPORT_SYMBOL(vlan_untag);
+-
+-
+ /*
+ * vlan info and vid list
+ */
+diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
+index 23caf5b0309e..4fd47a1a0e9a 100644
+--- a/net/bridge/br_private.h
++++ b/net/bridge/br_private.h
+@@ -309,6 +309,9 @@ struct br_input_skb_cb {
+ int igmp;
+ int mrouters_only;
+ #endif
++#ifdef CONFIG_BRIDGE_VLAN_FILTERING
++ bool vlan_filtered;
++#endif
+ };
+
+ #define BR_INPUT_SKB_CB(__skb) ((struct br_input_skb_cb *)(__skb)->cb)
+diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
+index 2b2774fe0703..b03e884fba3e 100644
+--- a/net/bridge/br_vlan.c
++++ b/net/bridge/br_vlan.c
+@@ -127,7 +127,8 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
+ {
+ u16 vid;
+
+- if (!br->vlan_enabled)
++ /* If this packet was not filtered at input, let it pass */
++ if (!BR_INPUT_SKB_CB(skb)->vlan_filtered)
+ goto out;
+
+ /* Vlan filter table must be configured at this point. The
+@@ -166,8 +167,10 @@ bool br_allowed_ingress(struct net_bridge *br, struct net_port_vlans *v,
+ /* If VLAN filtering is disabled on the bridge, all packets are
+ * permitted.
+ */
+- if (!br->vlan_enabled)
++ if (!br->vlan_enabled) {
++ BR_INPUT_SKB_CB(skb)->vlan_filtered = false;
+ return true;
++ }
+
+ /* If there are no vlan in the permitted list, all packets are
+ * rejected.
+@@ -175,6 +178,7 @@ bool br_allowed_ingress(struct net_bridge *br, struct net_port_vlans *v,
+ if (!v)
+ goto drop;
+
++ BR_INPUT_SKB_CB(skb)->vlan_filtered = true;
+ proto = br->vlan_proto;
+
+ /* If vlan tx offload is disabled on bridge device and frame was
+@@ -183,7 +187,7 @@ bool br_allowed_ingress(struct net_bridge *br, struct net_port_vlans *v,
+ */
+ if (unlikely(!vlan_tx_tag_present(skb) &&
+ skb->protocol == proto)) {
+- skb = vlan_untag(skb);
++ skb = skb_vlan_untag(skb);
+ if (unlikely(!skb))
+ return false;
+ }
+@@ -253,7 +257,8 @@ bool br_allowed_egress(struct net_bridge *br,
+ {
+ u16 vid;
+
+- if (!br->vlan_enabled)
++ /* If this packet was not filtered at input, let it pass */
++ if (!BR_INPUT_SKB_CB(skb)->vlan_filtered)
+ return true;
+
+ if (!v)
+@@ -272,6 +277,7 @@ bool br_should_learn(struct net_bridge_port *p, struct sk_buff *skb, u16 *vid)
+ struct net_bridge *br = p->br;
+ struct net_port_vlans *v;
+
++ /* If filtering was disabled at input, let it pass. */
+ if (!br->vlan_enabled)
+ return true;
+
+diff --git a/net/core/dev.c b/net/core/dev.c
+index 367a586d0c8a..2647b508eb4d 100644
+--- a/net/core/dev.c
++++ b/net/core/dev.c
+@@ -2576,13 +2576,19 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
+ return harmonize_features(skb, features);
+ }
+
+- features &= (skb->dev->vlan_features | NETIF_F_HW_VLAN_CTAG_TX |
+- NETIF_F_HW_VLAN_STAG_TX);
++ features = netdev_intersect_features(features,
++ skb->dev->vlan_features |
++ NETIF_F_HW_VLAN_CTAG_TX |
++ NETIF_F_HW_VLAN_STAG_TX);
+
+ if (protocol == htons(ETH_P_8021Q) || protocol == htons(ETH_P_8021AD))
+- features &= NETIF_F_SG | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST |
+- NETIF_F_GEN_CSUM | NETIF_F_HW_VLAN_CTAG_TX |
+- NETIF_F_HW_VLAN_STAG_TX;
++ features = netdev_intersect_features(features,
++ NETIF_F_SG |
++ NETIF_F_HIGHDMA |
++ NETIF_F_FRAGLIST |
++ NETIF_F_GEN_CSUM |
++ NETIF_F_HW_VLAN_CTAG_TX |
++ NETIF_F_HW_VLAN_STAG_TX);
+
+ return harmonize_features(skb, features);
+ }
+@@ -3588,7 +3594,7 @@ another_round:
+
+ if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
+ skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+- skb = vlan_untag(skb);
++ skb = skb_vlan_untag(skb);
+ if (unlikely(!skb))
+ goto unlock;
+ }
+diff --git a/net/core/filter.c b/net/core/filter.c
+index 1dbf6462f766..3139f966a178 100644
+--- a/net/core/filter.c
++++ b/net/core/filter.c
+@@ -1318,6 +1318,7 @@ static int sk_store_orig_filter(struct sk_filter *fp,
+ fkprog->filter = kmemdup(fp->insns, fsize, GFP_KERNEL);
+ if (!fkprog->filter) {
+ kfree(fp->orig_prog);
++ fp->orig_prog = NULL;
+ return -ENOMEM;
+ }
+
+diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
+index 1063996f8317..e0b5ca349049 100644
+--- a/net/core/rtnetlink.c
++++ b/net/core/rtnetlink.c
+@@ -799,7 +799,8 @@ static inline int rtnl_vfinfo_size(const struct net_device *dev,
+ (nla_total_size(sizeof(struct ifla_vf_mac)) +
+ nla_total_size(sizeof(struct ifla_vf_vlan)) +
+ nla_total_size(sizeof(struct ifla_vf_spoofchk)) +
+- nla_total_size(sizeof(struct ifla_vf_rate)));
++ nla_total_size(sizeof(struct ifla_vf_rate)) +
++ nla_total_size(sizeof(struct ifla_vf_link_state)));
+ return size;
+ } else
+ return 0;
+diff --git a/net/core/skbuff.c b/net/core/skbuff.c
+index 58ff88edbefd..f5f14d54d6a2 100644
+--- a/net/core/skbuff.c
++++ b/net/core/skbuff.c
+@@ -62,6 +62,7 @@
+ #include <linux/scatterlist.h>
+ #include <linux/errqueue.h>
+ #include <linux/prefetch.h>
++#include <linux/if_vlan.h>
+
+ #include <net/protocol.h>
+ #include <net/dst.h>
+@@ -3151,6 +3152,9 @@ int skb_gro_receive(struct sk_buff **head, struct sk_buff *skb)
+ NAPI_GRO_CB(skb)->free = NAPI_GRO_FREE_STOLEN_HEAD;
+ goto done;
+ }
++ /* switch back to head shinfo */
++ pinfo = skb_shinfo(p);
++
+ if (pinfo->frag_list)
+ goto merge;
+ if (skb_gro_len(p) != pinfo->gso_size)
+@@ -3959,3 +3963,55 @@ unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
+ return shinfo->gso_size;
+ }
+ EXPORT_SYMBOL_GPL(skb_gso_transport_seglen);
++
++static struct sk_buff *skb_reorder_vlan_header(struct sk_buff *skb)
++{
++ if (skb_cow(skb, skb_headroom(skb)) < 0) {
++ kfree_skb(skb);
++ return NULL;
++ }
++
++ memmove(skb->data - ETH_HLEN, skb->data - VLAN_ETH_HLEN, 2 * ETH_ALEN);
++ skb->mac_header += VLAN_HLEN;
++ return skb;
++}
++
++struct sk_buff *skb_vlan_untag(struct sk_buff *skb)
++{
++ struct vlan_hdr *vhdr;
++ u16 vlan_tci;
++
++ if (unlikely(vlan_tx_tag_present(skb))) {
++ /* vlan_tci is already set-up so leave this for another time */
++ return skb;
++ }
++
++ skb = skb_share_check(skb, GFP_ATOMIC);
++ if (unlikely(!skb))
++ goto err_free;
++
++ if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
++ goto err_free;
++
++ vhdr = (struct vlan_hdr *)skb->data;
++ vlan_tci = ntohs(vhdr->h_vlan_TCI);
++ __vlan_hwaccel_put_tag(skb, skb->protocol, vlan_tci);
++
++ skb_pull_rcsum(skb, VLAN_HLEN);
++ vlan_set_encap_proto(skb, vhdr);
++
++ skb = skb_reorder_vlan_header(skb);
++ if (unlikely(!skb))
++ goto err_free;
++
++ skb_reset_network_header(skb);
++ skb_reset_transport_header(skb);
++ skb_reset_mac_len(skb);
++
++ return skb;
++
++err_free:
++ kfree_skb(skb);
++ return NULL;
++}
++EXPORT_SYMBOL(skb_vlan_untag);
+diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
+index 45920d928341..6c2719373bc5 100644
+--- a/net/ipv4/ip_tunnel.c
++++ b/net/ipv4/ip_tunnel.c
+@@ -764,9 +764,14 @@ int ip_tunnel_ioctl(struct net_device *dev, struct ip_tunnel_parm *p, int cmd)
+
+ t = ip_tunnel_find(itn, p, itn->fb_tunnel_dev->type);
+
+- if (!t && (cmd == SIOCADDTUNNEL)) {
+- t = ip_tunnel_create(net, itn, p);
+- err = PTR_ERR_OR_ZERO(t);
++ if (cmd == SIOCADDTUNNEL) {
++ if (!t) {
++ t = ip_tunnel_create(net, itn, p);
++ err = PTR_ERR_OR_ZERO(t);
++ break;
++ }
++
++ err = -EEXIST;
+ break;
+ }
+ if (dev != itn->fb_tunnel_dev && cmd == SIOCCHGTUNNEL) {
+diff --git a/net/ipv4/route.c b/net/ipv4/route.c
+index 190199851c9a..4b340c30a037 100644
+--- a/net/ipv4/route.c
++++ b/net/ipv4/route.c
+@@ -2267,9 +2267,9 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
+ return rt;
+
+ if (flp4->flowi4_proto)
+- rt = (struct rtable *) xfrm_lookup(net, &rt->dst,
+- flowi4_to_flowi(flp4),
+- sk, 0);
++ rt = (struct rtable *)xfrm_lookup_route(net, &rt->dst,
++ flowi4_to_flowi(flp4),
++ sk, 0);
+
+ return rt;
+ }
+diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
+index 9d2118e5fbc7..0717f45b5171 100644
+--- a/net/ipv4/tcp.c
++++ b/net/ipv4/tcp.c
+@@ -1175,13 +1175,6 @@ new_segment:
+ goto wait_for_memory;
+
+ /*
+- * All packets are restored as if they have
+- * already been sent.
+- */
+- if (tp->repair)
+- TCP_SKB_CB(skb)->when = tcp_time_stamp;
+-
+- /*
+ * Check whether we can use HW checksum.
+ */
+ if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
+@@ -1190,6 +1183,13 @@ new_segment:
+ skb_entail(sk, skb);
+ copy = size_goal;
+ max = size_goal;
++
++ /* All packets are restored as if they have
++ * already been sent. skb_mstamp isn't set to
++ * avoid wrong rtt estimation.
++ */
++ if (tp->repair)
++ TCP_SKB_CB(skb)->sacked |= TCPCB_REPAIRED;
+ }
+
+ /* Try to append data to the end of skb. */
+diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
+index 40639c288dc2..a1bbebb03490 100644
+--- a/net/ipv4/tcp_input.c
++++ b/net/ipv4/tcp_input.c
+@@ -2680,7 +2680,6 @@ static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
+ */
+ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
+ {
+- struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk);
+ bool recovered = !before(tp->snd_una, tp->high_seq);
+
+@@ -2706,12 +2705,9 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
+
+ if (recovered) {
+ /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
+- icsk->icsk_retransmits = 0;
+ tcp_try_undo_recovery(sk);
+ return;
+ }
+- if (flag & FLAG_DATA_ACKED)
+- icsk->icsk_retransmits = 0;
+ if (tcp_is_reno(tp)) {
+ /* A Reno DUPACK means new data in F-RTO step 2.b above are
+ * delivered. Lower inflight to clock out (re)tranmissions.
+@@ -3393,8 +3389,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+ icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
+ tcp_rearm_rto(sk);
+
+- if (after(ack, prior_snd_una))
++ if (after(ack, prior_snd_una)) {
+ flag |= FLAG_SND_UNA_ADVANCED;
++ icsk->icsk_retransmits = 0;
++ }
+
+ prior_fackets = tp->fackets_out;
+
+diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
+index 77cccda1ad0c..f63c524de5d9 100644
+--- a/net/ipv4/tcp_ipv4.c
++++ b/net/ipv4/tcp_ipv4.c
+@@ -269,7 +269,7 @@ EXPORT_SYMBOL(tcp_v4_connect);
+ * It can be called through tcp_release_cb() if socket was owned by user
+ * at the time tcp_v4_err() was called to handle ICMP message.
+ */
+-static void tcp_v4_mtu_reduced(struct sock *sk)
++void tcp_v4_mtu_reduced(struct sock *sk)
+ {
+ struct dst_entry *dst;
+ struct inet_sock *inet = inet_sk(sk);
+@@ -300,6 +300,7 @@ static void tcp_v4_mtu_reduced(struct sock *sk)
+ tcp_simple_retransmit(sk);
+ } /* else let the usual retransmit timer handle it */
+ }
++EXPORT_SYMBOL(tcp_v4_mtu_reduced);
+
+ static void do_redirect(struct sk_buff *skb, struct sock *sk)
+ {
+@@ -1880,6 +1881,7 @@ const struct inet_connection_sock_af_ops ipv4_specific = {
+ .compat_setsockopt = compat_ip_setsockopt,
+ .compat_getsockopt = compat_ip_getsockopt,
+ #endif
++ .mtu_reduced = tcp_v4_mtu_reduced,
+ };
+ EXPORT_SYMBOL(ipv4_specific);
+
+@@ -2499,7 +2501,6 @@ struct proto tcp_prot = {
+ .sendpage = tcp_sendpage,
+ .backlog_rcv = tcp_v4_do_rcv,
+ .release_cb = tcp_release_cb,
+- .mtu_reduced = tcp_v4_mtu_reduced,
+ .hash = inet_hash,
+ .unhash = inet_unhash,
+ .get_port = inet_csk_get_port,
+diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
+index 179b51e6bda3..4e4932b5079b 100644
+--- a/net/ipv4/tcp_output.c
++++ b/net/ipv4/tcp_output.c
+@@ -800,7 +800,7 @@ void tcp_release_cb(struct sock *sk)
+ __sock_put(sk);
+ }
+ if (flags & (1UL << TCP_MTU_REDUCED_DEFERRED)) {
+- sk->sk_prot->mtu_reduced(sk);
++ inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
+ __sock_put(sk);
+ }
+ }
+@@ -1916,8 +1916,11 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
+ tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
+ BUG_ON(!tso_segs);
+
+- if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE)
++ if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
++ /* "when" is used as a start point for the retransmit timer */
++ TCP_SKB_CB(skb)->when = tcp_time_stamp;
+ goto repair; /* Skip network transmission */
++ }
+
+ cwnd_quota = tcp_cwnd_test(tp, skb);
+ if (!cwnd_quota) {
+diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
+index 5667b3003af9..4a9a34954923 100644
+--- a/net/ipv6/addrconf.c
++++ b/net/ipv6/addrconf.c
+@@ -1679,14 +1679,12 @@ void addrconf_dad_failure(struct inet6_ifaddr *ifp)
+ addrconf_mod_dad_work(ifp, 0);
+ }
+
+-/* Join to solicited addr multicast group. */
+-
++/* Join to solicited addr multicast group.
++ * caller must hold RTNL */
+ void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr)
+ {
+ struct in6_addr maddr;
+
+- ASSERT_RTNL();
+-
+ if (dev->flags&(IFF_LOOPBACK|IFF_NOARP))
+ return;
+
+@@ -1694,12 +1692,11 @@ void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr)
+ ipv6_dev_mc_inc(dev, &maddr);
+ }
+
++/* caller must hold RTNL */
+ void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr)
+ {
+ struct in6_addr maddr;
+
+- ASSERT_RTNL();
+-
+ if (idev->dev->flags&(IFF_LOOPBACK|IFF_NOARP))
+ return;
+
+@@ -1707,12 +1704,11 @@ void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr)
+ __ipv6_dev_mc_dec(idev, &maddr);
+ }
+
++/* caller must hold RTNL */
+ static void addrconf_join_anycast(struct inet6_ifaddr *ifp)
+ {
+ struct in6_addr addr;
+
+- ASSERT_RTNL();
+-
+ if (ifp->prefix_len >= 127) /* RFC 6164 */
+ return;
+ ipv6_addr_prefix(&addr, &ifp->addr, ifp->prefix_len);
+@@ -1721,12 +1717,11 @@ static void addrconf_join_anycast(struct inet6_ifaddr *ifp)
+ ipv6_dev_ac_inc(ifp->idev->dev, &addr);
+ }
+
++/* caller must hold RTNL */
+ static void addrconf_leave_anycast(struct inet6_ifaddr *ifp)
+ {
+ struct in6_addr addr;
+
+- ASSERT_RTNL();
+-
+ if (ifp->prefix_len >= 127) /* RFC 6164 */
+ return;
+ ipv6_addr_prefix(&addr, &ifp->addr, ifp->prefix_len);
+@@ -4751,10 +4746,11 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
+
+ if (ip6_del_rt(ifp->rt))
+ dst_free(&ifp->rt->dst);
++
++ rt_genid_bump_ipv6(net);
+ break;
+ }
+ atomic_inc(&net->ipv6.dev_addr_genid);
+- rt_genid_bump_ipv6(net);
+ }
+
+ static void ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
+diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
+index e6960457f625..98cc4cd570e2 100644
+--- a/net/ipv6/addrconf_core.c
++++ b/net/ipv6/addrconf_core.c
+@@ -8,6 +8,13 @@
+ #include <net/addrconf.h>
+ #include <net/ip.h>
+
++/* if ipv6 module registers this function is used by xfrm to force all
++ * sockets to relookup their nodes - this is fairly expensive, be
++ * careful
++ */
++void (*__fib6_flush_trees)(struct net *);
++EXPORT_SYMBOL(__fib6_flush_trees);
++
+ #define IPV6_ADDR_SCOPE_TYPE(scope) ((scope) << 16)
+
+ static inline unsigned int ipv6_addr_scope2type(unsigned int scope)
+diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
+index 210183244689..ff2de7d9d8e6 100644
+--- a/net/ipv6/anycast.c
++++ b/net/ipv6/anycast.c
+@@ -77,6 +77,7 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ pac->acl_next = NULL;
+ pac->acl_addr = *addr;
+
++ rtnl_lock();
+ rcu_read_lock();
+ if (ifindex == 0) {
+ struct rt6_info *rt;
+@@ -137,6 +138,7 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ error:
+ rcu_read_unlock();
++ rtnl_unlock();
+ if (pac)
+ sock_kfree_s(sk, pac, sizeof(*pac));
+ return err;
+@@ -171,11 +173,13 @@ int ipv6_sock_ac_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ spin_unlock_bh(&ipv6_sk_ac_lock);
+
++ rtnl_lock();
+ rcu_read_lock();
+ dev = dev_get_by_index_rcu(net, pac->acl_ifindex);
+ if (dev)
+ ipv6_dev_ac_dec(dev, &pac->acl_addr);
+ rcu_read_unlock();
++ rtnl_unlock();
+
+ sock_kfree_s(sk, pac, sizeof(*pac));
+ return 0;
+@@ -198,6 +202,7 @@ void ipv6_sock_ac_close(struct sock *sk)
+ spin_unlock_bh(&ipv6_sk_ac_lock);
+
+ prev_index = 0;
++ rtnl_lock();
+ rcu_read_lock();
+ while (pac) {
+ struct ipv6_ac_socklist *next = pac->acl_next;
+@@ -212,6 +217,7 @@ void ipv6_sock_ac_close(struct sock *sk)
+ pac = next;
+ }
+ rcu_read_unlock();
++ rtnl_unlock();
+ }
+
+ static void aca_put(struct ifacaddr6 *ac)
+@@ -233,6 +239,8 @@ int ipv6_dev_ac_inc(struct net_device *dev, const struct in6_addr *addr)
+ struct rt6_info *rt;
+ int err;
+
++ ASSERT_RTNL();
++
+ idev = in6_dev_get(dev);
+
+ if (idev == NULL)
+@@ -302,6 +310,8 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
+ {
+ struct ifacaddr6 *aca, *prev_aca;
+
++ ASSERT_RTNL();
++
+ write_lock_bh(&idev->lock);
+ prev_aca = NULL;
+ for (aca = idev->ac_list; aca; aca = aca->aca_next) {
+diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
+index cb4459bd1d29..97b9fa8de377 100644
+--- a/net/ipv6/ip6_fib.c
++++ b/net/ipv6/ip6_fib.c
+@@ -643,7 +643,7 @@ static int fib6_commit_metrics(struct dst_entry *dst,
+ if (dst->flags & DST_HOST) {
+ mp = dst_metrics_write_ptr(dst);
+ } else {
+- mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
++ mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
+ if (!mp)
+ return -ENOMEM;
+ dst_init_metrics(dst, mp, 0);
+@@ -1605,6 +1605,24 @@ static void fib6_prune_clones(struct net *net, struct fib6_node *fn)
+ fib6_clean_tree(net, fn, fib6_prune_clone, 1, NULL);
+ }
+
++static int fib6_update_sernum(struct rt6_info *rt, void *arg)
++{
++ __u32 sernum = *(__u32 *)arg;
++
++ if (rt->rt6i_node &&
++ rt->rt6i_node->fn_sernum != sernum)
++ rt->rt6i_node->fn_sernum = sernum;
++
++ return 0;
++}
++
++static void fib6_flush_trees(struct net *net)
++{
++ __u32 new_sernum = fib6_new_sernum();
++
++ fib6_clean_all(net, fib6_update_sernum, &new_sernum);
++}
++
+ /*
+ * Garbage collection
+ */
+@@ -1788,6 +1806,8 @@ int __init fib6_init(void)
+ NULL);
+ if (ret)
+ goto out_unregister_subsys;
++
++ __fib6_flush_trees = fib6_flush_trees;
+ out:
+ return ret;
+
+diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
+index 3873181ed856..43bc1fc24621 100644
+--- a/net/ipv6/ip6_gre.c
++++ b/net/ipv6/ip6_gre.c
+@@ -778,7 +778,7 @@ static inline int ip6gre_xmit_ipv4(struct sk_buff *skb, struct net_device *dev)
+ encap_limit = t->parms.encap_limit;
+
+ memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6));
+- fl6.flowi6_proto = IPPROTO_IPIP;
++ fl6.flowi6_proto = IPPROTO_GRE;
+
+ dsfield = ipv4_get_dsfield(iph);
+
+@@ -828,7 +828,7 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, struct net_device *dev)
+ encap_limit = t->parms.encap_limit;
+
+ memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6));
+- fl6.flowi6_proto = IPPROTO_IPV6;
++ fl6.flowi6_proto = IPPROTO_GRE;
+
+ dsfield = ipv6_get_dsfield(ipv6h);
+ if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS)
+diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
+index 45702b8cd141..59345af6d3a7 100644
+--- a/net/ipv6/ip6_output.c
++++ b/net/ipv6/ip6_output.c
+@@ -1008,7 +1008,7 @@ struct dst_entry *ip6_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
+ if (final_dst)
+ fl6->daddr = *final_dst;
+
+- return xfrm_lookup(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
++ return xfrm_lookup_route(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
+ }
+ EXPORT_SYMBOL_GPL(ip6_dst_lookup_flow);
+
+@@ -1040,7 +1040,7 @@ struct dst_entry *ip6_sk_dst_lookup_flow(struct sock *sk, struct flowi6 *fl6,
+ if (final_dst)
+ fl6->daddr = *final_dst;
+
+- return xfrm_lookup(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
++ return xfrm_lookup_route(sock_net(sk), dst, flowi6_to_flowi(fl6), sk, 0);
+ }
+ EXPORT_SYMBOL_GPL(ip6_sk_dst_lookup_flow);
+
+diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
+index 617f0958e164..a23b655a7627 100644
+--- a/net/ipv6/mcast.c
++++ b/net/ipv6/mcast.c
+@@ -172,6 +172,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ mc_lst->next = NULL;
+ mc_lst->addr = *addr;
+
++ rtnl_lock();
+ rcu_read_lock();
+ if (ifindex == 0) {
+ struct rt6_info *rt;
+@@ -185,6 +186,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ if (dev == NULL) {
+ rcu_read_unlock();
++ rtnl_unlock();
+ sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
+ return -ENODEV;
+ }
+@@ -202,6 +204,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+
+ if (err) {
+ rcu_read_unlock();
++ rtnl_unlock();
+ sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
+ return err;
+ }
+@@ -212,6 +215,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ spin_unlock(&ipv6_sk_mc_lock);
+
+ rcu_read_unlock();
++ rtnl_unlock();
+
+ return 0;
+ }
+@@ -229,6 +233,7 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ if (!ipv6_addr_is_multicast(addr))
+ return -EINVAL;
+
++ rtnl_lock();
+ spin_lock(&ipv6_sk_mc_lock);
+ for (lnk = &np->ipv6_mc_list;
+ (mc_lst = rcu_dereference_protected(*lnk,
+@@ -252,12 +257,15 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
+ } else
+ (void) ip6_mc_leave_src(sk, mc_lst, NULL);
+ rcu_read_unlock();
++ rtnl_unlock();
++
+ atomic_sub(sizeof(*mc_lst), &sk->sk_omem_alloc);
+ kfree_rcu(mc_lst, rcu);
+ return 0;
+ }
+ }
+ spin_unlock(&ipv6_sk_mc_lock);
++ rtnl_unlock();
+
+ return -EADDRNOTAVAIL;
+ }
+@@ -302,6 +310,7 @@ void ipv6_sock_mc_close(struct sock *sk)
+ if (!rcu_access_pointer(np->ipv6_mc_list))
+ return;
+
++ rtnl_lock();
+ spin_lock(&ipv6_sk_mc_lock);
+ while ((mc_lst = rcu_dereference_protected(np->ipv6_mc_list,
+ lockdep_is_held(&ipv6_sk_mc_lock))) != NULL) {
+@@ -328,6 +337,7 @@ void ipv6_sock_mc_close(struct sock *sk)
+ spin_lock(&ipv6_sk_mc_lock);
+ }
+ spin_unlock(&ipv6_sk_mc_lock);
++ rtnl_unlock();
+ }
+
+ int ip6_mc_source(int add, int omode, struct sock *sk,
+@@ -845,6 +855,8 @@ int ipv6_dev_mc_inc(struct net_device *dev, const struct in6_addr *addr)
+ struct ifmcaddr6 *mc;
+ struct inet6_dev *idev;
+
++ ASSERT_RTNL();
++
+ /* we need to take a reference on idev */
+ idev = in6_dev_get(dev);
+
+@@ -916,6 +928,8 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
+ {
+ struct ifmcaddr6 *ma, **map;
+
++ ASSERT_RTNL();
++
+ write_lock_bh(&idev->lock);
+ for (map = &idev->mc_list; (ma=*map) != NULL; map = &ma->next) {
+ if (ipv6_addr_equal(&ma->mca_addr, addr)) {
+diff --git a/net/ipv6/route.c b/net/ipv6/route.c
+index f23fbd28a501..bafde82324c5 100644
+--- a/net/ipv6/route.c
++++ b/net/ipv6/route.c
+@@ -314,7 +314,6 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
+
+ memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
+ rt6_init_peer(rt, table ? &table->tb6_peers : net->ipv6.peers);
+- rt->rt6i_genid = rt_genid_ipv6(net);
+ INIT_LIST_HEAD(&rt->rt6i_siblings);
+ }
+ return rt;
+@@ -1098,9 +1097,6 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
+ * DST_OBSOLETE_FORCE_CHK which forces validation calls down
+ * into this function always.
+ */
+- if (rt->rt6i_genid != rt_genid_ipv6(dev_net(rt->dst.dev)))
+- return NULL;
+-
+ if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
+ return NULL;
+
+diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
+index 4f408176dc64..9906535ce9de 100644
+--- a/net/ipv6/sit.c
++++ b/net/ipv6/sit.c
+@@ -101,19 +101,19 @@ static struct ip_tunnel *ipip6_tunnel_lookup(struct net *net,
+ for_each_ip_tunnel_rcu(t, sitn->tunnels_r_l[h0 ^ h1]) {
+ if (local == t->parms.iph.saddr &&
+ remote == t->parms.iph.daddr &&
+- (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
++ (!dev || !t->parms.link || dev->ifindex == t->parms.link) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+ for_each_ip_tunnel_rcu(t, sitn->tunnels_r[h0]) {
+ if (remote == t->parms.iph.daddr &&
+- (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
++ (!dev || !t->parms.link || dev->ifindex == t->parms.link) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+ for_each_ip_tunnel_rcu(t, sitn->tunnels_l[h1]) {
+ if (local == t->parms.iph.saddr &&
+- (!dev || !t->parms.link || dev->iflink == t->parms.link) &&
++ (!dev || !t->parms.link || dev->ifindex == t->parms.link) &&
+ (t->dev->flags & IFF_UP))
+ return t;
+ }
+diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
+index 229239ad96b1..cb5125c5328d 100644
+--- a/net/ipv6/tcp_ipv6.c
++++ b/net/ipv6/tcp_ipv6.c
+@@ -1681,6 +1681,7 @@ static const struct inet_connection_sock_af_ops ipv6_specific = {
+ .compat_setsockopt = compat_ipv6_setsockopt,
+ .compat_getsockopt = compat_ipv6_getsockopt,
+ #endif
++ .mtu_reduced = tcp_v6_mtu_reduced,
+ };
+
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1711,6 +1712,7 @@ static const struct inet_connection_sock_af_ops ipv6_mapped = {
+ .compat_setsockopt = compat_ipv6_setsockopt,
+ .compat_getsockopt = compat_ipv6_getsockopt,
+ #endif
++ .mtu_reduced = tcp_v4_mtu_reduced,
+ };
+
+ #ifdef CONFIG_TCP_MD5SIG
+@@ -1950,7 +1952,6 @@ struct proto tcpv6_prot = {
+ .sendpage = tcp_sendpage,
+ .backlog_rcv = tcp_v6_do_rcv,
+ .release_cb = tcp_release_cb,
+- .mtu_reduced = tcp_v6_mtu_reduced,
+ .hash = tcp_v6_hash,
+ .unhash = inet_unhash,
+ .get_port = inet_csk_get_port,
+diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
+index 13752d96275e..b704a9356208 100644
+--- a/net/l2tp/l2tp_ppp.c
++++ b/net/l2tp/l2tp_ppp.c
+@@ -755,7 +755,8 @@ static int pppol2tp_connect(struct socket *sock, struct sockaddr *uservaddr,
+ /* If PMTU discovery was enabled, use the MTU that was discovered */
+ dst = sk_dst_get(tunnel->sock);
+ if (dst != NULL) {
+- u32 pmtu = dst_mtu(__sk_dst_get(tunnel->sock));
++ u32 pmtu = dst_mtu(dst);
++
+ if (pmtu != 0)
+ session->mtu = session->mru = pmtu -
+ PPPOL2TP_HEADER_OVERHEAD;
+diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
+index e6fac7e3db52..48fc607a211e 100644
+--- a/net/netlink/af_netlink.c
++++ b/net/netlink/af_netlink.c
+@@ -205,7 +205,7 @@ static int __netlink_deliver_tap_skb(struct sk_buff *skb,
+ nskb->protocol = htons((u16) sk->sk_protocol);
+ nskb->pkt_type = netlink_is_kernel(sk) ?
+ PACKET_KERNEL : PACKET_USER;
+-
++ skb_reset_network_header(nskb);
+ ret = dev_queue_xmit(nskb);
+ if (unlikely(ret > 0))
+ ret = net_xmit_errno(ret);
+diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
+index e70d8b18e962..10736e6b192b 100644
+--- a/net/openvswitch/actions.c
++++ b/net/openvswitch/actions.c
+@@ -42,6 +42,9 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
+
+ static int make_writable(struct sk_buff *skb, int write_len)
+ {
++ if (!pskb_may_pull(skb, write_len))
++ return -ENOMEM;
++
+ if (!skb_cloned(skb) || skb_clone_writable(skb, write_len))
+ return 0;
+
+@@ -70,6 +73,8 @@ static int __pop_vlan_tci(struct sk_buff *skb, __be16 *current_tci)
+
+ vlan_set_encap_proto(skb, vhdr);
+ skb->mac_header += VLAN_HLEN;
++ if (skb_network_offset(skb) < ETH_HLEN)
++ skb_set_network_header(skb, ETH_HLEN);
+ skb_reset_mac_len(skb);
+
+ return 0;
+diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
+index b85c67ccb797..3eb786fd3f22 100644
+--- a/net/packet/af_packet.c
++++ b/net/packet/af_packet.c
+@@ -636,6 +636,7 @@ static void init_prb_bdqc(struct packet_sock *po,
+ p1->tov_in_jiffies = msecs_to_jiffies(p1->retire_blk_tov);
+ p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv;
+
++ p1->max_frame_len = p1->kblk_size - BLK_PLUS_PRIV(p1->blk_sizeof_priv);
+ prb_init_ft_ops(p1, req_u);
+ prb_setup_retire_blk_timer(po, tx_ring);
+ prb_open_block(p1, pbd);
+@@ -1946,6 +1947,18 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
+ if ((int)snaplen < 0)
+ snaplen = 0;
+ }
++ } else if (unlikely(macoff + snaplen >
++ GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len)) {
++ u32 nval;
++
++ nval = GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len - macoff;
++ pr_err_once("tpacket_rcv: packet too big, clamped from %u to %u. macoff=%u\n",
++ snaplen, nval, macoff);
++ snaplen = nval;
++ if (unlikely((int)snaplen < 0)) {
++ snaplen = 0;
++ macoff = GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len;
++ }
+ }
+ spin_lock(&sk->sk_receive_queue.lock);
+ h.raw = packet_current_rx_frame(po, skb,
+@@ -3789,6 +3802,10 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
+ goto out;
+ if (unlikely(req->tp_block_size & (PAGE_SIZE - 1)))
+ goto out;
++ if (po->tp_version >= TPACKET_V3 &&
++ (int)(req->tp_block_size -
++ BLK_PLUS_PRIV(req_u->req3.tp_sizeof_priv)) <= 0)
++ goto out;
+ if (unlikely(req->tp_frame_size < po->tp_hdrlen +
+ po->tp_reserve))
+ goto out;
+diff --git a/net/packet/internal.h b/net/packet/internal.h
+index eb9580a6b25f..cdddf6a30399 100644
+--- a/net/packet/internal.h
++++ b/net/packet/internal.h
+@@ -29,6 +29,7 @@ struct tpacket_kbdq_core {
+ char *pkblk_start;
+ char *pkblk_end;
+ int kblk_size;
++ unsigned int max_frame_len;
+ unsigned int knum_blocks;
+ uint64_t knxt_seq_num;
+ char *prev;
+diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
+index 45527e6b52db..3b2617aa6bcd 100644
+--- a/net/sched/cls_api.c
++++ b/net/sched/cls_api.c
+@@ -549,6 +549,7 @@ void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
+ tcf_tree_lock(tp);
+ list_splice_init(&dst->actions, &tmp);
+ list_splice(&src->actions, &dst->actions);
++ dst->type = src->type;
+ tcf_tree_unlock(tp);
+ tcf_action_destroy(&tmp, TCA_ACT_UNBIND);
+ #endif
+diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
+index 5170a1ff95a1..7194fe8589b0 100644
+--- a/net/sctp/sm_statefuns.c
++++ b/net/sctp/sm_statefuns.c
+@@ -1775,9 +1775,22 @@ static sctp_disposition_t sctp_sf_do_dupcook_a(struct net *net,
+ /* Update the content of current association. */
+ sctp_add_cmd_sf(commands, SCTP_CMD_UPDATE_ASSOC, SCTP_ASOC(new_asoc));
+ sctp_add_cmd_sf(commands, SCTP_CMD_EVENT_ULP, SCTP_ULPEVENT(ev));
+- sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
+- SCTP_STATE(SCTP_STATE_ESTABLISHED));
+- sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl));
++ if (sctp_state(asoc, SHUTDOWN_PENDING) &&
++ (sctp_sstate(asoc->base.sk, CLOSING) ||
++ sock_flag(asoc->base.sk, SOCK_DEAD))) {
++ /* if we're currently in SHUTDOWN_PENDING, but the socket
++ * has been closed by user, don't transition to ESTABLISHED.
++ * Instead trigger SHUTDOWN bundled with COOKIE_ACK.
++ */
++ sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl));
++ return sctp_sf_do_9_2_start_shutdown(net, ep, asoc,
++ SCTP_ST_CHUNK(0), NULL,
++ commands);
++ } else {
++ sctp_add_cmd_sf(commands, SCTP_CMD_NEW_STATE,
++ SCTP_STATE(SCTP_STATE_ESTABLISHED));
++ sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(repl));
++ }
+ return SCTP_DISPOSITION_CONSUME;
+
+ nomem_ev:
+diff --git a/net/tipc/port.h b/net/tipc/port.h
+index cf4ca5b1d9a4..3f34cac07a2c 100644
+--- a/net/tipc/port.h
++++ b/net/tipc/port.h
+@@ -229,9 +229,12 @@ static inline int tipc_port_importance(struct tipc_port *port)
+ return msg_importance(&port->phdr);
+ }
+
+-static inline void tipc_port_set_importance(struct tipc_port *port, int imp)
++static inline int tipc_port_set_importance(struct tipc_port *port, int imp)
+ {
++ if (imp > TIPC_CRITICAL_IMPORTANCE)
++ return -EINVAL;
+ msg_set_importance(&port->phdr, (u32)imp);
++ return 0;
+ }
+
+ #endif
+diff --git a/net/tipc/socket.c b/net/tipc/socket.c
+index ef0475568f9e..4093fd81edd5 100644
+--- a/net/tipc/socket.c
++++ b/net/tipc/socket.c
+@@ -1841,7 +1841,7 @@ static int tipc_setsockopt(struct socket *sock, int lvl, int opt,
+
+ switch (opt) {
+ case TIPC_IMPORTANCE:
+- tipc_port_set_importance(port, value);
++ res = tipc_port_set_importance(port, value);
+ break;
+ case TIPC_SRC_DROPPABLE:
+ if (sock->type != SOCK_STREAM)
+diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
+index 0525d78ba328..93e755b97486 100644
+--- a/net/xfrm/xfrm_policy.c
++++ b/net/xfrm/xfrm_policy.c
+@@ -39,6 +39,11 @@
+ #define XFRM_QUEUE_TMO_MAX ((unsigned)(60*HZ))
+ #define XFRM_MAX_QUEUE_LEN 100
+
++struct xfrm_flo {
++ struct dst_entry *dst_orig;
++ u8 flags;
++};
++
+ static DEFINE_SPINLOCK(xfrm_policy_afinfo_lock);
+ static struct xfrm_policy_afinfo __rcu *xfrm_policy_afinfo[NPROTO]
+ __read_mostly;
+@@ -1877,13 +1882,14 @@ static int xdst_queue_output(struct sock *sk, struct sk_buff *skb)
+ }
+
+ static struct xfrm_dst *xfrm_create_dummy_bundle(struct net *net,
+- struct dst_entry *dst,
++ struct xfrm_flo *xflo,
+ const struct flowi *fl,
+ int num_xfrms,
+ u16 family)
+ {
+ int err;
+ struct net_device *dev;
++ struct dst_entry *dst;
+ struct dst_entry *dst1;
+ struct xfrm_dst *xdst;
+
+@@ -1891,9 +1897,12 @@ static struct xfrm_dst *xfrm_create_dummy_bundle(struct net *net,
+ if (IS_ERR(xdst))
+ return xdst;
+
+- if (net->xfrm.sysctl_larval_drop || num_xfrms <= 0)
++ if (!(xflo->flags & XFRM_LOOKUP_QUEUE) ||
++ net->xfrm.sysctl_larval_drop ||
++ num_xfrms <= 0)
+ return xdst;
+
++ dst = xflo->dst_orig;
+ dst1 = &xdst->u.dst;
+ dst_hold(dst);
+ xdst->route = dst;
+@@ -1935,7 +1944,7 @@ static struct flow_cache_object *
+ xfrm_bundle_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir,
+ struct flow_cache_object *oldflo, void *ctx)
+ {
+- struct dst_entry *dst_orig = (struct dst_entry *)ctx;
++ struct xfrm_flo *xflo = (struct xfrm_flo *)ctx;
+ struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX];
+ struct xfrm_dst *xdst, *new_xdst;
+ int num_pols = 0, num_xfrms = 0, i, err, pol_dead;
+@@ -1976,7 +1985,8 @@ xfrm_bundle_lookup(struct net *net, const struct flowi *fl, u16 family, u8 dir,
+ goto make_dummy_bundle;
+ }
+
+- new_xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family, dst_orig);
++ new_xdst = xfrm_resolve_and_create_bundle(pols, num_pols, fl, family,
++ xflo->dst_orig);
+ if (IS_ERR(new_xdst)) {
+ err = PTR_ERR(new_xdst);
+ if (err != -EAGAIN)
+@@ -2010,7 +2020,7 @@ make_dummy_bundle:
+ /* We found policies, but there's no bundles to instantiate:
+ * either because the policy blocks, has no transformations or
+ * we could not build template (no xfrm_states).*/
+- xdst = xfrm_create_dummy_bundle(net, dst_orig, fl, num_xfrms, family);
++ xdst = xfrm_create_dummy_bundle(net, xflo, fl, num_xfrms, family);
+ if (IS_ERR(xdst)) {
+ xfrm_pols_put(pols, num_pols);
+ return ERR_CAST(xdst);
+@@ -2104,13 +2114,18 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+ }
+
+ if (xdst == NULL) {
++ struct xfrm_flo xflo;
++
++ xflo.dst_orig = dst_orig;
++ xflo.flags = flags;
++
+ /* To accelerate a bit... */
+ if ((dst_orig->flags & DST_NOXFRM) ||
+ !net->xfrm.policy_count[XFRM_POLICY_OUT])
+ goto nopol;
+
+ flo = flow_cache_lookup(net, fl, family, dir,
+- xfrm_bundle_lookup, dst_orig);
++ xfrm_bundle_lookup, &xflo);
+ if (flo == NULL)
+ goto nopol;
+ if (IS_ERR(flo)) {
+@@ -2138,7 +2153,7 @@ struct dst_entry *xfrm_lookup(struct net *net, struct dst_entry *dst_orig,
+ xfrm_pols_put(pols, drop_pols);
+ XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTNOSTATES);
+
+- return make_blackhole(net, family, dst_orig);
++ return ERR_PTR(-EREMOTE);
+ }
+
+ err = -EAGAIN;
+@@ -2195,6 +2210,23 @@ dropdst:
+ }
+ EXPORT_SYMBOL(xfrm_lookup);
+
++/* Callers of xfrm_lookup_route() must ensure a call to dst_output().
++ * Otherwise we may send out blackholed packets.
++ */
++struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig,
++ const struct flowi *fl,
++ struct sock *sk, int flags)
++{
++ struct dst_entry *dst = xfrm_lookup(net, dst_orig, fl, sk,
++ flags | XFRM_LOOKUP_QUEUE);
++
++ if (IS_ERR(dst) && PTR_ERR(dst) == -EREMOTE)
++ return make_blackhole(net, dst_orig->ops->family, dst_orig);
++
++ return dst;
++}
++EXPORT_SYMBOL(xfrm_lookup_route);
++
+ static inline int
+ xfrm_secpath_reject(int idx, struct sk_buff *skb, const struct flowi *fl)
+ {
+@@ -2460,7 +2492,7 @@ int __xfrm_route_forward(struct sk_buff *skb, unsigned short family)
+
+ skb_dst_force(skb);
+
+- dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, 0);
++ dst = xfrm_lookup(net, skb_dst(skb), &fl, NULL, XFRM_LOOKUP_QUEUE);
+ if (IS_ERR(dst)) {
+ res = 0;
+ dst = NULL;
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-10-30 19:29 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-10-30 19:29 UTC (permalink / raw
To: gentoo-commits
commit: 5ca4fd40116dd22e8caab91c470be1860fe0141d
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Thu Oct 30 19:29:00 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Thu Oct 30 19:29:00 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=5ca4fd40
Linux patch 3.16.7
---
0000_README | 4 +
1006_linux-3.16.7.patch | 6873 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 6877 insertions(+)
diff --git a/0000_README b/0000_README
index a7526a7..9bf3b17 100644
--- a/0000_README
+++ b/0000_README
@@ -66,6 +66,10 @@ Patch: 1005_linux-3.16.6.patch
From: http://www.kernel.org
Desc: Linux 3.16.6
+Patch: 1006_linux-3.16.7.patch
+From: http://www.kernel.org
+Desc: Linux 3.16.7
+
Patch: 1500_XATTR_USER_PREFIX.patch
From: https://bugs.gentoo.org/show_bug.cgi?id=470644
Desc: Support for namespace user.pax.* on tmpfs.
diff --git a/1006_linux-3.16.7.patch b/1006_linux-3.16.7.patch
new file mode 100644
index 0000000..9776e1b
--- /dev/null
+++ b/1006_linux-3.16.7.patch
@@ -0,0 +1,6873 @@
+diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
+new file mode 100644
+index 000000000000..ea45dd3901e3
+--- /dev/null
++++ b/Documentation/lzo.txt
+@@ -0,0 +1,164 @@
++
++LZO stream format as understood by Linux's LZO decompressor
++===========================================================
++
++Introduction
++
++ This is not a specification. No specification seems to be publicly available
++ for the LZO stream format. This document describes what input format the LZO
++ decompressor as implemented in the Linux kernel understands. The file under
++ analysis is lib/lzo/lzo1x_decompress_safe.c. No analysis was made of the
++ compressor or of any other implementation, though it seems likely that
++ the format matches the standard one. The purpose of this document is to
++ better understand what the code does in order to propose more efficient fixes
++ for future bug reports.
++
++Description
++
++ The stream is composed of a series of instructions, operands, and data. The
++ instructions consist of a few bits representing an opcode, and bits forming
++ the operands for the instruction, whose size and position depend on the
++ opcode and on the number of literals copied by the previous instruction. The
++ operands are used to indicate :
++
++ - a distance when copying data from the dictionary (past output buffer)
++ - a length (number of bytes to copy from dictionary)
++ - the number of literals to copy, which is retained in variable "state"
++ as a piece of information for the next instructions.
++
++ Depending on the opcode and operands, extra data may optionally follow. This
++ extra data can complement the operand (e.g. a length or a distance encoded
++ over a larger range), or be a literal to be copied to the output buffer.
++
++ The first byte of the block follows a different encoding from other bytes; it
++ seems to be optimized for literal use only, since there is no dictionary yet
++ prior to that byte.
++
++ Lengths are always encoded with a variable size, starting with a small number
++ of bits in the operand. If the number of bits isn't enough to represent the
++ length, up to 255 may be added in increments by consuming more bytes with a
++ rate of at most 255 per extra byte (thus the compression ratio cannot exceed
++ around 255:1). The variable length encoding using #bits is always the same :
++
++ length = byte & ((1 << #bits) - 1)
++ if (!length) {
++ length = ((1 << #bits) - 1)
++ length += 255*(number of zero bytes)
++ length += first-non-zero-byte
++ }
++ length += constant (generally 2 or 3)
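As a sanity check, the variable-length scheme above can be sketched in a few lines of Python. This is a hypothetical helper, not kernel code; the `bits` and `constant` values depend on the opcode as described in the text.

```python
def decode_length(byte, bits, data, pos, constant):
    """Decode an LZO variable-size length field.

    `byte` is the instruction byte, `bits` the number of low bits that
    hold the length, `data`/`pos` the bytes that follow in the stream,
    `constant` the opcode-specific offset (generally 2 or 3).
    Returns (length, new_pos).
    """
    length = byte & ((1 << bits) - 1)
    if length == 0:
        length = (1 << bits) - 1
        # each zero byte consumed adds 255 to the length
        while data[pos] == 0:
            length += 255
            pos += 1
        length += data[pos]  # the first non-zero byte ends the run
        pos += 1
    return length + constant, pos
```

With 5 length bits and a constant of 2, the byte 0x25 decodes directly to 7, while a zero operand falls back to consuming extra stream bytes, which is why the ratio cannot exceed roughly 255:1.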
++
++ For references to the dictionary, distances are relative to the output
++ pointer. Distances are encoded using very few bits belonging to certain
++ ranges, resulting in multiple copy instructions using different encodings.
++ Certain encodings involve one extra byte, others involve two extra bytes
++ forming a little-endian 16-bit quantity (marked LE16 below).
++
++ After any instruction except the large literal copy, 0, 1, 2 or 3 literals
++ are copied before starting the next instruction. The number of literals that
++ were copied may change the meaning and behaviour of the next instruction. In
++ practice, only one instruction needs to know whether 0, less than 4, or more
++ literals were copied. This is the information stored in the <state> variable
++ in this implementation. This number of immediate literals to be copied is
++ generally encoded in the last two bits of the instruction but may also be
++ taken from the last two bits of an extra operand (eg: distance).
++
++ End of stream is declared when a block copy of distance 0 is seen. Only one
++ instruction may encode this distance (0001HLLL); it takes one LE16 operand
++ for the distance, thus requiring 3 bytes.
++
++ IMPORTANT NOTE : in the code some length checks are missing because certain
++ instructions are called under the assumption that a certain number of bytes
++ follow because it has already been guaranteed before parsing the instructions.
++ They just have to "refill" this credit if they consume extra bytes. This is
++ an implementation design choice independent of the algorithm or encoding.
++
++Byte sequences
++
++ First byte encoding :
++
++ 0..17 : follow regular instruction encoding, see below. It is worth
++ noting that codes 16 and 17 will represent a block copy from
++ the dictionary, which is empty, and that they will always be
++ invalid at this point.
++
++ 18..21 : copy 0..3 literals
++ state = (byte - 17) = 0..3 [ copy <state> literals ]
++ skip byte
++
++ 22..255 : copy literal string
++ length = (byte - 17) = 4..238
++ state = 4 [ don't copy extra literals ]
++ skip byte
++
++ Instruction encoding :
++
++ 0 0 0 0 X X X X (0..15)
++ Depends on the number of literals copied by the last instruction.
++ If the last instruction did not copy any literals (state == 0), this
++ encoding will be a copy of 4 or more literals, and must be interpreted
++ like this :
++
++ 0 0 0 0 L L L L (0..15) : copy long literal string
++ length = 3 + (L ?: 15 + (zero_bytes * 255) + non_zero_byte)
++ state = 4 (no extra literals are copied)
++
++ If the last instruction copied between 1 and 3 literals (encoded in
++ the instruction's opcode or distance), the instruction is a copy of a
++ 2-byte block from the dictionary within a 1kB distance. It is worth
++ noting that this instruction provides little savings since it uses 2
++ bytes to encode a copy of 2 other bytes but it encodes the number of
++ following literals for free. It must be interpreted like this :
++
++ 0 0 0 0 D D S S (0..15) : copy 2 bytes from <= 1kB distance
++ length = 2
++ state = S (copy S literals after this block)
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 2) + D + 1
++
++ If the last instruction copied 4 or more literals (as detected by
++ state == 4), the instruction becomes a copy of a 3-byte block from the
++ dictionary from a 2..3kB distance, and must be interpreted like this :
++
++ 0 0 0 0 D D S S (0..15) : copy 3 bytes from 2..3 kB distance
++ length = 3
++ state = S (copy S literals after this block)
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 2) + D + 2049
++
++ 0 0 0 1 H L L L (16..31)
++ Copy of a block within 16..48kB distance (preferably less than 10B)
++ length = 2 + (L ?: 7 + (zero_bytes * 255) + non_zero_byte)
++ Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S
++ distance = 16384 + (H << 14) + D
++ state = S (copy S literals after this block)
++ End of stream is reached if distance == 16384
++
++ 0 0 1 L L L L L (32..63)
++ Copy of small block within 16kB distance (preferably less than 34B)
++ length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
++ Always followed by exactly one LE16 : D D D D D D D D : D D D D D D S S
++ distance = D + 1
++ state = S (copy S literals after this block)
++
++ 0 1 L D D D S S (64..127)
++ Copy 3-4 bytes from block within 2kB distance
++ state = S (copy S literals after this block)
++ length = 3 + L
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 3) + D + 1
++
++ 1 L L D D D S S (128..255)
++ Copy 5-8 bytes from block within 2kB distance
++ state = S (copy S literals after this block)
++ length = 5 + L
++ Always followed by exactly one byte : H H H H H H H H
++ distance = (H << 3) + D + 1
++
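The first-byte rules at the top of the table can be restated as a small dispatcher. This is a hedged sketch that mirrors the formulas given in the text, not the kernel decompressor; the function and label names are made up for illustration.

```python
def classify_first_byte(byte):
    """Map the first byte of an LZO stream to its role, following the
    byte - 17 formulas given in the text. Returns (kind, value)."""
    if byte <= 17:
        # regular instruction encoding; 16 and 17 would reference the
        # (still empty) dictionary and are invalid here
        return ("instruction", None)
    if byte <= 21:
        return ("copy-literals", byte - 17)   # short literal copy
    return ("literal-run", byte - 17)         # long literal string
```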
++Authors
++
++ This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
++ analysis of the decompression code available in Linux 3.16-rc5. The code is
++ tricky, it is possible that this document contains mistakes or that a few
++ corner cases were overlooked. In any case, please report any doubts, fixes, or
++ proposed updates to the author(s) so that the document can be updated.
+diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
+index 290894176142..53838d9c6295 100644
+--- a/Documentation/virtual/kvm/mmu.txt
++++ b/Documentation/virtual/kvm/mmu.txt
+@@ -425,6 +425,20 @@ fault through the slow path.
+ Since only 19 bits are used to store generation-number on mmio spte, all
+ pages are zapped when there is an overflow.
+
++Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
++times; the last access happens when the generation number is retrieved and
++stored into the MMIO spte. Thus, the MMIO spte might be created based on
++out-of-date information, but with an up-to-date generation number.
++
++To avoid this, the generation number is incremented again after synchronize_srcu
++returns; thus, the low bit of kvm_memslots(kvm)->generation is only 1 during a
++memslot update, while some SRCU readers might be using the old copy. We do not
++want to use MMIO sptes created with an odd generation number, and we can do
++this without losing a bit in the MMIO spte. The low bit of the generation
++is not stored in MMIO spte, and presumed zero when it is extracted out of the
++spte. If KVM is unlucky and creates an MMIO spte while the low bit is 1,
++the next access to the spte will always be a cache miss.
++
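The low-bit trick can be modelled in a few lines. This is a toy illustration under the assumptions above, not the KVM implementation.

```python
def spte_gen_at_creation(gen):
    # The low bit of the generation is not stored in the MMIO spte,
    # and is presumed zero when extracted back out.
    return gen & ~1

def spte_matches(stored_gen, current_gen):
    # A spte created while the low bit was 1 (memslot update in
    # progress) can therefore never match a later generation check:
    # the access misses and takes the slow path.
    return stored_gen == current_gen
```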
+
+ Further reading
+ ===============
+diff --git a/Makefile b/Makefile
+index 5c4bc3fc18c0..29ba21cde7c0 100644
+--- a/Makefile
++++ b/Makefile
+@@ -1,6 +1,6 @@
+ VERSION = 3
+ PATCHLEVEL = 16
+-SUBLEVEL = 6
++SUBLEVEL = 7
+ EXTRAVERSION =
+ NAME = Museum of Fishiegoodies
+
+diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile
+index adb5ed9e269e..c04db0ae0895 100644
+--- a/arch/arm/boot/dts/Makefile
++++ b/arch/arm/boot/dts/Makefile
+@@ -137,8 +137,8 @@ kirkwood := \
+ kirkwood-openrd-client.dtb \
+ kirkwood-openrd-ultimate.dtb \
+ kirkwood-rd88f6192.dtb \
+- kirkwood-rd88f6281-a0.dtb \
+- kirkwood-rd88f6281-a1.dtb \
++ kirkwood-rd88f6281-z0.dtb \
++ kirkwood-rd88f6281-a.dtb \
+ kirkwood-rs212.dtb \
+ kirkwood-rs409.dtb \
+ kirkwood-rs411.dtb \
+diff --git a/arch/arm/boot/dts/armada-370-netgear-rn102.dts b/arch/arm/boot/dts/armada-370-netgear-rn102.dts
+index d6d572e5af32..285524fb915e 100644
+--- a/arch/arm/boot/dts/armada-370-netgear-rn102.dts
++++ b/arch/arm/boot/dts/armada-370-netgear-rn102.dts
+@@ -143,6 +143,10 @@
+ marvell,nand-enable-arbiter;
+ nand-on-flash-bbt;
+
++ /* Use Hardware BCH ECC */
++ nand-ecc-strength = <4>;
++ nand-ecc-step-size = <512>;
++
+ partition@0 {
+ label = "u-boot";
+ reg = <0x0000000 0x180000>; /* 1.5MB */
+diff --git a/arch/arm/boot/dts/armada-370-netgear-rn104.dts b/arch/arm/boot/dts/armada-370-netgear-rn104.dts
+index c5fe8b5dcdc7..4ec1ce561d34 100644
+--- a/arch/arm/boot/dts/armada-370-netgear-rn104.dts
++++ b/arch/arm/boot/dts/armada-370-netgear-rn104.dts
+@@ -145,6 +145,10 @@
+ marvell,nand-enable-arbiter;
+ nand-on-flash-bbt;
+
++ /* Use Hardware BCH ECC */
++ nand-ecc-strength = <4>;
++ nand-ecc-step-size = <512>;
++
+ partition@0 {
+ label = "u-boot";
+ reg = <0x0000000 0x180000>; /* 1.5MB */
+diff --git a/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts b/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts
+index 0cf999abc4ed..c5ed85a70ed9 100644
+--- a/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts
++++ b/arch/arm/boot/dts/armada-xp-netgear-rn2120.dts
+@@ -223,6 +223,10 @@
+ marvell,nand-enable-arbiter;
+ nand-on-flash-bbt;
+
++ /* Use Hardware BCH ECC */
++ nand-ecc-strength = <4>;
++ nand-ecc-step-size = <512>;
++
+ partition@0 {
+ label = "u-boot";
+ reg = <0x0000000 0x180000>; /* 1.5MB */
+diff --git a/arch/arm/boot/dts/at91sam9263.dtsi b/arch/arm/boot/dts/at91sam9263.dtsi
+index fece8665fb63..b8f234bf7de8 100644
+--- a/arch/arm/boot/dts/at91sam9263.dtsi
++++ b/arch/arm/boot/dts/at91sam9263.dtsi
+@@ -535,6 +535,7 @@
+ compatible = "atmel,hsmci";
+ reg = <0xfff80000 0x600>;
+ interrupts = <10 IRQ_TYPE_LEVEL_HIGH 0>;
++ pinctrl-names = "default";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ status = "disabled";
+@@ -544,6 +545,7 @@
+ compatible = "atmel,hsmci";
+ reg = <0xfff84000 0x600>;
+ interrupts = <11 IRQ_TYPE_LEVEL_HIGH 0>;
++ pinctrl-names = "default";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ status = "disabled";
+diff --git a/arch/arm/boot/dts/imx28-evk.dts b/arch/arm/boot/dts/imx28-evk.dts
+index e4cc44c98585..41a983405e7d 100644
+--- a/arch/arm/boot/dts/imx28-evk.dts
++++ b/arch/arm/boot/dts/imx28-evk.dts
+@@ -193,7 +193,6 @@
+ i2c0: i2c@80058000 {
+ pinctrl-names = "default";
+ pinctrl-0 = <&i2c0_pins_a>;
+- clock-frequency = <400000>;
+ status = "okay";
+
+ sgtl5000: codec@0a {
+diff --git a/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts b/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts
+index 8f76d28759a3..f82827d6fcff 100644
+--- a/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts
++++ b/arch/arm/boot/dts/kirkwood-mv88f6281gtw-ge.dts
+@@ -123,11 +123,11 @@
+
+ dsa@0 {
+ compatible = "marvell,dsa";
+- #address-cells = <2>;
++ #address-cells = <1>;
+ #size-cells = <0>;
+
+- dsa,ethernet = <ð0>;
+- dsa,mii-bus = <ðphy0>;
++ dsa,ethernet = <ð0port>;
++ dsa,mii-bus = <&mdio>;
+
+ switch@0 {
+ #address-cells = <1>;
+@@ -169,17 +169,13 @@
+
+ &mdio {
+ status = "okay";
+-
+- ethphy0: ethernet-phy@ff {
+- reg = <0xff>; /* No phy attached */
+- speed = <1000>;
+- duplex = <1>;
+- };
+ };
+
+ ð0 {
+ status = "okay";
++
+ ethernet0-port@0 {
+- phy-handle = <ðphy0>;
++ speed = <1000>;
++ duplex = <1>;
+ };
+ };
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-a.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-a.dts
+new file mode 100644
+index 000000000000..f2e08b3b33ea
+--- /dev/null
++++ b/arch/arm/boot/dts/kirkwood-rd88f6281-a.dts
+@@ -0,0 +1,43 @@
++/*
++ * Marvell RD88F6181 A Board description
++ *
++ * Andrew Lunn <andrew@lunn.ch>
++ *
++ * This file is licensed under the terms of the GNU General Public
++ * License version 2. This program is licensed "as is" without any
++ * warranty of any kind, whether express or implied.
++ *
++ * This file contains the definitions for the board with the A0 or
++ * higher stepping of the SoC. The ethernet switch does not have a
++ * "wan" port.
++ */
++
++/dts-v1/;
++#include "kirkwood-rd88f6281.dtsi"
++
++/ {
++ model = "Marvell RD88f6281 Reference design, with A0 or higher SoC";
++ compatible = "marvell,rd88f6281-a", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
++
++ dsa@0 {
++ switch@0 {
++ reg = <10 0>; /* MDIO address 10, switch 0 in tree */
++ };
++ };
++};
++
++&mdio {
++ status = "okay";
++
++ ethphy1: ethernet-phy@11 {
++ reg = <11>;
++ };
++};
++
++ð1 {
++ status = "okay";
++
++ ethernet1-port@0 {
++ phy-handle = <ðphy1>;
++ };
++};
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-a0.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-a0.dts
+deleted file mode 100644
+index a803bbb70bc8..000000000000
+--- a/arch/arm/boot/dts/kirkwood-rd88f6281-a0.dts
++++ /dev/null
+@@ -1,26 +0,0 @@
+-/*
+- * Marvell RD88F6181 A0 Board descrition
+- *
+- * Andrew Lunn <andrew@lunn.ch>
+- *
+- * This file is licensed under the terms of the GNU General Public
+- * License version 2. This program is licensed "as is" without any
+- * warranty of any kind, whether express or implied.
+- *
+- * This file contains the definitions for the board with the A0 variant of
+- * the SoC. The ethernet switch does not have a "wan" port.
+- */
+-
+-/dts-v1/;
+-#include "kirkwood-rd88f6281.dtsi"
+-
+-/ {
+- model = "Marvell RD88f6281 Reference design, with A0 SoC";
+- compatible = "marvell,rd88f6281-a0", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
+-
+- dsa@0 {
+- switch@0 {
+- reg = <10 0>; /* MDIO address 10, switch 0 in tree */
+- };
+- };
+-};
+\ No newline at end of file
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-a1.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-a1.dts
+deleted file mode 100644
+index baeebbf1d8c7..000000000000
+--- a/arch/arm/boot/dts/kirkwood-rd88f6281-a1.dts
++++ /dev/null
+@@ -1,31 +0,0 @@
+-/*
+- * Marvell RD88F6181 A1 Board descrition
+- *
+- * Andrew Lunn <andrew@lunn.ch>
+- *
+- * This file is licensed under the terms of the GNU General Public
+- * License version 2. This program is licensed "as is" without any
+- * warranty of any kind, whether express or implied.
+- *
+- * This file contains the definitions for the board with the A1 variant of
+- * the SoC. The ethernet switch has a "wan" port.
+- */
+-
+-/dts-v1/;
+-
+-#include "kirkwood-rd88f6281.dtsi"
+-
+-/ {
+- model = "Marvell RD88f6281 Reference design, with A1 SoC";
+- compatible = "marvell,rd88f6281-a1", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
+-
+- dsa@0 {
+- switch@0 {
+- reg = <0 0>; /* MDIO address 0, switch 0 in tree */
+- port@4 {
+- reg = <4>;
+- label = "wan";
+- };
+- };
+- };
+-};
+\ No newline at end of file
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281-z0.dts b/arch/arm/boot/dts/kirkwood-rd88f6281-z0.dts
+new file mode 100644
+index 000000000000..f4272b64ed7f
+--- /dev/null
++++ b/arch/arm/boot/dts/kirkwood-rd88f6281-z0.dts
+@@ -0,0 +1,35 @@
++/*
++ * Marvell RD88F6181 Z0 stepping description
++ *
++ * Andrew Lunn <andrew@lunn.ch>
++ *
++ * This file is licensed under the terms of the GNU General Public
++ * License version 2. This program is licensed "as is" without any
++ * warranty of any kind, whether express or implied.
++ *
++ * This file contains the definitions for the board using the Z0
++ * stepping of the SoC. The ethernet switch has a "wan" port.
++*/
++
++/dts-v1/;
++
++#include "kirkwood-rd88f6281.dtsi"
++
++/ {
++ model = "Marvell RD88f6281 Reference design, with Z0 SoC";
++ compatible = "marvell,rd88f6281-z0", "marvell,rd88f6281","marvell,kirkwood-88f6281", "marvell,kirkwood";
++
++ dsa@0 {
++ switch@0 {
++ reg = <0 0>; /* MDIO address 0, switch 0 in tree */
++ port@4 {
++ reg = <4>;
++ label = "wan";
++ };
++ };
++ };
++};
++
++ð1 {
++ status = "disabled";
++};
+diff --git a/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi b/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi
+index 26cf0e0ccefd..d195e884b3b5 100644
+--- a/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi
++++ b/arch/arm/boot/dts/kirkwood-rd88f6281.dtsi
+@@ -37,7 +37,6 @@
+
+ ocp@f1000000 {
+ pinctrl: pin-controller@10000 {
+- pinctrl-0 = <&pmx_sdio_cd>;
+ pinctrl-names = "default";
+
+ pmx_sdio_cd: pmx-sdio-cd {
+@@ -69,8 +68,8 @@
+ #address-cells = <2>;
+ #size-cells = <0>;
+
+- dsa,ethernet = <ð0>;
+- dsa,mii-bus = <ðphy1>;
++ dsa,ethernet = <ð0port>;
++ dsa,mii-bus = <&mdio>;
+
+ switch@0 {
+ #address-cells = <1>;
+@@ -119,35 +118,19 @@
+ };
+
+ partition@300000 {
+- label = "data";
++ label = "rootfs";
+ reg = <0x0300000 0x500000>;
+ };
+ };
+
+ &mdio {
+ status = "okay";
+-
+- ethphy0: ethernet-phy@0 {
+- reg = <0>;
+- };
+-
+- ethphy1: ethernet-phy@ff {
+- reg = <0xff>; /* No PHY attached */
+- speed = <1000>;
+- duple = <1>;
+- };
+ };
+
+ ð0 {
+ status = "okay";
+ ethernet0-port@0 {
+- phy-handle = <ðphy0>;
+- };
+-};
+-
+-ð1 {
+- status = "okay";
+- ethernet1-port@0 {
+- phy-handle = <ðphy1>;
++ speed = <1000>;
++ duplex = <1>;
+ };
+ };
+diff --git a/arch/arm/boot/dts/kirkwood.dtsi b/arch/arm/boot/dts/kirkwood.dtsi
+index afc640cd80c5..464f09a1a4a5 100644
+--- a/arch/arm/boot/dts/kirkwood.dtsi
++++ b/arch/arm/boot/dts/kirkwood.dtsi
+@@ -309,7 +309,7 @@
+ marvell,tx-checksum-limit = <1600>;
+ status = "disabled";
+
+- ethernet0-port@0 {
++ eth0port: ethernet0-port@0 {
+ compatible = "marvell,kirkwood-eth-port";
+ reg = <0>;
+ interrupts = <11>;
+@@ -342,7 +342,7 @@
+ pinctrl-names = "default";
+ status = "disabled";
+
+- ethernet1-port@0 {
++ eth1port: ethernet1-port@0 {
+ compatible = "marvell,kirkwood-eth-port";
+ reg = <0>;
+ interrupts = <15>;
+diff --git a/arch/arm/boot/dts/sama5d3_can.dtsi b/arch/arm/boot/dts/sama5d3_can.dtsi
+index a0775851cce5..eaf41451ad0c 100644
+--- a/arch/arm/boot/dts/sama5d3_can.dtsi
++++ b/arch/arm/boot/dts/sama5d3_can.dtsi
+@@ -40,7 +40,7 @@
+ atmel,clk-output-range = <0 66000000>;
+ };
+
+- can1_clk: can0_clk {
++ can1_clk: can1_clk {
+ #clock-cells = <0>;
+ reg = <41>;
+ atmel,clk-output-range = <0 66000000>;
+diff --git a/arch/arm/mach-at91/clock.c b/arch/arm/mach-at91/clock.c
+index 034529d801b2..d66f102c352a 100644
+--- a/arch/arm/mach-at91/clock.c
++++ b/arch/arm/mach-at91/clock.c
+@@ -962,6 +962,7 @@ static int __init at91_clock_reset(void)
+ }
+
+ at91_pmc_write(AT91_PMC_SCDR, scdr);
++ at91_pmc_write(AT91_PMC_PCDR, pcdr);
+ if (cpu_is_sama5d3())
+ at91_pmc_write(AT91_PMC_PCDR1, pcdr1);
+
+diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
+index 253e33bc94fb..56de5aadede2 100644
+--- a/arch/arm64/include/asm/compat.h
++++ b/arch/arm64/include/asm/compat.h
+@@ -37,8 +37,8 @@ typedef s32 compat_ssize_t;
+ typedef s32 compat_time_t;
+ typedef s32 compat_clock_t;
+ typedef s32 compat_pid_t;
+-typedef u32 __compat_uid_t;
+-typedef u32 __compat_gid_t;
++typedef u16 __compat_uid_t;
++typedef u16 __compat_gid_t;
+ typedef u16 __compat_uid16_t;
+ typedef u16 __compat_gid16_t;
+ typedef u32 __compat_uid32_t;
+diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
+index 9ce04ba6bcb0..8993a69099c7 100644
+--- a/arch/arm64/kernel/entry.S
++++ b/arch/arm64/kernel/entry.S
+@@ -298,7 +298,6 @@ el1_dbg:
+ mrs x0, far_el1
+ mov x2, sp // struct pt_regs
+ bl do_debug_exception
+- enable_dbg
+ kernel_exit 1
+ el1_inv:
+ // TODO: add support for undefined instructions in kernel mode
+diff --git a/arch/m68k/mm/hwtest.c b/arch/m68k/mm/hwtest.c
+index 2c7dde3c6430..2a5259fd23eb 100644
+--- a/arch/m68k/mm/hwtest.c
++++ b/arch/m68k/mm/hwtest.c
+@@ -28,9 +28,11 @@
+ int hwreg_present( volatile void *regp )
+ {
+ int ret = 0;
++ unsigned long flags;
+ long save_sp, save_vbr;
+ long tmp_vectors[3];
+
++ local_irq_save(flags);
+ __asm__ __volatile__
+ ( "movec %/vbr,%2\n\t"
+ "movel #Lberr1,%4@(8)\n\t"
+@@ -46,6 +48,7 @@ int hwreg_present( volatile void *regp )
+ : "=&d" (ret), "=&r" (save_sp), "=&r" (save_vbr)
+ : "a" (regp), "a" (tmp_vectors)
+ );
++ local_irq_restore(flags);
+
+ return( ret );
+ }
+@@ -58,9 +61,11 @@ EXPORT_SYMBOL(hwreg_present);
+ int hwreg_write( volatile void *regp, unsigned short val )
+ {
+ int ret;
++ unsigned long flags;
+ long save_sp, save_vbr;
+ long tmp_vectors[3];
+
++ local_irq_save(flags);
+ __asm__ __volatile__
+ ( "movec %/vbr,%2\n\t"
+ "movel #Lberr2,%4@(8)\n\t"
+@@ -78,6 +83,7 @@ int hwreg_write( volatile void *regp, unsigned short val )
+ : "=&d" (ret), "=&r" (save_sp), "=&r" (save_vbr)
+ : "a" (regp), "a" (tmp_vectors), "g" (val)
+ );
++ local_irq_restore(flags);
+
+ return( ret );
+ }
+diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
+index 94802d267022..b20f9d63a664 100644
+--- a/arch/powerpc/kernel/eeh_pe.c
++++ b/arch/powerpc/kernel/eeh_pe.c
+@@ -570,6 +570,8 @@ static void *__eeh_pe_state_clear(void *data, void *flag)
+ {
+ struct eeh_pe *pe = (struct eeh_pe *)data;
+ int state = *((int *)flag);
++ struct eeh_dev *edev, *tmp;
++ struct pci_dev *pdev;
+
+ /* Keep the state of permanently removed PE intact */
+ if ((pe->freeze_count > EEH_MAX_ALLOWED_FREEZES) &&
+@@ -578,9 +580,22 @@ static void *__eeh_pe_state_clear(void *data, void *flag)
+
+ pe->state &= ~state;
+
+- /* Clear check count since last isolation */
+- if (state & EEH_PE_ISOLATED)
+- pe->check_count = 0;
++ /*
++ * Special treatment on clearing isolated state. Clear
++ * check count since last isolation and put all affected
++ * devices to normal state.
++ */
++ if (!(state & EEH_PE_ISOLATED))
++ return NULL;
++
++ pe->check_count = 0;
++ eeh_pe_for_each_dev(pe, edev, tmp) {
++ pdev = eeh_dev_to_pci_dev(edev);
++ if (!pdev)
++ continue;
++
++ pdev->error_state = pci_channel_io_normal;
++ }
+
+ return NULL;
+ }
+diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
+index 4642d6a4d356..de1ec54a2a57 100644
+--- a/arch/powerpc/platforms/pseries/iommu.c
++++ b/arch/powerpc/platforms/pseries/iommu.c
+@@ -329,16 +329,16 @@ struct direct_window {
+
+ /* Dynamic DMA Window support */
+ struct ddw_query_response {
+- __be32 windows_available;
+- __be32 largest_available_block;
+- __be32 page_size;
+- __be32 migration_capable;
++ u32 windows_available;
++ u32 largest_available_block;
++ u32 page_size;
++ u32 migration_capable;
+ };
+
+ struct ddw_create_response {
+- __be32 liobn;
+- __be32 addr_hi;
+- __be32 addr_lo;
++ u32 liobn;
++ u32 addr_hi;
++ u32 addr_lo;
+ };
+
+ static LIST_HEAD(direct_window_list);
+@@ -725,16 +725,18 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
+ {
+ struct dynamic_dma_window_prop *dwp;
+ struct property *win64;
+- const u32 *ddw_avail;
++ u32 ddw_avail[3];
+ u64 liobn;
+- int len, ret = 0;
++ int ret = 0;
++
++ ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
++ &ddw_avail[0], 3);
+
+- ddw_avail = of_get_property(np, "ibm,ddw-applicable", &len);
+ win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
+ if (!win64)
+ return;
+
+- if (!ddw_avail || len < 3 * sizeof(u32) || win64->length < sizeof(*dwp))
++ if (ret || win64->length < sizeof(*dwp))
+ goto delprop;
+
+ dwp = win64->value;
+@@ -872,8 +874,9 @@ static int create_ddw(struct pci_dev *dev, const u32 *ddw_avail,
+
+ do {
+ /* extra outputs are LIOBN and dma-addr (hi, lo) */
+- ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create, cfg_addr,
+- BUID_HI(buid), BUID_LO(buid), page_shift, window_shift);
++ ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create,
++ cfg_addr, BUID_HI(buid), BUID_LO(buid),
++ page_shift, window_shift);
+ } while (rtas_busy_delay(ret));
+ dev_info(&dev->dev,
+ "ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
+@@ -910,7 +913,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ int page_shift;
+ u64 dma_addr, max_addr;
+ struct device_node *dn;
+- const u32 *uninitialized_var(ddw_avail);
++ u32 ddw_avail[3];
+ struct direct_window *window;
+ struct property *win64;
+ struct dynamic_dma_window_prop *ddwprop;
+@@ -942,8 +945,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ * for the given node in that order.
+ * the property is actually in the parent, not the PE
+ */
+- ddw_avail = of_get_property(pdn, "ibm,ddw-applicable", &len);
+- if (!ddw_avail || len < 3 * sizeof(u32))
++ ret = of_property_read_u32_array(pdn, "ibm,ddw-applicable",
++ &ddw_avail[0], 3);
++ if (ret)
+ goto out_failed;
+
+ /*
+@@ -966,11 +970,11 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ dev_dbg(&dev->dev, "no free dynamic windows");
+ goto out_failed;
+ }
+- if (be32_to_cpu(query.page_size) & 4) {
++ if (query.page_size & 4) {
+ page_shift = 24; /* 16MB */
+- } else if (be32_to_cpu(query.page_size) & 2) {
++ } else if (query.page_size & 2) {
+ page_shift = 16; /* 64kB */
+- } else if (be32_to_cpu(query.page_size) & 1) {
++ } else if (query.page_size & 1) {
+ page_shift = 12; /* 4kB */
+ } else {
+ dev_dbg(&dev->dev, "no supported direct page size in mask %x",
+@@ -980,7 +984,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ /* verify the window * number of ptes will map the partition */
+ /* check largest block * page size > max memory hotplug addr */
+ max_addr = memory_hotplug_max();
+- if (be32_to_cpu(query.largest_available_block) < (max_addr >> page_shift)) {
++ if (query.largest_available_block < (max_addr >> page_shift)) {
+ dev_dbg(&dev->dev, "can't map partiton max 0x%llx with %u "
+ "%llu-sized pages\n", max_addr, query.largest_available_block,
+ 1ULL << page_shift);
+@@ -1006,8 +1010,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ if (ret != 0)
+ goto out_free_prop;
+
+- ddwprop->liobn = create.liobn;
+- ddwprop->dma_base = cpu_to_be64(of_read_number(&create.addr_hi, 2));
++ ddwprop->liobn = cpu_to_be32(create.liobn);
++ ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
++ create.addr_lo);
+ ddwprop->tce_shift = cpu_to_be32(page_shift);
+ ddwprop->window_shift = cpu_to_be32(len);
+
+@@ -1039,7 +1044,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+ list_add(&window->list, &direct_window_list);
+ spin_unlock(&direct_window_list_lock);
+
+- dma_addr = of_read_number(&create.addr_hi, 2);
++ dma_addr = be64_to_cpu(ddwprop->dma_base);
+ goto out_unlock;
+
+ out_free_window:
+diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
+index 90c8de22a2a0..5d5ebd400162 100644
+--- a/arch/s390/kvm/interrupt.c
++++ b/arch/s390/kvm/interrupt.c
+@@ -85,6 +85,7 @@ static int __interrupt_is_deliverable(struct kvm_vcpu *vcpu,
+ return 0;
+ if (vcpu->arch.sie_block->gcr[0] & 0x2000ul)
+ return 1;
++ return 0;
+ case KVM_S390_INT_EMERGENCY:
+ if (psw_extint_disabled(vcpu))
+ return 0;
+diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
+index 407c87d9879a..db7d3bf4357e 100644
+--- a/arch/sparc/Kconfig
++++ b/arch/sparc/Kconfig
+@@ -67,6 +67,7 @@ config SPARC64
+ select HAVE_SYSCALL_TRACEPOINTS
+ select HAVE_CONTEXT_TRACKING
+ select HAVE_DEBUG_KMEMLEAK
++ select SPARSE_IRQ
+ select RTC_DRV_CMOS
+ select RTC_DRV_BQ4802
+ select RTC_DRV_SUN4V
+diff --git a/arch/sparc/include/asm/hypervisor.h b/arch/sparc/include/asm/hypervisor.h
+index 94b39caea3eb..4f6725ff4c33 100644
+--- a/arch/sparc/include/asm/hypervisor.h
++++ b/arch/sparc/include/asm/hypervisor.h
+@@ -2947,6 +2947,16 @@ unsigned long sun4v_vt_set_perfreg(unsigned long reg_num,
+ unsigned long reg_val);
+ #endif
+
++#define HV_FAST_T5_GET_PERFREG 0x1a8
++#define HV_FAST_T5_SET_PERFREG 0x1a9
++
++#ifndef __ASSEMBLY__
++unsigned long sun4v_t5_get_perfreg(unsigned long reg_num,
++ unsigned long *reg_val);
++unsigned long sun4v_t5_set_perfreg(unsigned long reg_num,
++ unsigned long reg_val);
++#endif
++
+ /* Function numbers for HV_CORE_TRAP. */
+ #define HV_CORE_SET_VER 0x00
+ #define HV_CORE_PUTCHAR 0x01
+@@ -2978,6 +2988,7 @@ unsigned long sun4v_vt_set_perfreg(unsigned long reg_num,
+ #define HV_GRP_VF_CPU 0x0205
+ #define HV_GRP_KT_CPU 0x0209
+ #define HV_GRP_VT_CPU 0x020c
++#define HV_GRP_T5_CPU 0x0211
+ #define HV_GRP_DIAG 0x0300
+
+ #ifndef __ASSEMBLY__
+diff --git a/arch/sparc/include/asm/irq_64.h b/arch/sparc/include/asm/irq_64.h
+index 91d219381306..3f70f900e834 100644
+--- a/arch/sparc/include/asm/irq_64.h
++++ b/arch/sparc/include/asm/irq_64.h
+@@ -37,7 +37,7 @@
+ *
+ * ino_bucket->irq allocation is made during {sun4v_,}build_irq().
+ */
+-#define NR_IRQS 255
++#define NR_IRQS (2048)
+
+ void irq_install_pre_handler(int irq,
+ void (*func)(unsigned int, void *, void *),
+@@ -57,11 +57,8 @@ unsigned int sun4u_build_msi(u32 portid, unsigned int *irq_p,
+ unsigned long iclr_base);
+ void sun4u_destroy_msi(unsigned int irq);
+
+-unsigned char irq_alloc(unsigned int dev_handle,
+- unsigned int dev_ino);
+-#ifdef CONFIG_PCI_MSI
++unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino);
+ void irq_free(unsigned int irq);
+-#endif
+
+ void __init init_IRQ(void);
+ void fixup_irqs(void);
+diff --git a/arch/sparc/include/asm/ldc.h b/arch/sparc/include/asm/ldc.h
+index c8c67f621f4f..58ab64de25d2 100644
+--- a/arch/sparc/include/asm/ldc.h
++++ b/arch/sparc/include/asm/ldc.h
+@@ -53,13 +53,14 @@ struct ldc_channel;
+ /* Allocate state for a channel. */
+ struct ldc_channel *ldc_alloc(unsigned long id,
+ const struct ldc_channel_config *cfgp,
+- void *event_arg);
++ void *event_arg,
++ const char *name);
+
+ /* Shut down and free state for a channel. */
+ void ldc_free(struct ldc_channel *lp);
+
+ /* Register TX and RX queues of the link with the hypervisor. */
+-int ldc_bind(struct ldc_channel *lp, const char *name);
++int ldc_bind(struct ldc_channel *lp);
+
+ /* For non-RAW protocols we need to complete a handshake before
+ * communication can proceed. ldc_connect() does that, if the
+diff --git a/arch/sparc/include/asm/oplib_64.h b/arch/sparc/include/asm/oplib_64.h
+index f34682430fcf..2e3a4add8591 100644
+--- a/arch/sparc/include/asm/oplib_64.h
++++ b/arch/sparc/include/asm/oplib_64.h
+@@ -62,7 +62,8 @@ struct linux_mem_p1275 {
+ /* You must call prom_init() before using any of the library services,
+ * preferably as early as possible. Pass it the romvec pointer.
+ */
+-void prom_init(void *cif_handler, void *cif_stack);
++void prom_init(void *cif_handler);
++void prom_init_report(void);
+
+ /* Boot argument acquisition, returns the boot command line string. */
+ char *prom_getbootargs(void);
+diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
+index bf109984a032..8c2a8c937540 100644
+--- a/arch/sparc/include/asm/page_64.h
++++ b/arch/sparc/include/asm/page_64.h
+@@ -57,18 +57,21 @@ void copy_user_page(void *to, void *from, unsigned long vaddr, struct page *topa
+ typedef struct { unsigned long pte; } pte_t;
+ typedef struct { unsigned long iopte; } iopte_t;
+ typedef struct { unsigned long pmd; } pmd_t;
++typedef struct { unsigned long pud; } pud_t;
+ typedef struct { unsigned long pgd; } pgd_t;
+ typedef struct { unsigned long pgprot; } pgprot_t;
+
+ #define pte_val(x) ((x).pte)
+ #define iopte_val(x) ((x).iopte)
+ #define pmd_val(x) ((x).pmd)
++#define pud_val(x) ((x).pud)
+ #define pgd_val(x) ((x).pgd)
+ #define pgprot_val(x) ((x).pgprot)
+
+ #define __pte(x) ((pte_t) { (x) } )
+ #define __iopte(x) ((iopte_t) { (x) } )
+ #define __pmd(x) ((pmd_t) { (x) } )
++#define __pud(x) ((pud_t) { (x) } )
+ #define __pgd(x) ((pgd_t) { (x) } )
+ #define __pgprot(x) ((pgprot_t) { (x) } )
+
+@@ -77,18 +80,21 @@ typedef struct { unsigned long pgprot; } pgprot_t;
+ typedef unsigned long pte_t;
+ typedef unsigned long iopte_t;
+ typedef unsigned long pmd_t;
++typedef unsigned long pud_t;
+ typedef unsigned long pgd_t;
+ typedef unsigned long pgprot_t;
+
+ #define pte_val(x) (x)
+ #define iopte_val(x) (x)
+ #define pmd_val(x) (x)
++#define pud_val(x) (x)
+ #define pgd_val(x) (x)
+ #define pgprot_val(x) (x)
+
+ #define __pte(x) (x)
+ #define __iopte(x) (x)
+ #define __pmd(x) (x)
++#define __pud(x) (x)
+ #define __pgd(x) (x)
+ #define __pgprot(x) (x)
+
+@@ -96,21 +102,14 @@ typedef unsigned long pgprot_t;
+
+ typedef pte_t *pgtable_t;
+
+-/* These two values define the virtual address space range in which we
+- * must forbid 64-bit user processes from making mappings. It used to
+- * represent precisely the virtual address space hole present in most
+- * early sparc64 chips including UltraSPARC-I. But now it also is
+- * further constrained by the limits of our page tables, which is
+- * 43-bits of virtual address.
+- */
+-#define SPARC64_VA_HOLE_TOP _AC(0xfffffc0000000000,UL)
+-#define SPARC64_VA_HOLE_BOTTOM _AC(0x0000040000000000,UL)
++extern unsigned long sparc64_va_hole_top;
++extern unsigned long sparc64_va_hole_bottom;
+
+ /* The next two defines specify the actual exclusion region we
+ * enforce, wherein we use a 4GB red zone on each side of the VA hole.
+ */
+-#define VA_EXCLUDE_START (SPARC64_VA_HOLE_BOTTOM - (1UL << 32UL))
+-#define VA_EXCLUDE_END (SPARC64_VA_HOLE_TOP + (1UL << 32UL))
++#define VA_EXCLUDE_START (sparc64_va_hole_bottom - (1UL << 32UL))
++#define VA_EXCLUDE_END (sparc64_va_hole_top + (1UL << 32UL))
+
+ #define TASK_UNMAPPED_BASE (test_thread_flag(TIF_32BIT) ? \
+ _AC(0x0000000070000000,UL) : \
+@@ -118,20 +117,16 @@ typedef pte_t *pgtable_t;
+
+ #include <asm-generic/memory_model.h>
+
+-#define PAGE_OFFSET_BY_BITS(X) (-(_AC(1,UL) << (X)))
+ extern unsigned long PAGE_OFFSET;
+
+ #endif /* !(__ASSEMBLY__) */
+
+-/* The maximum number of physical memory address bits we support, this
+- * is used to size various tables used to manage kernel TLB misses and
+- * also the sparsemem code.
++/* The maximum number of physical memory address bits we support. The
++ * largest value we can support is whatever "KPGD_SHIFT + KPTE_BITS"
++ * evaluates to.
+ */
+-#define MAX_PHYS_ADDRESS_BITS 47
++#define MAX_PHYS_ADDRESS_BITS 53
+
+-/* These two shift counts are used when indexing sparc64_valid_addr_bitmap
+- * and kpte_linear_bitmap.
+- */
+ #define ILOG2_4MB 22
+ #define ILOG2_256MB 28
+
+diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
+index 39a7ac49b00c..5e3187185b4a 100644
+--- a/arch/sparc/include/asm/pgalloc_64.h
++++ b/arch/sparc/include/asm/pgalloc_64.h
+@@ -15,6 +15,13 @@
+
+ extern struct kmem_cache *pgtable_cache;
+
++static inline void __pgd_populate(pgd_t *pgd, pud_t *pud)
++{
++ pgd_set(pgd, pud);
++}
++
++#define pgd_populate(MM, PGD, PUD) __pgd_populate(PGD, PUD)
++
+ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
+ {
+ return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+@@ -25,7 +32,23 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
+ kmem_cache_free(pgtable_cache, pgd);
+ }
+
+-#define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD)
++static inline void __pud_populate(pud_t *pud, pmd_t *pmd)
++{
++ pud_set(pud, pmd);
++}
++
++#define pud_populate(MM, PUD, PMD) __pud_populate(PUD, PMD)
++
++static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
++{
++ return kmem_cache_alloc(pgtable_cache,
++ GFP_KERNEL|__GFP_REPEAT);
++}
++
++static inline void pud_free(struct mm_struct *mm, pud_t *pud)
++{
++ kmem_cache_free(pgtable_cache, pud);
++}
+
+ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+ {
+@@ -91,4 +114,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pte_t *pte,
+ #define __pmd_free_tlb(tlb, pmd, addr) \
+ pgtable_free_tlb(tlb, pmd, false)
+
++#define __pud_free_tlb(tlb, pud, addr) \
++ pgtable_free_tlb(tlb, pud, false)
++
+ #endif /* _SPARC64_PGALLOC_H */
+diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
+index 3770bf5c6e1b..bfeb626085ac 100644
+--- a/arch/sparc/include/asm/pgtable_64.h
++++ b/arch/sparc/include/asm/pgtable_64.h
+@@ -20,8 +20,6 @@
+ #include <asm/page.h>
+ #include <asm/processor.h>
+
+-#include <asm-generic/pgtable-nopud.h>
+-
+ /* The kernel image occupies 0x4000000 to 0x6000000 (4MB --> 96MB).
+ * The page copy blockops can use 0x6000000 to 0x8000000.
+ * The 8K TSB is mapped in the 0x8000000 to 0x8400000 range.
+@@ -42,10 +40,7 @@
+ #define LOW_OBP_ADDRESS _AC(0x00000000f0000000,UL)
+ #define HI_OBP_ADDRESS _AC(0x0000000100000000,UL)
+ #define VMALLOC_START _AC(0x0000000100000000,UL)
+-#define VMALLOC_END _AC(0x0000010000000000,UL)
+-#define VMEMMAP_BASE _AC(0x0000010000000000,UL)
+-
+-#define vmemmap ((struct page *)VMEMMAP_BASE)
++#define VMEMMAP_BASE VMALLOC_END
+
+ /* PMD_SHIFT determines the size of the area a second-level page
+ * table can map
+@@ -55,13 +50,25 @@
+ #define PMD_MASK (~(PMD_SIZE-1))
+ #define PMD_BITS (PAGE_SHIFT - 3)
+
+-/* PGDIR_SHIFT determines what a third-level page table entry can map */
+-#define PGDIR_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-3) + PMD_BITS)
++/* PUD_SHIFT determines the size of the area a third-level page
++ * table can map
++ */
++#define PUD_SHIFT (PMD_SHIFT + PMD_BITS)
++#define PUD_SIZE (_AC(1,UL) << PUD_SHIFT)
++#define PUD_MASK (~(PUD_SIZE-1))
++#define PUD_BITS (PAGE_SHIFT - 3)
++
++/* PGDIR_SHIFT determines what a fourth-level page table entry can map */
++#define PGDIR_SHIFT (PUD_SHIFT + PUD_BITS)
+ #define PGDIR_SIZE (_AC(1,UL) << PGDIR_SHIFT)
+ #define PGDIR_MASK (~(PGDIR_SIZE-1))
+ #define PGDIR_BITS (PAGE_SHIFT - 3)
+
+-#if (PGDIR_SHIFT + PGDIR_BITS) != 43
++#if (MAX_PHYS_ADDRESS_BITS > PGDIR_SHIFT + PGDIR_BITS)
++#error MAX_PHYS_ADDRESS_BITS exceeds what kernel page tables can support
++#endif
++
++#if (PGDIR_SHIFT + PGDIR_BITS) != 53
+ #error Page table parameters do not cover virtual address space properly.
+ #endif
+
+@@ -71,28 +78,18 @@
+
+ #ifndef __ASSEMBLY__
+
+-#include <linux/sched.h>
+-
+-extern unsigned long sparc64_valid_addr_bitmap[];
++extern unsigned long VMALLOC_END;
+
+-/* Needs to be defined here and not in linux/mm.h, as it is arch dependent */
+-static inline bool __kern_addr_valid(unsigned long paddr)
+-{
+- if ((paddr >> MAX_PHYS_ADDRESS_BITS) != 0UL)
+- return false;
+- return test_bit(paddr >> ILOG2_4MB, sparc64_valid_addr_bitmap);
+-}
++#define vmemmap ((struct page *)VMEMMAP_BASE)
+
+-static inline bool kern_addr_valid(unsigned long addr)
+-{
+- unsigned long paddr = __pa(addr);
++#include <linux/sched.h>
+
+- return __kern_addr_valid(paddr);
+-}
++bool kern_addr_valid(unsigned long addr);
+
+ /* Entries per page directory level. */
+ #define PTRS_PER_PTE (1UL << (PAGE_SHIFT-3))
+ #define PTRS_PER_PMD (1UL << PMD_BITS)
++#define PTRS_PER_PUD (1UL << PUD_BITS)
+ #define PTRS_PER_PGD (1UL << PGDIR_BITS)
+
+ /* Kernel has a separate 44bit address space. */
+@@ -101,6 +98,9 @@ static inline bool kern_addr_valid(unsigned long addr)
+ #define pmd_ERROR(e) \
+ pr_err("%s:%d: bad pmd %p(%016lx) seen at (%pS)\n", \
+ __FILE__, __LINE__, &(e), pmd_val(e), __builtin_return_address(0))
++#define pud_ERROR(e) \
++ pr_err("%s:%d: bad pud %p(%016lx) seen at (%pS)\n", \
++ __FILE__, __LINE__, &(e), pud_val(e), __builtin_return_address(0))
+ #define pgd_ERROR(e) \
+ pr_err("%s:%d: bad pgd %p(%016lx) seen at (%pS)\n", \
+ __FILE__, __LINE__, &(e), pgd_val(e), __builtin_return_address(0))
+@@ -112,6 +112,7 @@ static inline bool kern_addr_valid(unsigned long addr)
+ #define _PAGE_R _AC(0x8000000000000000,UL) /* Keep ref bit uptodate*/
+ #define _PAGE_SPECIAL _AC(0x0200000000000000,UL) /* Special page */
+ #define _PAGE_PMD_HUGE _AC(0x0100000000000000,UL) /* Huge page */
++#define _PAGE_PUD_HUGE _PAGE_PMD_HUGE
+
+ /* Advertise support for _PAGE_SPECIAL */
+ #define __HAVE_ARCH_PTE_SPECIAL
+@@ -658,26 +659,26 @@ static inline unsigned long pmd_large(pmd_t pmd)
+ return pte_val(pte) & _PAGE_PMD_HUGE;
+ }
+
+-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+-static inline unsigned long pmd_young(pmd_t pmd)
++static inline unsigned long pmd_pfn(pmd_t pmd)
+ {
+ pte_t pte = __pte(pmd_val(pmd));
+
+- return pte_young(pte);
++ return pte_pfn(pte);
+ }
+
+-static inline unsigned long pmd_write(pmd_t pmd)
++#ifdef CONFIG_TRANSPARENT_HUGEPAGE
++static inline unsigned long pmd_young(pmd_t pmd)
+ {
+ pte_t pte = __pte(pmd_val(pmd));
+
+- return pte_write(pte);
++ return pte_young(pte);
+ }
+
+-static inline unsigned long pmd_pfn(pmd_t pmd)
++static inline unsigned long pmd_write(pmd_t pmd)
+ {
+ pte_t pte = __pte(pmd_val(pmd));
+
+- return pte_pfn(pte);
++ return pte_write(pte);
+ }
+
+ static inline unsigned long pmd_trans_huge(pmd_t pmd)
+@@ -771,13 +772,15 @@ static inline int pmd_present(pmd_t pmd)
+ * the top bits outside of the range of any physical address size we
+ * support are clear as well. We also validate the physical itself.
+ */
+-#define pmd_bad(pmd) ((pmd_val(pmd) & ~PAGE_MASK) || \
+- !__kern_addr_valid(pmd_val(pmd)))
++#define pmd_bad(pmd) (pmd_val(pmd) & ~PAGE_MASK)
+
+ #define pud_none(pud) (!pud_val(pud))
+
+-#define pud_bad(pud) ((pud_val(pud) & ~PAGE_MASK) || \
+- !__kern_addr_valid(pud_val(pud)))
++#define pud_bad(pud) (pud_val(pud) & ~PAGE_MASK)
++
++#define pgd_none(pgd) (!pgd_val(pgd))
++
++#define pgd_bad(pgd) (pgd_val(pgd) & ~PAGE_MASK)
+
+ #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+@@ -815,10 +818,31 @@ static inline unsigned long __pmd_page(pmd_t pmd)
+ #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0UL)
+ #define pud_present(pud) (pud_val(pud) != 0U)
+ #define pud_clear(pudp) (pud_val(*(pudp)) = 0UL)
++#define pgd_page_vaddr(pgd) \
++ ((unsigned long) __va(pgd_val(pgd)))
++#define pgd_present(pgd) (pgd_val(pgd) != 0U)
++#define pgd_clear(pgdp) (pgd_val(*(pgd)) = 0UL)
++
++static inline unsigned long pud_large(pud_t pud)
++{
++ pte_t pte = __pte(pud_val(pud));
++
++ return pte_val(pte) & _PAGE_PMD_HUGE;
++}
++
++static inline unsigned long pud_pfn(pud_t pud)
++{
++ pte_t pte = __pte(pud_val(pud));
++
++ return pte_pfn(pte);
++}
+
+ /* Same in both SUN4V and SUN4U. */
+ #define pte_none(pte) (!pte_val(pte))
+
++#define pgd_set(pgdp, pudp) \
++ (pgd_val(*(pgdp)) = (__pa((unsigned long) (pudp))))
++
+ /* to find an entry in a page-table-directory. */
+ #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
+ #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address))
+@@ -826,6 +850,11 @@ static inline unsigned long __pmd_page(pmd_t pmd)
+ /* to find an entry in a kernel page-table-directory */
+ #define pgd_offset_k(address) pgd_offset(&init_mm, address)
+
++/* Find an entry in the third-level page table.. */
++#define pud_index(address) (((address) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
++#define pud_offset(pgdp, address) \
++ ((pud_t *) pgd_page_vaddr(*(pgdp)) + pud_index(address))
++
+ /* Find an entry in the second-level page table.. */
+ #define pmd_offset(pudp, address) \
+ ((pmd_t *) pud_page_vaddr(*(pudp)) + \
+@@ -898,7 +927,6 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
+ #endif
+
+ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
+-extern pmd_t swapper_low_pmd_dir[PTRS_PER_PMD];
+
+ void paging_init(void);
+ unsigned long find_ecache_flush_span(unsigned long size);
+diff --git a/arch/sparc/include/asm/setup.h b/arch/sparc/include/asm/setup.h
+index f5fffd84d0dd..29d64b1758ed 100644
+--- a/arch/sparc/include/asm/setup.h
++++ b/arch/sparc/include/asm/setup.h
+@@ -48,6 +48,8 @@ unsigned long safe_compute_effective_address(struct pt_regs *, unsigned int);
+ #endif
+
+ #ifdef CONFIG_SPARC64
++void __init start_early_boot(void);
++
+ /* unaligned_64.c */
+ int handle_ldf_stq(u32 insn, struct pt_regs *regs);
+ void handle_ld_nf(u32 insn, struct pt_regs *regs);
+diff --git a/arch/sparc/include/asm/spitfire.h b/arch/sparc/include/asm/spitfire.h
+index 3fc58691dbd0..56f933816144 100644
+--- a/arch/sparc/include/asm/spitfire.h
++++ b/arch/sparc/include/asm/spitfire.h
+@@ -45,6 +45,8 @@
+ #define SUN4V_CHIP_NIAGARA3 0x03
+ #define SUN4V_CHIP_NIAGARA4 0x04
+ #define SUN4V_CHIP_NIAGARA5 0x05
++#define SUN4V_CHIP_SPARC_M6 0x06
++#define SUN4V_CHIP_SPARC_M7 0x07
+ #define SUN4V_CHIP_SPARC64X 0x8a
+ #define SUN4V_CHIP_UNKNOWN 0xff
+
+diff --git a/arch/sparc/include/asm/thread_info_64.h b/arch/sparc/include/asm/thread_info_64.h
+index a5f01ac6d0f1..cc6275c931a5 100644
+--- a/arch/sparc/include/asm/thread_info_64.h
++++ b/arch/sparc/include/asm/thread_info_64.h
+@@ -63,7 +63,8 @@ struct thread_info {
+ struct pt_regs *kern_una_regs;
+ unsigned int kern_una_insn;
+
+- unsigned long fpregs[0] __attribute__ ((aligned(64)));
++ unsigned long fpregs[(7 * 256) / sizeof(unsigned long)]
++ __attribute__ ((aligned(64)));
+ };
+
+ #endif /* !(__ASSEMBLY__) */
+@@ -102,6 +103,7 @@ struct thread_info {
+ #define FAULT_CODE_ITLB 0x04 /* Miss happened in I-TLB */
+ #define FAULT_CODE_WINFIXUP 0x08 /* Miss happened during spill/fill */
+ #define FAULT_CODE_BLKCOMMIT 0x10 /* Use blk-commit ASI in copy_page */
++#define FAULT_CODE_BAD_RA 0x20 /* Bad RA for sun4v */
+
+ #if PAGE_SHIFT == 13
+ #define THREAD_SIZE (2*PAGE_SIZE)
+diff --git a/arch/sparc/include/asm/tsb.h b/arch/sparc/include/asm/tsb.h
+index 90916f955cac..ecb49cfa3be9 100644
+--- a/arch/sparc/include/asm/tsb.h
++++ b/arch/sparc/include/asm/tsb.h
+@@ -133,9 +133,24 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ sub TSB, 0x8, TSB; \
+ TSB_STORE(TSB, TAG);
+
+- /* Do a kernel page table walk. Leaves physical PTE pointer in
+- * REG1. Jumps to FAIL_LABEL on early page table walk termination.
+- * VADDR will not be clobbered, but REG2 will.
++ /* Do a kernel page table walk. Leaves valid PTE value in
++ * REG1. Jumps to FAIL_LABEL on early page table walk
++ * termination. VADDR will not be clobbered, but REG2 will.
++ *
++ * There are two masks we must apply to propagate bits from
++ * the virtual address into the PTE physical address field
++ * when dealing with huge pages. This is because the page
++ * table boundaries do not match the huge page size(s) the
++ * hardware supports.
++ *
++ * In these cases we propagate the bits that are below the
++ * page table level where we saw the huge page mapping, but
++ * are still within the relevant physical bits for the huge
++ * page size in question. So for PMD mappings (which fall on
++ * bit 23, for 8MB per PMD) we must propagate bit 22 for a
++ * 4MB huge page. For huge PUDs (which fall on bit 33, for
++ * 8GB per PUD), we have to accommodate 256MB and 2GB huge
++ * pages. So for those we propagate bits 32 to 28.
+ */
+ #define KERN_PGTABLE_WALK(VADDR, REG1, REG2, FAIL_LABEL) \
+ sethi %hi(swapper_pg_dir), REG1; \
+@@ -145,15 +160,40 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ andn REG2, 0x7, REG2; \
+ ldx [REG1 + REG2], REG1; \
+ brz,pn REG1, FAIL_LABEL; \
+- sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
++ sllx VADDR, 64 - (PUD_SHIFT + PUD_BITS), REG2; \
+ srlx REG2, 64 - PAGE_SHIFT, REG2; \
+ andn REG2, 0x7, REG2; \
+ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
+ brz,pn REG1, FAIL_LABEL; \
+- sllx VADDR, 64 - PMD_SHIFT, REG2; \
++ sethi %uhi(_PAGE_PUD_HUGE), REG2; \
++ brz,pn REG1, FAIL_LABEL; \
++ sllx REG2, 32, REG2; \
++ andcc REG1, REG2, %g0; \
++ sethi %hi(0xf8000000), REG2; \
++ bne,pt %xcc, 697f; \
++ sllx REG2, 1, REG2; \
++ sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
+ srlx REG2, 64 - PAGE_SHIFT, REG2; \
+ andn REG2, 0x7, REG2; \
+- add REG1, REG2, REG1;
++ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
++ sethi %uhi(_PAGE_PMD_HUGE), REG2; \
++ brz,pn REG1, FAIL_LABEL; \
++ sllx REG2, 32, REG2; \
++ andcc REG1, REG2, %g0; \
++ be,pn %xcc, 698f; \
++ sethi %hi(0x400000), REG2; \
++697: brgez,pn REG1, FAIL_LABEL; \
++ andn REG1, REG2, REG1; \
++ and VADDR, REG2, REG2; \
++ ba,pt %xcc, 699f; \
++ or REG1, REG2, REG1; \
++698: sllx VADDR, 64 - PMD_SHIFT, REG2; \
++ srlx REG2, 64 - PAGE_SHIFT, REG2; \
++ andn REG2, 0x7, REG2; \
++ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
++ brgez,pn REG1, FAIL_LABEL; \
++ nop; \
++699:
+
+ /* PMD has been loaded into REG1, interpret the value, seeing
+ * if it is a HUGE PMD or a normal one. If it is not valid
+@@ -198,6 +238,11 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ andn REG2, 0x7, REG2; \
+ ldxa [PHYS_PGD + REG2] ASI_PHYS_USE_EC, REG1; \
+ brz,pn REG1, FAIL_LABEL; \
++ sllx VADDR, 64 - (PUD_SHIFT + PUD_BITS), REG2; \
++ srlx REG2, 64 - PAGE_SHIFT, REG2; \
++ andn REG2, 0x7, REG2; \
++ ldxa [REG1 + REG2] ASI_PHYS_USE_EC, REG1; \
++ brz,pn REG1, FAIL_LABEL; \
+ sllx VADDR, 64 - (PMD_SHIFT + PMD_BITS), REG2; \
+ srlx REG2, 64 - PAGE_SHIFT, REG2; \
+ andn REG2, 0x7, REG2; \
+@@ -246,8 +291,6 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ (KERNEL_TSB_SIZE_BYTES / 16)
+ #define KERNEL_TSB4M_NENTRIES 4096
+
+-#define KTSB_PHYS_SHIFT 15
+-
+ /* Do a kernel TSB lookup at tl>0 on VADDR+TAG, branch to OK_LABEL
+ * on TSB hit. REG1, REG2, REG3, and REG4 are used as temporaries
+ * and the found TTE will be left in REG1. REG3 and REG4 must
+@@ -256,17 +299,15 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ * VADDR and TAG will be preserved and not clobbered by this macro.
+ */
+ #define KERN_TSB_LOOKUP_TL1(VADDR, TAG, REG1, REG2, REG3, REG4, OK_LABEL) \
+-661: sethi %hi(swapper_tsb), REG1; \
+- or REG1, %lo(swapper_tsb), REG1; \
++661: sethi %uhi(swapper_tsb), REG1; \
++ sethi %hi(swapper_tsb), REG2; \
++ or REG1, %ulo(swapper_tsb), REG1; \
++ or REG2, %lo(swapper_tsb), REG2; \
+ .section .swapper_tsb_phys_patch, "ax"; \
+ .word 661b; \
+ .previous; \
+-661: nop; \
+- .section .tsb_ldquad_phys_patch, "ax"; \
+- .word 661b; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- .previous; \
++ sllx REG1, 32, REG1; \
++ or REG1, REG2, REG1; \
+ srlx VADDR, PAGE_SHIFT, REG2; \
+ and REG2, (KERNEL_TSB_NENTRIES - 1), REG2; \
+ sllx REG2, 4, REG2; \
+@@ -281,17 +322,15 @@ extern struct tsb_phys_patch_entry __tsb_phys_patch, __tsb_phys_patch_end;
+ * we can make use of that for the index computation.
+ */
+ #define KERN_TSB4M_LOOKUP_TL1(TAG, REG1, REG2, REG3, REG4, OK_LABEL) \
+-661: sethi %hi(swapper_4m_tsb), REG1; \
+- or REG1, %lo(swapper_4m_tsb), REG1; \
++661: sethi %uhi(swapper_4m_tsb), REG1; \
++ sethi %hi(swapper_4m_tsb), REG2; \
++ or REG1, %ulo(swapper_4m_tsb), REG1; \
++ or REG2, %lo(swapper_4m_tsb), REG2; \
+ .section .swapper_4m_tsb_phys_patch, "ax"; \
+ .word 661b; \
+ .previous; \
+-661: nop; \
+- .section .tsb_ldquad_phys_patch, "ax"; \
+- .word 661b; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- sllx REG1, KTSB_PHYS_SHIFT, REG1; \
+- .previous; \
++ sllx REG1, 32, REG1; \
++ or REG1, REG2, REG1; \
+ and TAG, (KERNEL_TSB4M_NENTRIES - 1), REG2; \
+ sllx REG2, 4, REG2; \
+ add REG1, REG2, REG2; \
+diff --git a/arch/sparc/include/asm/visasm.h b/arch/sparc/include/asm/visasm.h
+index b26673759283..1f0aa2024e94 100644
+--- a/arch/sparc/include/asm/visasm.h
++++ b/arch/sparc/include/asm/visasm.h
+@@ -39,6 +39,14 @@
+ 297: wr %o5, FPRS_FEF, %fprs; \
+ 298:
+
++#define VISEntryHalfFast(fail_label) \
++ rd %fprs, %o5; \
++ andcc %o5, FPRS_FEF, %g0; \
++ be,pt %icc, 297f; \
++ nop; \
++ ba,a,pt %xcc, fail_label; \
++297: wr %o5, FPRS_FEF, %fprs;
++
+ #define VISExitHalf \
+ wr %o5, 0, %fprs;
+
+diff --git a/arch/sparc/kernel/cpu.c b/arch/sparc/kernel/cpu.c
+index 82a3a71c451e..dfad8b1aea9f 100644
+--- a/arch/sparc/kernel/cpu.c
++++ b/arch/sparc/kernel/cpu.c
+@@ -494,6 +494,18 @@ static void __init sun4v_cpu_probe(void)
+ sparc_pmu_type = "niagara5";
+ break;
+
++ case SUN4V_CHIP_SPARC_M6:
++ sparc_cpu_type = "SPARC-M6";
++ sparc_fpu_type = "SPARC-M6 integrated FPU";
++ sparc_pmu_type = "sparc-m6";
++ break;
++
++ case SUN4V_CHIP_SPARC_M7:
++ sparc_cpu_type = "SPARC-M7";
++ sparc_fpu_type = "SPARC-M7 integrated FPU";
++ sparc_pmu_type = "sparc-m7";
++ break;
++
+ case SUN4V_CHIP_SPARC64X:
+ sparc_cpu_type = "SPARC64-X";
+ sparc_fpu_type = "SPARC64-X integrated FPU";
+diff --git a/arch/sparc/kernel/cpumap.c b/arch/sparc/kernel/cpumap.c
+index de1c844dfabc..e69ec0e3f155 100644
+--- a/arch/sparc/kernel/cpumap.c
++++ b/arch/sparc/kernel/cpumap.c
+@@ -326,6 +326,8 @@ static int iterate_cpu(struct cpuinfo_tree *t, unsigned int root_index)
+ case SUN4V_CHIP_NIAGARA3:
+ case SUN4V_CHIP_NIAGARA4:
+ case SUN4V_CHIP_NIAGARA5:
++ case SUN4V_CHIP_SPARC_M6:
++ case SUN4V_CHIP_SPARC_M7:
+ case SUN4V_CHIP_SPARC64X:
+ rover_inc_table = niagara_iterate_method;
+ break;
+diff --git a/arch/sparc/kernel/ds.c b/arch/sparc/kernel/ds.c
+index dff60abbea01..f87a55d77094 100644
+--- a/arch/sparc/kernel/ds.c
++++ b/arch/sparc/kernel/ds.c
+@@ -1200,14 +1200,14 @@ static int ds_probe(struct vio_dev *vdev, const struct vio_device_id *id)
+ ds_cfg.tx_irq = vdev->tx_irq;
+ ds_cfg.rx_irq = vdev->rx_irq;
+
+- lp = ldc_alloc(vdev->channel_id, &ds_cfg, dp);
++ lp = ldc_alloc(vdev->channel_id, &ds_cfg, dp, "DS");
+ if (IS_ERR(lp)) {
+ err = PTR_ERR(lp);
+ goto out_free_ds_states;
+ }
+ dp->lp = lp;
+
+- err = ldc_bind(lp, "DS");
++ err = ldc_bind(lp);
+ if (err)
+ goto out_free_ldc;
+
+diff --git a/arch/sparc/kernel/dtlb_prot.S b/arch/sparc/kernel/dtlb_prot.S
+index b2c2c5be281c..d668ca149e64 100644
+--- a/arch/sparc/kernel/dtlb_prot.S
++++ b/arch/sparc/kernel/dtlb_prot.S
+@@ -24,11 +24,11 @@
+ mov TLB_TAG_ACCESS, %g4 ! For reload of vaddr
+
+ /* PROT ** ICACHE line 2: More real fault processing */
++ ldxa [%g4] ASI_DMMU, %g5 ! Put tagaccess in %g5
+ bgu,pn %xcc, winfix_trampoline ! Yes, perform winfixup
+- ldxa [%g4] ASI_DMMU, %g5 ! Put tagaccess in %g5
+- ba,pt %xcc, sparc64_realfault_common ! Nope, normal fault
+ mov FAULT_CODE_DTLB | FAULT_CODE_WRITE, %g4
+- nop
++ ba,pt %xcc, sparc64_realfault_common ! Nope, normal fault
++ nop
+ nop
+ nop
+ nop
+diff --git a/arch/sparc/kernel/entry.h b/arch/sparc/kernel/entry.h
+index ebaba6167dd4..88d322b67fac 100644
+--- a/arch/sparc/kernel/entry.h
++++ b/arch/sparc/kernel/entry.h
+@@ -65,13 +65,10 @@ struct pause_patch_entry {
+ extern struct pause_patch_entry __pause_3insn_patch,
+ __pause_3insn_patch_end;
+
+-void __init per_cpu_patch(void);
+ void sun4v_patch_1insn_range(struct sun4v_1insn_patch_entry *,
+ struct sun4v_1insn_patch_entry *);
+ void sun4v_patch_2insn_range(struct sun4v_2insn_patch_entry *,
+ struct sun4v_2insn_patch_entry *);
+-void __init sun4v_patch(void);
+-void __init boot_cpu_id_too_large(int cpu);
+ extern unsigned int dcache_parity_tl1_occurred;
+ extern unsigned int icache_parity_tl1_occurred;
+
+diff --git a/arch/sparc/kernel/head_64.S b/arch/sparc/kernel/head_64.S
+index 452f04fe8da6..3d61fcae7ee3 100644
+--- a/arch/sparc/kernel/head_64.S
++++ b/arch/sparc/kernel/head_64.S
+@@ -427,6 +427,12 @@ sun4v_chip_type:
+ cmp %g2, '5'
+ be,pt %xcc, 5f
+ mov SUN4V_CHIP_NIAGARA5, %g4
++ cmp %g2, '6'
++ be,pt %xcc, 5f
++ mov SUN4V_CHIP_SPARC_M6, %g4
++ cmp %g2, '7'
++ be,pt %xcc, 5f
++ mov SUN4V_CHIP_SPARC_M7, %g4
+ ba,pt %xcc, 49f
+ nop
+
+@@ -585,6 +591,12 @@ niagara_tlb_fixup:
+ cmp %g1, SUN4V_CHIP_NIAGARA5
+ be,pt %xcc, niagara4_patch
+ nop
++ cmp %g1, SUN4V_CHIP_SPARC_M6
++ be,pt %xcc, niagara4_patch
++ nop
++ cmp %g1, SUN4V_CHIP_SPARC_M7
++ be,pt %xcc, niagara4_patch
++ nop
+
+ call generic_patch_copyops
+ nop
+@@ -660,14 +672,12 @@ tlb_fixup_done:
+ sethi %hi(init_thread_union), %g6
+ or %g6, %lo(init_thread_union), %g6
+ ldx [%g6 + TI_TASK], %g4
+- mov %sp, %l6
+
+ wr %g0, ASI_P, %asi
+ mov 1, %g1
+ sllx %g1, THREAD_SHIFT, %g1
+ sub %g1, (STACKFRAME_SZ + STACK_BIAS), %g1
+ add %g6, %g1, %sp
+- mov 0, %fp
+
+ /* Set per-cpu pointer initially to zero, this makes
+ * the boot-cpu use the in-kernel-image per-cpu areas
+@@ -694,44 +704,14 @@ tlb_fixup_done:
+ nop
+ #endif
+
+- mov %l6, %o1 ! OpenPROM stack
+ call prom_init
+ mov %l7, %o0 ! OpenPROM cif handler
+
+- /* Initialize current_thread_info()->cpu as early as possible.
+- * In order to do that accurately we have to patch up the get_cpuid()
+- * assembler sequences. And that, in turn, requires that we know
+- * if we are on a Starfire box or not. While we're here, patch up
+- * the sun4v sequences as well.
++ /* To create a one-register-window buffer between the kernel's
++ * initial stack and the last stack frame we use from the firmware,
++ * do the rest of the boot from a C helper function.
+ */
+- call check_if_starfire
+- nop
+- call per_cpu_patch
+- nop
+- call sun4v_patch
+- nop
+-
+-#ifdef CONFIG_SMP
+- call hard_smp_processor_id
+- nop
+- cmp %o0, NR_CPUS
+- blu,pt %xcc, 1f
+- nop
+- call boot_cpu_id_too_large
+- nop
+- /* Not reached... */
+-
+-1:
+-#else
+- mov 0, %o0
+-#endif
+- sth %o0, [%g6 + TI_CPU]
+-
+- call prom_init_report
+- nop
+-
+- /* Off we go.... */
+- call start_kernel
++ call start_early_boot
+ nop
+ /* Not reached... */
+
+diff --git a/arch/sparc/kernel/hvapi.c b/arch/sparc/kernel/hvapi.c
+index c0a2de0fd624..5c55145bfbf0 100644
+--- a/arch/sparc/kernel/hvapi.c
++++ b/arch/sparc/kernel/hvapi.c
+@@ -46,6 +46,7 @@ static struct api_info api_table[] = {
+ { .group = HV_GRP_VF_CPU, },
+ { .group = HV_GRP_KT_CPU, },
+ { .group = HV_GRP_VT_CPU, },
++ { .group = HV_GRP_T5_CPU, },
+ { .group = HV_GRP_DIAG, .flags = FLAG_PRE_API },
+ };
+
+diff --git a/arch/sparc/kernel/hvcalls.S b/arch/sparc/kernel/hvcalls.S
+index f3ab509b76a8..caedf8320416 100644
+--- a/arch/sparc/kernel/hvcalls.S
++++ b/arch/sparc/kernel/hvcalls.S
+@@ -821,3 +821,19 @@ ENTRY(sun4v_vt_set_perfreg)
+ retl
+ nop
+ ENDPROC(sun4v_vt_set_perfreg)
++
++ENTRY(sun4v_t5_get_perfreg)
++ mov %o1, %o4
++ mov HV_FAST_T5_GET_PERFREG, %o5
++ ta HV_FAST_TRAP
++ stx %o1, [%o4]
++ retl
++ nop
++ENDPROC(sun4v_t5_get_perfreg)
++
++ENTRY(sun4v_t5_set_perfreg)
++ mov HV_FAST_T5_SET_PERFREG, %o5
++ ta HV_FAST_TRAP
++ retl
++ nop
++ENDPROC(sun4v_t5_set_perfreg)
+diff --git a/arch/sparc/kernel/hvtramp.S b/arch/sparc/kernel/hvtramp.S
+index b7ddcdd1dea9..cdbfec299f2f 100644
+--- a/arch/sparc/kernel/hvtramp.S
++++ b/arch/sparc/kernel/hvtramp.S
+@@ -109,7 +109,6 @@ hv_cpu_startup:
+ sllx %g5, THREAD_SHIFT, %g5
+ sub %g5, (STACKFRAME_SZ + STACK_BIAS), %g5
+ add %g6, %g5, %sp
+- mov 0, %fp
+
+ call init_irqwork_curcpu
+ nop
+diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
+index 7f08ec8a7c68..28fed53b13a0 100644
+--- a/arch/sparc/kernel/ioport.c
++++ b/arch/sparc/kernel/ioport.c
+@@ -278,7 +278,8 @@ static void *sbus_alloc_coherent(struct device *dev, size_t len,
+ }
+
+ order = get_order(len_total);
+- if ((va = __get_free_pages(GFP_KERNEL|__GFP_COMP, order)) == 0)
++ va = __get_free_pages(gfp, order);
++ if (va == 0)
+ goto err_nopages;
+
+ if ((res = kzalloc(sizeof(struct resource), GFP_KERNEL)) == NULL)
+@@ -443,7 +444,7 @@ static void *pci32_alloc_coherent(struct device *dev, size_t len,
+ }
+
+ order = get_order(len_total);
+- va = (void *) __get_free_pages(GFP_KERNEL, order);
++ va = (void *) __get_free_pages(gfp, order);
+ if (va == NULL) {
+ printk("pci_alloc_consistent: no %ld pages\n", len_total>>PAGE_SHIFT);
+ goto err_nopages;
+diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
+index 666193f4e8bb..4033c23bdfa6 100644
+--- a/arch/sparc/kernel/irq_64.c
++++ b/arch/sparc/kernel/irq_64.c
+@@ -47,8 +47,6 @@
+ #include "cpumap.h"
+ #include "kstack.h"
+
+-#define NUM_IVECS (IMAP_INR + 1)
+-
+ struct ino_bucket *ivector_table;
+ unsigned long ivector_table_pa;
+
+@@ -107,55 +105,196 @@ static void bucket_set_irq(unsigned long bucket_pa, unsigned int irq)
+
+ #define irq_work_pa(__cpu) &(trap_block[(__cpu)].irq_worklist_pa)
+
+-static struct {
+- unsigned int dev_handle;
+- unsigned int dev_ino;
+- unsigned int in_use;
+-} irq_table[NR_IRQS];
+-static DEFINE_SPINLOCK(irq_alloc_lock);
++static unsigned long hvirq_major __initdata;
++static int __init early_hvirq_major(char *p)
++{
++ int rc = kstrtoul(p, 10, &hvirq_major);
++
++ return rc;
++}
++early_param("hvirq", early_hvirq_major);
++
++static int hv_irq_version;
++
++/* Major version 2.0 of HV_GRP_INTR added support for the VIRQ cookie
++ * based interfaces, but:
++ *
++ * 1) Several OSs, Solaris and Linux included, use them even when only
++ * negotiating version 1.0 (or failing to negotiate at all). So the
++ * hypervisor has a workaround that provides the VIRQ interfaces even
+ * when only version 1.0 of the API is in use.
++ *
++ * 2) Second, and more importantly, with major version 2.0 these VIRQ
++ * interfaces only were actually hooked up for LDC interrupts, even
++ * though the Hypervisor specification clearly stated:
++ *
++ * The new interrupt API functions will be available to a guest
++ * when it negotiates version 2.0 in the interrupt API group 0x2. When
++ * a guest negotiates version 2.0, all interrupt sources will only
++ * support using the cookie interface, and any attempt to use the
++ * version 1.0 interrupt APIs numbered 0xa0 to 0xa6 will result in the
++ * ENOTSUPPORTED error being returned.
++ *
++ * with an emphasis on "all interrupt sources".
++ *
++ * To correct this, major version 3.0 was created which does actually
++ * support VIRQs for all interrupt sources (not just LDC devices). So
+ * if we want to move completely over to the cookie based VIRQs we must
++ * negotiate major version 3.0 or later of HV_GRP_INTR.
++ */
++static bool sun4v_cookie_only_virqs(void)
++{
++ if (hv_irq_version >= 3)
++ return true;
++ return false;
++}
+
+-unsigned char irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
++static void __init irq_init_hv(void)
+ {
+- unsigned long flags;
+- unsigned char ent;
++ unsigned long hv_error, major, minor = 0;
++
++ if (tlb_type != hypervisor)
++ return;
+
+- BUILD_BUG_ON(NR_IRQS >= 256);
++ if (hvirq_major)
++ major = hvirq_major;
++ else
++ major = 3;
+
+- spin_lock_irqsave(&irq_alloc_lock, flags);
++ hv_error = sun4v_hvapi_register(HV_GRP_INTR, major, &minor);
++ if (!hv_error)
++ hv_irq_version = major;
++ else
++ hv_irq_version = 1;
+
+- for (ent = 1; ent < NR_IRQS; ent++) {
+- if (!irq_table[ent].in_use)
++ pr_info("SUN4V: Using IRQ API major %d, cookie only virqs %s\n",
++ hv_irq_version,
++ sun4v_cookie_only_virqs() ? "enabled" : "disabled");
++}
++
++/* This function is for the timer interrupt. */
++int __init arch_probe_nr_irqs(void)
++{
++ return 1;
++}
++
++#define DEFAULT_NUM_IVECS (0xfffU)
++static unsigned int nr_ivec = DEFAULT_NUM_IVECS;
++#define NUM_IVECS (nr_ivec)
++
++static unsigned int __init size_nr_ivec(void)
++{
++ if (tlb_type == hypervisor) {
++ switch (sun4v_chip_type) {
++ /* Athena's devhandle|devino is large. */
++ case SUN4V_CHIP_SPARC64X:
++ nr_ivec = 0xffff;
+ break;
++ }
+ }
+- if (ent >= NR_IRQS) {
+- printk(KERN_ERR "IRQ: Out of virtual IRQs.\n");
+- ent = 0;
+- } else {
+- irq_table[ent].dev_handle = dev_handle;
+- irq_table[ent].dev_ino = dev_ino;
+- irq_table[ent].in_use = 1;
+- }
++ return nr_ivec;
++}
++
++struct irq_handler_data {
++ union {
++ struct {
++ unsigned int dev_handle;
++ unsigned int dev_ino;
++ };
++ unsigned long sysino;
++ };
++ struct ino_bucket bucket;
++ unsigned long iclr;
++ unsigned long imap;
++};
++
++static inline unsigned int irq_data_to_handle(struct irq_data *data)
++{
++ struct irq_handler_data *ihd = data->handler_data;
++
++ return ihd->dev_handle;
++}
++
++static inline unsigned int irq_data_to_ino(struct irq_data *data)
++{
++ struct irq_handler_data *ihd = data->handler_data;
+
+- spin_unlock_irqrestore(&irq_alloc_lock, flags);
++ return ihd->dev_ino;
++}
++
++static inline unsigned long irq_data_to_sysino(struct irq_data *data)
++{
++ struct irq_handler_data *ihd = data->handler_data;
+
+- return ent;
++ return ihd->sysino;
+ }
+
+-#ifdef CONFIG_PCI_MSI
+ void irq_free(unsigned int irq)
+ {
+- unsigned long flags;
++ void *data = irq_get_handler_data(irq);
+
+- if (irq >= NR_IRQS)
+- return;
++ kfree(data);
++ irq_set_handler_data(irq, NULL);
++ irq_free_descs(irq, 1);
++}
+
+- spin_lock_irqsave(&irq_alloc_lock, flags);
++unsigned int irq_alloc(unsigned int dev_handle, unsigned int dev_ino)
++{
++ int irq;
+
+- irq_table[irq].in_use = 0;
++ irq = __irq_alloc_descs(-1, 1, 1, numa_node_id(), NULL);
++ if (irq <= 0)
++ goto out;
+
+- spin_unlock_irqrestore(&irq_alloc_lock, flags);
++ return irq;
++out:
++ return 0;
++}
++
++static unsigned int cookie_exists(u32 devhandle, unsigned int devino)
++{
++ unsigned long hv_err, cookie;
++ struct ino_bucket *bucket;
++ unsigned int irq = 0U;
++
++ hv_err = sun4v_vintr_get_cookie(devhandle, devino, &cookie);
++ if (hv_err) {
++ pr_err("HV get cookie failed hv_err = %ld\n", hv_err);
++ goto out;
++ }
++
++ if (cookie & ((1UL << 63UL))) {
++ cookie = ~cookie;
++ bucket = (struct ino_bucket *) __va(cookie);
++ irq = bucket->__irq;
++ }
++out:
++ return irq;
++}
++
++static unsigned int sysino_exists(u32 devhandle, unsigned int devino)
++{
++ unsigned long sysino = sun4v_devino_to_sysino(devhandle, devino);
++ struct ino_bucket *bucket;
++ unsigned int irq;
++
++ bucket = &ivector_table[sysino];
++ irq = bucket_get_irq(__pa(bucket));
++
++ return irq;
++}
++
++void ack_bad_irq(unsigned int irq)
++{
++ pr_crit("BAD IRQ ack %d\n", irq);
++}
++
++void irq_install_pre_handler(int irq,
++ void (*func)(unsigned int, void *, void *),
++ void *arg1, void *arg2)
++{
++ pr_warn("IRQ pre handler NOT supported.\n");
+ }
+-#endif
+
+ /*
+ * /proc/interrupts printing:
+@@ -206,15 +345,6 @@ static unsigned int sun4u_compute_tid(unsigned long imap, unsigned long cpuid)
+ return tid;
+ }
+
+-struct irq_handler_data {
+- unsigned long iclr;
+- unsigned long imap;
+-
+- void (*pre_handler)(unsigned int, void *, void *);
+- void *arg1;
+- void *arg2;
+-};
+-
+ #ifdef CONFIG_SMP
+ static int irq_choose_cpu(unsigned int irq, const struct cpumask *affinity)
+ {
+@@ -316,8 +446,8 @@ static void sun4u_irq_eoi(struct irq_data *data)
+
+ static void sun4v_irq_enable(struct irq_data *data)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
+ unsigned long cpuid = irq_choose_cpu(data->irq, data->affinity);
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_settarget(ino, cpuid);
+@@ -337,8 +467,8 @@ static void sun4v_irq_enable(struct irq_data *data)
+ static int sun4v_set_affinity(struct irq_data *data,
+ const struct cpumask *mask, bool force)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
+ unsigned long cpuid = irq_choose_cpu(data->irq, mask);
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_settarget(ino, cpuid);
+@@ -351,7 +481,7 @@ static int sun4v_set_affinity(struct irq_data *data,
+
+ static void sun4v_irq_disable(struct irq_data *data)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_setenabled(ino, HV_INTR_DISABLED);
+@@ -362,7 +492,7 @@ static void sun4v_irq_disable(struct irq_data *data)
+
+ static void sun4v_irq_eoi(struct irq_data *data)
+ {
+- unsigned int ino = irq_table[data->irq].dev_ino;
++ unsigned int ino = irq_data_to_sysino(data);
+ int err;
+
+ err = sun4v_intr_setstate(ino, HV_INTR_STATE_IDLE);
+@@ -373,14 +503,13 @@ static void sun4v_irq_eoi(struct irq_data *data)
+
+ static void sun4v_virq_enable(struct irq_data *data)
+ {
+- unsigned long cpuid, dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
++ unsigned long cpuid;
+ int err;
+
+ cpuid = irq_choose_cpu(data->irq, data->affinity);
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+-
+ err = sun4v_vintr_set_target(dev_handle, dev_ino, cpuid);
+ if (err != HV_EOK)
+ printk(KERN_ERR "sun4v_vintr_set_target(%lx,%lx,%lu): "
+@@ -403,14 +532,13 @@ static void sun4v_virq_enable(struct irq_data *data)
+ static int sun4v_virt_set_affinity(struct irq_data *data,
+ const struct cpumask *mask, bool force)
+ {
+- unsigned long cpuid, dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
++ unsigned long cpuid;
+ int err;
+
+ cpuid = irq_choose_cpu(data->irq, mask);
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+-
+ err = sun4v_vintr_set_target(dev_handle, dev_ino, cpuid);
+ if (err != HV_EOK)
+ printk(KERN_ERR "sun4v_vintr_set_target(%lx,%lx,%lu): "
+@@ -422,11 +550,10 @@ static int sun4v_virt_set_affinity(struct irq_data *data,
+
+ static void sun4v_virq_disable(struct irq_data *data)
+ {
+- unsigned long dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
+ int err;
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+
+ err = sun4v_vintr_set_valid(dev_handle, dev_ino,
+ HV_INTR_DISABLED);
+@@ -438,12 +565,10 @@ static void sun4v_virq_disable(struct irq_data *data)
+
+ static void sun4v_virq_eoi(struct irq_data *data)
+ {
+- unsigned long dev_handle, dev_ino;
++ unsigned long dev_handle = irq_data_to_handle(data);
++ unsigned long dev_ino = irq_data_to_ino(data);
+ int err;
+
+- dev_handle = irq_table[data->irq].dev_handle;
+- dev_ino = irq_table[data->irq].dev_ino;
+-
+ err = sun4v_vintr_set_state(dev_handle, dev_ino,
+ HV_INTR_STATE_IDLE);
+ if (err != HV_EOK)
+@@ -479,31 +604,10 @@ static struct irq_chip sun4v_virq = {
+ .flags = IRQCHIP_EOI_IF_HANDLED,
+ };
+
+-static void pre_flow_handler(struct irq_data *d)
+-{
+- struct irq_handler_data *handler_data = irq_data_get_irq_handler_data(d);
+- unsigned int ino = irq_table[d->irq].dev_ino;
+-
+- handler_data->pre_handler(ino, handler_data->arg1, handler_data->arg2);
+-}
+-
+-void irq_install_pre_handler(int irq,
+- void (*func)(unsigned int, void *, void *),
+- void *arg1, void *arg2)
+-{
+- struct irq_handler_data *handler_data = irq_get_handler_data(irq);
+-
+- handler_data->pre_handler = func;
+- handler_data->arg1 = arg1;
+- handler_data->arg2 = arg2;
+-
+- __irq_set_preflow_handler(irq, pre_flow_handler);
+-}
+-
+ unsigned int build_irq(int inofixup, unsigned long iclr, unsigned long imap)
+ {
+- struct ino_bucket *bucket;
+ struct irq_handler_data *handler_data;
++ struct ino_bucket *bucket;
+ unsigned int irq;
+ int ino;
+
+@@ -537,119 +641,166 @@ out:
+ return irq;
+ }
+
+-static unsigned int sun4v_build_common(unsigned long sysino,
+- struct irq_chip *chip)
++static unsigned int sun4v_build_common(u32 devhandle, unsigned int devino,
++ void (*handler_data_init)(struct irq_handler_data *data,
++ u32 devhandle, unsigned int devino),
++ struct irq_chip *chip)
+ {
+- struct ino_bucket *bucket;
+- struct irq_handler_data *handler_data;
++ struct irq_handler_data *data;
+ unsigned int irq;
+
+- BUG_ON(tlb_type != hypervisor);
++ irq = irq_alloc(devhandle, devino);
++ if (!irq)
++ goto out;
+
+- bucket = &ivector_table[sysino];
+- irq = bucket_get_irq(__pa(bucket));
+- if (!irq) {
+- irq = irq_alloc(0, sysino);
+- bucket_set_irq(__pa(bucket), irq);
+- irq_set_chip_and_handler_name(irq, chip, handle_fasteoi_irq,
+- "IVEC");
++ data = kzalloc(sizeof(struct irq_handler_data), GFP_ATOMIC);
++ if (unlikely(!data)) {
++ pr_err("IRQ handler data allocation failed.\n");
++ irq_free(irq);
++ irq = 0;
++ goto out;
+ }
+
+- handler_data = irq_get_handler_data(irq);
+- if (unlikely(handler_data))
+- goto out;
++ irq_set_handler_data(irq, data);
++ handler_data_init(data, devhandle, devino);
++ irq_set_chip_and_handler_name(irq, chip, handle_fasteoi_irq, "IVEC");
++ data->imap = ~0UL;
++ data->iclr = ~0UL;
++out:
++ return irq;
++}
+
+- handler_data = kzalloc(sizeof(struct irq_handler_data), GFP_ATOMIC);
+- if (unlikely(!handler_data)) {
+- prom_printf("IRQ: kzalloc(irq_handler_data) failed.\n");
+- prom_halt();
+- }
+- irq_set_handler_data(irq, handler_data);
++static unsigned long cookie_assign(unsigned int irq, u32 devhandle,
++ unsigned int devino)
++{
++ struct irq_handler_data *ihd = irq_get_handler_data(irq);
++ unsigned long hv_error, cookie;
+
+- /* Catch accidental accesses to these things. IMAP/ICLR handling
+- * is done by hypervisor calls on sun4v platforms, not by direct
+- * register accesses.
++ /* handler_irq needs to find the irq. The cookie is seen as signed in
++ * sun4v_dev_mondo and treated as a non-ivector_table delivery.
+ */
+- handler_data->imap = ~0UL;
+- handler_data->iclr = ~0UL;
++ ihd->bucket.__irq = irq;
++ cookie = ~__pa(&ihd->bucket);
+
+-out:
+- return irq;
++ hv_error = sun4v_vintr_set_cookie(devhandle, devino, cookie);
++ if (hv_error)
++ pr_err("HV vintr set cookie failed = %ld\n", hv_error);
++
++ return hv_error;
+ }
+
+-unsigned int sun4v_build_irq(u32 devhandle, unsigned int devino)
++static void cookie_handler_data(struct irq_handler_data *data,
++ u32 devhandle, unsigned int devino)
+ {
+- unsigned long sysino = sun4v_devino_to_sysino(devhandle, devino);
++ data->dev_handle = devhandle;
++ data->dev_ino = devino;
++}
+
+- return sun4v_build_common(sysino, &sun4v_irq);
++static unsigned int cookie_build_irq(u32 devhandle, unsigned int devino,
++ struct irq_chip *chip)
++{
++ unsigned long hv_error;
++ unsigned int irq;
++
++ irq = sun4v_build_common(devhandle, devino, cookie_handler_data, chip);
++
++ hv_error = cookie_assign(irq, devhandle, devino);
++ if (hv_error) {
++ irq_free(irq);
++ irq = 0;
++ }
++
++ return irq;
+ }
+
+-unsigned int sun4v_build_virq(u32 devhandle, unsigned int devino)
++static unsigned int sun4v_build_cookie(u32 devhandle, unsigned int devino)
+ {
+- struct irq_handler_data *handler_data;
+- unsigned long hv_err, cookie;
+- struct ino_bucket *bucket;
+ unsigned int irq;
+
+- bucket = kzalloc(sizeof(struct ino_bucket), GFP_ATOMIC);
+- if (unlikely(!bucket))
+- return 0;
++ irq = cookie_exists(devhandle, devino);
++ if (irq)
++ goto out;
+
+- /* The only reference we store to the IRQ bucket is
+- * by physical address which kmemleak can't see, tell
+- * it that this object explicitly is not a leak and
+- * should be scanned.
+- */
+- kmemleak_not_leak(bucket);
++ irq = cookie_build_irq(devhandle, devino, &sun4v_virq);
+
+- __flush_dcache_range((unsigned long) bucket,
+- ((unsigned long) bucket +
+- sizeof(struct ino_bucket)));
++out:
++ return irq;
++}
+
+- irq = irq_alloc(devhandle, devino);
++static void sysino_set_bucket(unsigned int irq)
++{
++ struct irq_handler_data *ihd = irq_get_handler_data(irq);
++ struct ino_bucket *bucket;
++ unsigned long sysino;
++
++ sysino = sun4v_devino_to_sysino(ihd->dev_handle, ihd->dev_ino);
++ BUG_ON(sysino >= nr_ivec);
++ bucket = &ivector_table[sysino];
+ bucket_set_irq(__pa(bucket), irq);
++}
+
+- irq_set_chip_and_handler_name(irq, &sun4v_virq, handle_fasteoi_irq,
+- "IVEC");
++static void sysino_handler_data(struct irq_handler_data *data,
++ u32 devhandle, unsigned int devino)
++{
++ unsigned long sysino;
+
+- handler_data = kzalloc(sizeof(struct irq_handler_data), GFP_ATOMIC);
+- if (unlikely(!handler_data))
+- return 0;
++ sysino = sun4v_devino_to_sysino(devhandle, devino);
++ data->sysino = sysino;
++}
+
+- /* In order to make the LDC channel startup sequence easier,
+- * especially wrt. locking, we do not let request_irq() enable
+- * the interrupt.
+- */
+- irq_set_status_flags(irq, IRQ_NOAUTOEN);
+- irq_set_handler_data(irq, handler_data);
++static unsigned int sysino_build_irq(u32 devhandle, unsigned int devino,
++ struct irq_chip *chip)
++{
++ unsigned int irq;
+
+- /* Catch accidental accesses to these things. IMAP/ICLR handling
+- * is done by hypervisor calls on sun4v platforms, not by direct
+- * register accesses.
+- */
+- handler_data->imap = ~0UL;
+- handler_data->iclr = ~0UL;
++ irq = sun4v_build_common(devhandle, devino, sysino_handler_data, chip);
++ if (!irq)
++ goto out;
+
+- cookie = ~__pa(bucket);
+- hv_err = sun4v_vintr_set_cookie(devhandle, devino, cookie);
+- if (hv_err) {
+- prom_printf("IRQ: Fatal, cannot set cookie for [%x:%x] "
+- "err=%lu\n", devhandle, devino, hv_err);
+- prom_halt();
+- }
++ sysino_set_bucket(irq);
++out:
++ return irq;
++}
+
++static int sun4v_build_sysino(u32 devhandle, unsigned int devino)
++{
++ int irq;
++
++ irq = sysino_exists(devhandle, devino);
++ if (irq)
++ goto out;
++
++ irq = sysino_build_irq(devhandle, devino, &sun4v_irq);
++out:
+ return irq;
+ }
+
+-void ack_bad_irq(unsigned int irq)
++unsigned int sun4v_build_irq(u32 devhandle, unsigned int devino)
+ {
+- unsigned int ino = irq_table[irq].dev_ino;
++ unsigned int irq;
+
+- if (!ino)
+- ino = 0xdeadbeef;
++ if (sun4v_cookie_only_virqs())
++ irq = sun4v_build_cookie(devhandle, devino);
++ else
++ irq = sun4v_build_sysino(devhandle, devino);
+
+- printk(KERN_CRIT "Unexpected IRQ from ino[%x] irq[%u]\n",
+- ino, irq);
++ return irq;
++}
++
++unsigned int sun4v_build_virq(u32 devhandle, unsigned int devino)
++{
++ int irq;
++
++ irq = cookie_build_irq(devhandle, devino, &sun4v_virq);
++ if (!irq)
++ goto out;
++
++ /* This is borrowed from the original function.
++ */
++ irq_set_status_flags(irq, IRQ_NOAUTOEN);
++
++out:
++ return irq;
+ }
+
+ void *hardirq_stack[NR_CPUS];
+@@ -720,9 +871,12 @@ void fixup_irqs(void)
+
+ for (irq = 0; irq < NR_IRQS; irq++) {
+ struct irq_desc *desc = irq_to_desc(irq);
+- struct irq_data *data = irq_desc_get_irq_data(desc);
++ struct irq_data *data;
+ unsigned long flags;
+
++ if (!desc)
++ continue;
++ data = irq_desc_get_irq_data(desc);
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ if (desc->action && !irqd_is_per_cpu(data)) {
+ if (data->chip->irq_set_affinity)
+@@ -922,16 +1076,22 @@ static struct irqaction timer_irq_action = {
+ .name = "timer",
+ };
+
+-/* Only invoked on boot processor. */
+-void __init init_IRQ(void)
++static void __init irq_ivector_init(void)
+ {
+- unsigned long size;
++ unsigned long size, order;
++ unsigned int ivecs;
+
+- map_prom_timers();
+- kill_prom_timer();
++ /* If we are doing cookie only VIRQs then we do not need the ivector
++ * table to process interrupts.
++ */
++ if (sun4v_cookie_only_virqs())
++ return;
+
+- size = sizeof(struct ino_bucket) * NUM_IVECS;
+- ivector_table = kzalloc(size, GFP_KERNEL);
++ ivecs = size_nr_ivec();
++ size = sizeof(struct ino_bucket) * ivecs;
++ order = get_order(size);
++ ivector_table = (struct ino_bucket *)
++ __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+ if (!ivector_table) {
+ prom_printf("Fatal error, cannot allocate ivector_table\n");
+ prom_halt();
+@@ -940,6 +1100,15 @@ void __init init_IRQ(void)
+ ((unsigned long) ivector_table) + size);
+
+ ivector_table_pa = __pa(ivector_table);
++}
++
++/* Only invoked on boot processor. */
++void __init init_IRQ(void)
++{
++ irq_init_hv();
++ irq_ivector_init();
++ map_prom_timers();
++ kill_prom_timer();
+
+ if (tlb_type == hypervisor)
+ sun4v_init_mondo_queues();
+diff --git a/arch/sparc/kernel/ktlb.S b/arch/sparc/kernel/ktlb.S
+index 605d49204580..ef0d8e9e1210 100644
+--- a/arch/sparc/kernel/ktlb.S
++++ b/arch/sparc/kernel/ktlb.S
+@@ -47,14 +47,6 @@ kvmap_itlb_vmalloc_addr:
+ KERN_PGTABLE_WALK(%g4, %g5, %g2, kvmap_itlb_longpath)
+
+ TSB_LOCK_TAG(%g1, %g2, %g7)
+-
+- /* Load and check PTE. */
+- ldxa [%g5] ASI_PHYS_USE_EC, %g5
+- mov 1, %g7
+- sllx %g7, TSB_TAG_INVALID_BIT, %g7
+- brgez,a,pn %g5, kvmap_itlb_longpath
+- TSB_STORE(%g1, %g7)
+-
+ TSB_WRITE(%g1, %g5, %g6)
+
+ /* fallthrough to TLB load */
+@@ -118,6 +110,12 @@ kvmap_dtlb_obp:
+ ba,pt %xcc, kvmap_dtlb_load
+ nop
+
++kvmap_linear_early:
++ sethi %hi(kern_linear_pte_xor), %g7
++ ldx [%g7 + %lo(kern_linear_pte_xor)], %g2
++ ba,pt %xcc, kvmap_dtlb_tsb4m_load
++ xor %g2, %g4, %g5
++
+ .align 32
+ kvmap_dtlb_tsb4m_load:
+ TSB_LOCK_TAG(%g1, %g2, %g7)
+@@ -146,105 +144,17 @@ kvmap_dtlb_4v:
+ /* Correct TAG_TARGET is already in %g6, check 4mb TSB. */
+ KERN_TSB4M_LOOKUP_TL1(%g6, %g5, %g1, %g2, %g3, kvmap_dtlb_load)
+ #endif
+- /* TSB entry address left in %g1, lookup linear PTE.
+- * Must preserve %g1 and %g6 (TAG).
+- */
+-kvmap_dtlb_tsb4m_miss:
+- /* Clear the PAGE_OFFSET top virtual bits, shift
+- * down to get PFN, and make sure PFN is in range.
+- */
+-661: sllx %g4, 0, %g5
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- /* Check to see if we know about valid memory at the 4MB
+- * chunk this physical address will reside within.
++ /* Linear mapping TSB lookup failed. Fallthrough to kernel
++ * page table based lookup.
+ */
+-661: srlx %g5, MAX_PHYS_ADDRESS_BITS, %g2
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- brnz,pn %g2, kvmap_dtlb_longpath
+- nop
+-
+- /* This unconditional branch and delay-slot nop gets patched
+- * by the sethi sequence once the bitmap is properly setup.
+- */
+- .globl valid_addr_bitmap_insn
+-valid_addr_bitmap_insn:
+- ba,pt %xcc, 2f
+- nop
+- .subsection 2
+- .globl valid_addr_bitmap_patch
+-valid_addr_bitmap_patch:
+- sethi %hi(sparc64_valid_addr_bitmap), %g7
+- or %g7, %lo(sparc64_valid_addr_bitmap), %g7
+- .previous
+-
+-661: srlx %g5, ILOG2_4MB, %g2
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- srlx %g2, 6, %g5
+- and %g2, 63, %g2
+- sllx %g5, 3, %g5
+- ldx [%g7 + %g5], %g5
+- mov 1, %g7
+- sllx %g7, %g2, %g7
+- andcc %g5, %g7, %g0
+- be,pn %xcc, kvmap_dtlb_longpath
+-
+-2: sethi %hi(kpte_linear_bitmap), %g2
+-
+- /* Get the 256MB physical address index. */
+-661: sllx %g4, 0, %g5
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- or %g2, %lo(kpte_linear_bitmap), %g2
+-
+-661: srlx %g5, ILOG2_256MB, %g5
+- .section .page_offset_shift_patch, "ax"
+- .word 661b
+- .previous
+-
+- and %g5, (32 - 1), %g7
+-
+- /* Divide by 32 to get the offset into the bitmask. */
+- srlx %g5, 5, %g5
+- add %g7, %g7, %g7
+- sllx %g5, 3, %g5
+-
+- /* kern_linear_pte_xor[(mask >> shift) & 3)] */
+- ldx [%g2 + %g5], %g2
+- srlx %g2, %g7, %g7
+- sethi %hi(kern_linear_pte_xor), %g5
+- and %g7, 3, %g7
+- or %g5, %lo(kern_linear_pte_xor), %g5
+- sllx %g7, 3, %g7
+- ldx [%g5 + %g7], %g2
+-
+ .globl kvmap_linear_patch
+ kvmap_linear_patch:
+- ba,pt %xcc, kvmap_dtlb_tsb4m_load
+- xor %g2, %g4, %g5
++ ba,a,pt %xcc, kvmap_linear_early
+
+ kvmap_dtlb_vmalloc_addr:
+ KERN_PGTABLE_WALK(%g4, %g5, %g2, kvmap_dtlb_longpath)
+
+ TSB_LOCK_TAG(%g1, %g2, %g7)
+-
+- /* Load and check PTE. */
+- ldxa [%g5] ASI_PHYS_USE_EC, %g5
+- mov 1, %g7
+- sllx %g7, TSB_TAG_INVALID_BIT, %g7
+- brgez,a,pn %g5, kvmap_dtlb_longpath
+- TSB_STORE(%g1, %g7)
+-
+ TSB_WRITE(%g1, %g5, %g6)
+
+ /* fallthrough to TLB load */
+@@ -276,13 +186,8 @@ kvmap_dtlb_load:
+
+ #ifdef CONFIG_SPARSEMEM_VMEMMAP
+ kvmap_vmemmap:
+- sub %g4, %g5, %g5
+- srlx %g5, ILOG2_4MB, %g5
+- sethi %hi(vmemmap_table), %g1
+- sllx %g5, 3, %g5
+- or %g1, %lo(vmemmap_table), %g1
+- ba,pt %xcc, kvmap_dtlb_load
+- ldx [%g1 + %g5], %g5
++ KERN_PGTABLE_WALK(%g4, %g5, %g2, kvmap_dtlb_longpath)
++ ba,a,pt %xcc, kvmap_dtlb_load
+ #endif
+
+ kvmap_dtlb_nonlinear:
+@@ -294,8 +199,8 @@ kvmap_dtlb_nonlinear:
+
+ #ifdef CONFIG_SPARSEMEM_VMEMMAP
+ /* Do not use the TSB for vmemmap. */
+- mov (VMEMMAP_BASE >> 40), %g5
+- sllx %g5, 40, %g5
++ sethi %hi(VMEMMAP_BASE), %g5
++ ldx [%g5 + %lo(VMEMMAP_BASE)], %g5
+ cmp %g4,%g5
+ bgeu,pn %xcc, kvmap_vmemmap
+ nop
+@@ -307,8 +212,8 @@ kvmap_dtlb_tsbmiss:
+ sethi %hi(MODULES_VADDR), %g5
+ cmp %g4, %g5
+ blu,pn %xcc, kvmap_dtlb_longpath
+- mov (VMALLOC_END >> 40), %g5
+- sllx %g5, 40, %g5
++ sethi %hi(VMALLOC_END), %g5
++ ldx [%g5 + %lo(VMALLOC_END)], %g5
+ cmp %g4, %g5
+ bgeu,pn %xcc, kvmap_dtlb_longpath
+ nop
+diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
+index 66dacd56bb10..27bb55485472 100644
+--- a/arch/sparc/kernel/ldc.c
++++ b/arch/sparc/kernel/ldc.c
+@@ -1078,7 +1078,8 @@ static void ldc_iommu_release(struct ldc_channel *lp)
+
+ struct ldc_channel *ldc_alloc(unsigned long id,
+ const struct ldc_channel_config *cfgp,
+- void *event_arg)
++ void *event_arg,
++ const char *name)
+ {
+ struct ldc_channel *lp;
+ const struct ldc_mode_ops *mops;
+@@ -1093,6 +1094,8 @@ struct ldc_channel *ldc_alloc(unsigned long id,
+ err = -EINVAL;
+ if (!cfgp)
+ goto out_err;
++ if (!name)
++ goto out_err;
+
+ switch (cfgp->mode) {
+ case LDC_MODE_RAW:
+@@ -1185,6 +1188,21 @@ struct ldc_channel *ldc_alloc(unsigned long id,
+
+ INIT_HLIST_HEAD(&lp->mh_list);
+
++ snprintf(lp->rx_irq_name, LDC_IRQ_NAME_MAX, "%s RX", name);
++ snprintf(lp->tx_irq_name, LDC_IRQ_NAME_MAX, "%s TX", name);
++
++ err = request_irq(lp->cfg.rx_irq, ldc_rx, 0,
++ lp->rx_irq_name, lp);
++ if (err)
++ goto out_free_txq;
++
++ err = request_irq(lp->cfg.tx_irq, ldc_tx, 0,
++ lp->tx_irq_name, lp);
++ if (err) {
++ free_irq(lp->cfg.rx_irq, lp);
++ goto out_free_txq;
++ }
++
+ return lp;
+
+ out_free_txq:
+@@ -1237,31 +1255,14 @@ EXPORT_SYMBOL(ldc_free);
+ * state. This does not initiate a handshake, ldc_connect() does
+ * that.
+ */
+-int ldc_bind(struct ldc_channel *lp, const char *name)
++int ldc_bind(struct ldc_channel *lp)
+ {
+ unsigned long hv_err, flags;
+ int err = -EINVAL;
+
+- if (!name ||
+- (lp->state != LDC_STATE_INIT))
++ if (lp->state != LDC_STATE_INIT)
+ return -EINVAL;
+
+- snprintf(lp->rx_irq_name, LDC_IRQ_NAME_MAX, "%s RX", name);
+- snprintf(lp->tx_irq_name, LDC_IRQ_NAME_MAX, "%s TX", name);
+-
+- err = request_irq(lp->cfg.rx_irq, ldc_rx, 0,
+- lp->rx_irq_name, lp);
+- if (err)
+- return err;
+-
+- err = request_irq(lp->cfg.tx_irq, ldc_tx, 0,
+- lp->tx_irq_name, lp);
+- if (err) {
+- free_irq(lp->cfg.rx_irq, lp);
+- return err;
+- }
+-
+-
+ spin_lock_irqsave(&lp->lock, flags);
+
+ enable_irq(lp->cfg.rx_irq);
+diff --git a/arch/sparc/kernel/nmi.c b/arch/sparc/kernel/nmi.c
+index 337094556916..5b1151dcba13 100644
+--- a/arch/sparc/kernel/nmi.c
++++ b/arch/sparc/kernel/nmi.c
+@@ -130,7 +130,6 @@ static inline unsigned int get_nmi_count(int cpu)
+
+ static __init void nmi_cpu_busy(void *data)
+ {
+- local_irq_enable_in_hardirq();
+ while (endflag == 0)
+ mb();
+ }
+diff --git a/arch/sparc/kernel/pcr.c b/arch/sparc/kernel/pcr.c
+index 269af58497aa..7e967c8018c8 100644
+--- a/arch/sparc/kernel/pcr.c
++++ b/arch/sparc/kernel/pcr.c
+@@ -191,12 +191,41 @@ static const struct pcr_ops n4_pcr_ops = {
+ .pcr_nmi_disable = PCR_N4_PICNPT,
+ };
+
++static u64 n5_pcr_read(unsigned long reg_num)
++{
++ unsigned long val;
++
++ (void) sun4v_t5_get_perfreg(reg_num, &val);
++
++ return val;
++}
++
++static void n5_pcr_write(unsigned long reg_num, u64 val)
++{
++ (void) sun4v_t5_set_perfreg(reg_num, val);
++}
++
++static const struct pcr_ops n5_pcr_ops = {
++ .read_pcr = n5_pcr_read,
++ .write_pcr = n5_pcr_write,
++ .read_pic = n4_pic_read,
++ .write_pic = n4_pic_write,
++ .nmi_picl_value = n4_picl_value,
++ .pcr_nmi_enable = (PCR_N4_PICNPT | PCR_N4_STRACE |
++ PCR_N4_UTRACE | PCR_N4_TOE |
++ (26 << PCR_N4_SL_SHIFT)),
++ .pcr_nmi_disable = PCR_N4_PICNPT,
++};
++
++
+ static unsigned long perf_hsvc_group;
+ static unsigned long perf_hsvc_major;
+ static unsigned long perf_hsvc_minor;
+
+ static int __init register_perf_hsvc(void)
+ {
++ unsigned long hverror;
++
+ if (tlb_type == hypervisor) {
+ switch (sun4v_chip_type) {
+ case SUN4V_CHIP_NIAGARA1:
+@@ -215,6 +244,10 @@ static int __init register_perf_hsvc(void)
+ perf_hsvc_group = HV_GRP_VT_CPU;
+ break;
+
++ case SUN4V_CHIP_NIAGARA5:
++ perf_hsvc_group = HV_GRP_T5_CPU;
++ break;
++
+ default:
+ return -ENODEV;
+ }
+@@ -222,10 +255,12 @@ static int __init register_perf_hsvc(void)
+
+ perf_hsvc_major = 1;
+ perf_hsvc_minor = 0;
+- if (sun4v_hvapi_register(perf_hsvc_group,
+- perf_hsvc_major,
+- &perf_hsvc_minor)) {
+- printk("perfmon: Could not register hvapi.\n");
++ hverror = sun4v_hvapi_register(perf_hsvc_group,
++ perf_hsvc_major,
++ &perf_hsvc_minor);
++ if (hverror) {
++ pr_err("perfmon: Could not register hvapi(0x%lx).\n",
++ hverror);
+ return -ENODEV;
+ }
+ }
+@@ -254,6 +289,10 @@ static int __init setup_sun4v_pcr_ops(void)
+ pcr_ops = &n4_pcr_ops;
+ break;
+
++ case SUN4V_CHIP_NIAGARA5:
++ pcr_ops = &n5_pcr_ops;
++ break;
++
+ default:
+ ret = -ENODEV;
+ break;
+diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
+index 8efd33753ad3..c9759ad3f34a 100644
+--- a/arch/sparc/kernel/perf_event.c
++++ b/arch/sparc/kernel/perf_event.c
+@@ -1662,7 +1662,8 @@ static bool __init supported_pmu(void)
+ sparc_pmu = &niagara2_pmu;
+ return true;
+ }
+- if (!strcmp(sparc_pmu_type, "niagara4")) {
++ if (!strcmp(sparc_pmu_type, "niagara4") ||
++ !strcmp(sparc_pmu_type, "niagara5")) {
+ sparc_pmu = &niagara4_pmu;
+ return true;
+ }
+@@ -1671,9 +1672,12 @@ static bool __init supported_pmu(void)
+
+ static int __init init_hw_perf_events(void)
+ {
++ int err;
++
+ pr_info("Performance events: ");
+
+- if (!supported_pmu()) {
++ err = pcr_arch_init();
++ if (err || !supported_pmu()) {
+ pr_cont("No support for PMU type '%s'\n", sparc_pmu_type);
+ return 0;
+ }
+@@ -1685,7 +1689,7 @@ static int __init init_hw_perf_events(void)
+
+ return 0;
+ }
+-early_initcall(init_hw_perf_events);
++pure_initcall(init_hw_perf_events);
+
+ void perf_callchain_kernel(struct perf_callchain_entry *entry,
+ struct pt_regs *regs)
+diff --git a/arch/sparc/kernel/process_64.c b/arch/sparc/kernel/process_64.c
+index 027e09986194..0be7bf978cb1 100644
+--- a/arch/sparc/kernel/process_64.c
++++ b/arch/sparc/kernel/process_64.c
+@@ -312,6 +312,9 @@ static void __global_pmu_self(int this_cpu)
+ struct global_pmu_snapshot *pp;
+ int i, num;
+
++ if (!pcr_ops)
++ return;
++
+ pp = &global_cpu_snapshot[this_cpu].pmu;
+
+ num = 1;
+diff --git a/arch/sparc/kernel/setup_64.c b/arch/sparc/kernel/setup_64.c
+index 3fdb455e3318..61a519808cb7 100644
+--- a/arch/sparc/kernel/setup_64.c
++++ b/arch/sparc/kernel/setup_64.c
+@@ -30,6 +30,7 @@
+ #include <linux/cpu.h>
+ #include <linux/initrd.h>
+ #include <linux/module.h>
++#include <linux/start_kernel.h>
+
+ #include <asm/io.h>
+ #include <asm/processor.h>
+@@ -174,7 +175,7 @@ char reboot_command[COMMAND_LINE_SIZE];
+
+ static struct pt_regs fake_swapper_regs = { { 0, }, 0, 0, 0, 0 };
+
+-void __init per_cpu_patch(void)
++static void __init per_cpu_patch(void)
+ {
+ struct cpuid_patch_entry *p;
+ unsigned long ver;
+@@ -266,7 +267,7 @@ void sun4v_patch_2insn_range(struct sun4v_2insn_patch_entry *start,
+ }
+ }
+
+-void __init sun4v_patch(void)
++static void __init sun4v_patch(void)
+ {
+ extern void sun4v_hvapi_init(void);
+
+@@ -335,14 +336,25 @@ static void __init pause_patch(void)
+ }
+ }
+
+-#ifdef CONFIG_SMP
+-void __init boot_cpu_id_too_large(int cpu)
++void __init start_early_boot(void)
+ {
+- prom_printf("Serious problem, boot cpu id (%d) >= NR_CPUS (%d)\n",
+- cpu, NR_CPUS);
+- prom_halt();
++ int cpu;
++
++ check_if_starfire();
++ per_cpu_patch();
++ sun4v_patch();
++
++ cpu = hard_smp_processor_id();
++ if (cpu >= NR_CPUS) {
++ prom_printf("Serious problem, boot cpu id (%d) >= NR_CPUS (%d)\n",
++ cpu, NR_CPUS);
++ prom_halt();
++ }
++ current_thread_info()->cpu = cpu;
++
++ prom_init_report();
++ start_kernel();
+ }
+-#endif
+
+ /* On Ultra, we support all of the v8 capabilities. */
+ unsigned long sparc64_elf_hwcap = (HWCAP_SPARC_FLUSH | HWCAP_SPARC_STBAR |
+@@ -500,12 +512,16 @@ static void __init init_sparc64_elf_hwcap(void)
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= HWCAP_SPARC_BLKINIT;
+ if (sun4v_chip_type == SUN4V_CHIP_NIAGARA2 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= HWCAP_SPARC_N2;
+ }
+@@ -533,6 +549,8 @@ static void __init init_sparc64_elf_hwcap(void)
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= (AV_SPARC_VIS | AV_SPARC_VIS2 |
+ AV_SPARC_ASI_BLK_INIT |
+@@ -540,6 +558,8 @@ static void __init init_sparc64_elf_hwcap(void)
+ if (sun4v_chip_type == SUN4V_CHIP_NIAGARA3 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA4 ||
+ sun4v_chip_type == SUN4V_CHIP_NIAGARA5 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M6 ||
++ sun4v_chip_type == SUN4V_CHIP_SPARC_M7 ||
+ sun4v_chip_type == SUN4V_CHIP_SPARC64X)
+ cap |= (AV_SPARC_VIS3 | AV_SPARC_HPC |
+ AV_SPARC_FMAF);
+diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
+index 41aa2478f3ca..c9300bfaee5a 100644
+--- a/arch/sparc/kernel/smp_64.c
++++ b/arch/sparc/kernel/smp_64.c
+@@ -1383,7 +1383,6 @@ void __cpu_die(unsigned int cpu)
+
+ void __init smp_cpus_done(unsigned int max_cpus)
+ {
+- pcr_arch_init();
+ }
+
+ void smp_send_reschedule(int cpu)
+@@ -1468,6 +1467,13 @@ static void __init pcpu_populate_pte(unsigned long addr)
+ pud_t *pud;
+ pmd_t *pmd;
+
++ if (pgd_none(*pgd)) {
++ pud_t *new;
++
++ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
++ pgd_populate(&init_mm, pgd, new);
++ }
++
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud)) {
+ pmd_t *new;
+diff --git a/arch/sparc/kernel/sun4v_tlb_miss.S b/arch/sparc/kernel/sun4v_tlb_miss.S
+index e0c09bf85610..6179e19bc9b9 100644
+--- a/arch/sparc/kernel/sun4v_tlb_miss.S
++++ b/arch/sparc/kernel/sun4v_tlb_miss.S
+@@ -195,6 +195,11 @@ sun4v_tsb_miss_common:
+ ldx [%g2 + TRAP_PER_CPU_PGD_PADDR], %g7
+
+ sun4v_itlb_error:
++ rdpr %tl, %g1
++ cmp %g1, 1
++ ble,pt %icc, sun4v_bad_ra
++ or %g0, FAULT_CODE_BAD_RA | FAULT_CODE_ITLB, %g1
++
+ sethi %hi(sun4v_err_itlb_vaddr), %g1
+ stx %g4, [%g1 + %lo(sun4v_err_itlb_vaddr)]
+ sethi %hi(sun4v_err_itlb_ctx), %g1
+@@ -206,15 +211,10 @@ sun4v_itlb_error:
+ sethi %hi(sun4v_err_itlb_error), %g1
+ stx %o0, [%g1 + %lo(sun4v_err_itlb_error)]
+
++ sethi %hi(1f), %g7
+ rdpr %tl, %g4
+- cmp %g4, 1
+- ble,pt %icc, 1f
+- sethi %hi(2f), %g7
+ ba,pt %xcc, etraptl1
+- or %g7, %lo(2f), %g7
+-
+-1: ba,pt %xcc, etrap
+-2: or %g7, %lo(2b), %g7
++1: or %g7, %lo(1f), %g7
+ mov %l4, %o1
+ call sun4v_itlb_error_report
+ add %sp, PTREGS_OFF, %o0
+@@ -222,6 +222,11 @@ sun4v_itlb_error:
+ /* NOTREACHED */
+
+ sun4v_dtlb_error:
++ rdpr %tl, %g1
++ cmp %g1, 1
++ ble,pt %icc, sun4v_bad_ra
++ or %g0, FAULT_CODE_BAD_RA | FAULT_CODE_DTLB, %g1
++
+ sethi %hi(sun4v_err_dtlb_vaddr), %g1
+ stx %g4, [%g1 + %lo(sun4v_err_dtlb_vaddr)]
+ sethi %hi(sun4v_err_dtlb_ctx), %g1
+@@ -233,21 +238,23 @@ sun4v_dtlb_error:
+ sethi %hi(sun4v_err_dtlb_error), %g1
+ stx %o0, [%g1 + %lo(sun4v_err_dtlb_error)]
+
++ sethi %hi(1f), %g7
+ rdpr %tl, %g4
+- cmp %g4, 1
+- ble,pt %icc, 1f
+- sethi %hi(2f), %g7
+ ba,pt %xcc, etraptl1
+- or %g7, %lo(2f), %g7
+-
+-1: ba,pt %xcc, etrap
+-2: or %g7, %lo(2b), %g7
++1: or %g7, %lo(1f), %g7
+ mov %l4, %o1
+ call sun4v_dtlb_error_report
+ add %sp, PTREGS_OFF, %o0
+
+ /* NOTREACHED */
+
++sun4v_bad_ra:
++ or %g0, %g4, %g5
++ ba,pt %xcc, sparc64_realfault_common
++ or %g1, %g0, %g4
++
++ /* NOTREACHED */
++
+ /* Instruction Access Exception, tl0. */
+ sun4v_iacc:
+ ldxa [%g0] ASI_SCRATCHPAD, %g2
+diff --git a/arch/sparc/kernel/trampoline_64.S b/arch/sparc/kernel/trampoline_64.S
+index 737f8cbc7d56..88ede1d53b4c 100644
+--- a/arch/sparc/kernel/trampoline_64.S
++++ b/arch/sparc/kernel/trampoline_64.S
+@@ -109,10 +109,13 @@ startup_continue:
+ brnz,pn %g1, 1b
+ nop
+
+- sethi %hi(p1275buf), %g2
+- or %g2, %lo(p1275buf), %g2
+- ldx [%g2 + 0x10], %l2
+- add %l2, -(192 + 128), %sp
++ /* Get onto temporary stack which will be in the locked
++ * kernel image.
++ */
++ sethi %hi(tramp_stack), %g1
++ or %g1, %lo(tramp_stack), %g1
++ add %g1, TRAMP_STACK_SIZE, %g1
++ sub %g1, STACKFRAME_SZ + STACK_BIAS + 256, %sp
+ flushw
+
+ /* Setup the loop variables:
+@@ -394,7 +397,6 @@ after_lock_tlb:
+ sllx %g5, THREAD_SHIFT, %g5
+ sub %g5, (STACKFRAME_SZ + STACK_BIAS), %g5
+ add %g6, %g5, %sp
+- mov 0, %fp
+
+ rdpr %pstate, %o1
+ or %o1, PSTATE_IE, %o1
+diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
+index fb6640ec8557..981a769b9558 100644
+--- a/arch/sparc/kernel/traps_64.c
++++ b/arch/sparc/kernel/traps_64.c
+@@ -2104,6 +2104,11 @@ void sun4v_nonresum_overflow(struct pt_regs *regs)
+ atomic_inc(&sun4v_nonresum_oflow_cnt);
+ }
+
++static void sun4v_tlb_error(struct pt_regs *regs)
++{
++ die_if_kernel("TLB/TSB error", regs);
++}
++
+ unsigned long sun4v_err_itlb_vaddr;
+ unsigned long sun4v_err_itlb_ctx;
+ unsigned long sun4v_err_itlb_pte;
+@@ -2111,8 +2116,7 @@ unsigned long sun4v_err_itlb_error;
+
+ void sun4v_itlb_error_report(struct pt_regs *regs, int tl)
+ {
+- if (tl > 1)
+- dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
++ dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
+
+ printk(KERN_EMERG "SUN4V-ITLB: Error at TPC[%lx], tl %d\n",
+ regs->tpc, tl);
+@@ -2125,7 +2129,7 @@ void sun4v_itlb_error_report(struct pt_regs *regs, int tl)
+ sun4v_err_itlb_vaddr, sun4v_err_itlb_ctx,
+ sun4v_err_itlb_pte, sun4v_err_itlb_error);
+
+- prom_halt();
++ sun4v_tlb_error(regs);
+ }
+
+ unsigned long sun4v_err_dtlb_vaddr;
+@@ -2135,8 +2139,7 @@ unsigned long sun4v_err_dtlb_error;
+
+ void sun4v_dtlb_error_report(struct pt_regs *regs, int tl)
+ {
+- if (tl > 1)
+- dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
++ dump_tl1_traplog((struct tl1_traplog *)(regs + 1));
+
+ printk(KERN_EMERG "SUN4V-DTLB: Error at TPC[%lx], tl %d\n",
+ regs->tpc, tl);
+@@ -2149,7 +2152,7 @@ void sun4v_dtlb_error_report(struct pt_regs *regs, int tl)
+ sun4v_err_dtlb_vaddr, sun4v_err_dtlb_ctx,
+ sun4v_err_dtlb_pte, sun4v_err_dtlb_error);
+
+- prom_halt();
++ sun4v_tlb_error(regs);
+ }
+
+ void hypervisor_tlbop_error(unsigned long err, unsigned long op)
+diff --git a/arch/sparc/kernel/tsb.S b/arch/sparc/kernel/tsb.S
+index 14158d40ba76..be98685c14c6 100644
+--- a/arch/sparc/kernel/tsb.S
++++ b/arch/sparc/kernel/tsb.S
+@@ -162,10 +162,10 @@ tsb_miss_page_table_walk_sun4v_fastpath:
+ nop
+ .previous
+
+- rdpr %tl, %g3
+- cmp %g3, 1
++ rdpr %tl, %g7
++ cmp %g7, 1
+ bne,pn %xcc, winfix_trampoline
+- nop
++ mov %g3, %g4
+ ba,pt %xcc, etrap
+ rd %pc, %g7
+ call hugetlb_setup
+diff --git a/arch/sparc/kernel/viohs.c b/arch/sparc/kernel/viohs.c
+index f8e7dd53e1c7..9c5fbd0b8a04 100644
+--- a/arch/sparc/kernel/viohs.c
++++ b/arch/sparc/kernel/viohs.c
+@@ -714,7 +714,7 @@ int vio_ldc_alloc(struct vio_driver_state *vio,
+ cfg.tx_irq = vio->vdev->tx_irq;
+ cfg.rx_irq = vio->vdev->rx_irq;
+
+- lp = ldc_alloc(vio->vdev->channel_id, &cfg, event_arg);
++ lp = ldc_alloc(vio->vdev->channel_id, &cfg, event_arg, vio->name);
+ if (IS_ERR(lp))
+ return PTR_ERR(lp);
+
+@@ -746,7 +746,7 @@ void vio_port_up(struct vio_driver_state *vio)
+
+ err = 0;
+ if (state == LDC_STATE_INIT) {
+- err = ldc_bind(vio->lp, vio->name);
++ err = ldc_bind(vio->lp);
+ if (err)
+ printk(KERN_WARNING "%s: Port %lu bind failed, "
+ "err=%d\n",
+diff --git a/arch/sparc/kernel/vmlinux.lds.S b/arch/sparc/kernel/vmlinux.lds.S
+index 932ff90fd760..09243057cb0b 100644
+--- a/arch/sparc/kernel/vmlinux.lds.S
++++ b/arch/sparc/kernel/vmlinux.lds.S
+@@ -35,8 +35,9 @@ jiffies = jiffies_64;
+
+ SECTIONS
+ {
+- /* swapper_low_pmd_dir is sparc64 only */
+- swapper_low_pmd_dir = 0x0000000000402000;
++#ifdef CONFIG_SPARC64
++ swapper_pg_dir = 0x0000000000402000;
++#endif
+ . = INITIAL_ADDRESS;
+ .text TEXTSTART :
+ {
+@@ -122,11 +123,6 @@ SECTIONS
+ *(.swapper_4m_tsb_phys_patch)
+ __swapper_4m_tsb_phys_patch_end = .;
+ }
+- .page_offset_shift_patch : {
+- __page_offset_shift_patch = .;
+- *(.page_offset_shift_patch)
+- __page_offset_shift_patch_end = .;
+- }
+ .popc_3insn_patch : {
+ __popc_3insn_patch = .;
+ *(.popc_3insn_patch)
+diff --git a/arch/sparc/lib/NG4memcpy.S b/arch/sparc/lib/NG4memcpy.S
+index 9cf2ee01cee3..140527a20e7d 100644
+--- a/arch/sparc/lib/NG4memcpy.S
++++ b/arch/sparc/lib/NG4memcpy.S
+@@ -41,6 +41,10 @@
+ #endif
+ #endif
+
++#if !defined(EX_LD) && !defined(EX_ST)
++#define NON_USER_COPY
++#endif
++
+ #ifndef EX_LD
+ #define EX_LD(x) x
+ #endif
+@@ -197,9 +201,13 @@ FUNC_NAME: /* %o0=dst, %o1=src, %o2=len */
+ mov EX_RETVAL(%o3), %o0
+
+ .Llarge_src_unaligned:
++#ifdef NON_USER_COPY
++ VISEntryHalfFast(.Lmedium_vis_entry_fail)
++#else
++ VISEntryHalf
++#endif
+ andn %o2, 0x3f, %o4
+ sub %o2, %o4, %o2
+- VISEntryHalf
+ alignaddr %o1, %g0, %g1
+ add %o1, %o4, %o1
+ EX_LD(LOAD(ldd, %g1 + 0x00, %f0))
+@@ -240,6 +248,10 @@ FUNC_NAME: /* %o0=dst, %o1=src, %o2=len */
+ nop
+ ba,a,pt %icc, .Lmedium_unaligned
+
++#ifdef NON_USER_COPY
++.Lmedium_vis_entry_fail:
++ or %o0, %o1, %g2
++#endif
+ .Lmedium:
+ LOAD(prefetch, %o1 + 0x40, #n_reads_strong)
+ andcc %g2, 0x7, %g0
+diff --git a/arch/sparc/lib/memset.S b/arch/sparc/lib/memset.S
+index 99c017be8719..f75e6906df14 100644
+--- a/arch/sparc/lib/memset.S
++++ b/arch/sparc/lib/memset.S
+@@ -3,8 +3,9 @@
+ * Copyright (C) 1996,1997 Jakub Jelinek (jj@sunsite.mff.cuni.cz)
+ * Copyright (C) 1996 David S. Miller (davem@caip.rutgers.edu)
+ *
+- * Returns 0, if ok, and number of bytes not yet set if exception
+- * occurs and we were called as clear_user.
++ * Calls to memset returns initial %o0. Calls to bzero returns 0, if ok, and
++ * number of bytes not yet set if exception occurs and we were called as
++ * clear_user.
+ */
+
+ #include <asm/ptrace.h>
+@@ -65,6 +66,8 @@ __bzero_begin:
+ .globl __memset_start, __memset_end
+ __memset_start:
+ memset:
++ mov %o0, %g1
++ mov 1, %g4
+ and %o1, 0xff, %g3
+ sll %g3, 8, %g2
+ or %g3, %g2, %g3
+@@ -89,6 +92,7 @@ memset:
+ sub %o0, %o2, %o0
+
+ __bzero:
++ clr %g4
+ mov %g0, %g3
+ 1:
+ cmp %o1, 7
+@@ -151,8 +155,8 @@ __bzero:
+ bne,a 8f
+ EX(stb %g3, [%o0], and %o1, 1)
+ 8:
+- retl
+- clr %o0
++ b 0f
++ nop
+ 7:
+ be 13b
+ orcc %o1, 0, %g0
+@@ -164,6 +168,12 @@ __bzero:
+ bne 8b
+ EX(stb %g3, [%o0 - 1], add %o1, 1)
+ 0:
++ andcc %g4, 1, %g0
++ be 5f
++ nop
++ retl
++ mov %g1, %o0
++5:
+ retl
+ clr %o0
+ __memset_end:
+diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
+index 587cd0565128..18fcd7167095 100644
+--- a/arch/sparc/mm/fault_64.c
++++ b/arch/sparc/mm/fault_64.c
+@@ -346,6 +346,9 @@ retry:
+ down_read(&mm->mmap_sem);
+ }
+
++ if (fault_code & FAULT_CODE_BAD_RA)
++ goto do_sigbus;
++
+ vma = find_vma(mm, address);
+ if (!vma)
+ goto bad_area;
+diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
+index 1aed0432c64b..ae6ce383d4df 100644
+--- a/arch/sparc/mm/gup.c
++++ b/arch/sparc/mm/gup.c
+@@ -160,6 +160,36 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
+ return 1;
+ }
+
++int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
++ struct page **pages)
++{
++ struct mm_struct *mm = current->mm;
++ unsigned long addr, len, end;
++ unsigned long next, flags;
++ pgd_t *pgdp;
++ int nr = 0;
++
++ start &= PAGE_MASK;
++ addr = start;
++ len = (unsigned long) nr_pages << PAGE_SHIFT;
++ end = start + len;
++
++ local_irq_save(flags);
++ pgdp = pgd_offset(mm, addr);
++ do {
++ pgd_t pgd = *pgdp;
++
++ next = pgd_addr_end(addr, end);
++ if (pgd_none(pgd))
++ break;
++ if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
++ break;
++ } while (pgdp++, addr = next, addr != end);
++ local_irq_restore(flags);
++
++ return nr;
++}
++
+ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
+ struct page **pages)
+ {
+diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
+index 2cfb0f25e0ed..bbb9371f519b 100644
+--- a/arch/sparc/mm/init_64.c
++++ b/arch/sparc/mm/init_64.c
+@@ -74,7 +74,6 @@ unsigned long kern_linear_pte_xor[4] __read_mostly;
+ * 'cpu' properties, but we need to have this table setup before the
+ * MDESC is initialized.
+ */
+-unsigned long kpte_linear_bitmap[KPTE_BITMAP_BYTES / sizeof(unsigned long)];
+
+ #ifndef CONFIG_DEBUG_PAGEALLOC
+ /* A special kernel TSB for 4MB, 256MB, 2GB and 16GB linear mappings.
+@@ -83,10 +82,11 @@ unsigned long kpte_linear_bitmap[KPTE_BITMAP_BYTES / sizeof(unsigned long)];
+ */
+ extern struct tsb swapper_4m_tsb[KERNEL_TSB4M_NENTRIES];
+ #endif
++extern struct tsb swapper_tsb[KERNEL_TSB_NENTRIES];
+
+ static unsigned long cpu_pgsz_mask;
+
+-#define MAX_BANKS 32
++#define MAX_BANKS 1024
+
+ static struct linux_prom64_registers pavail[MAX_BANKS];
+ static int pavail_ents;
+@@ -164,10 +164,6 @@ static void __init read_obp_memory(const char *property,
+ cmp_p64, NULL);
+ }
+
+-unsigned long sparc64_valid_addr_bitmap[VALID_ADDR_BITMAP_BYTES /
+- sizeof(unsigned long)];
+-EXPORT_SYMBOL(sparc64_valid_addr_bitmap);
+-
+ /* Kernel physical address base and size in bytes. */
+ unsigned long kern_base __read_mostly;
+ unsigned long kern_size __read_mostly;
+@@ -839,7 +835,10 @@ static int find_node(unsigned long addr)
+ if ((addr & p->mask) == p->val)
+ return i;
+ }
+- return -1;
++ /* The following condition has been observed on LDOM guests.*/
++ WARN_ONCE(1, "find_node: A physical address doesn't match a NUMA node"
++ " rule. Some physical memory will be owned by node 0.");
++ return 0;
+ }
+
+ static u64 memblock_nid_range(u64 start, u64 end, int *nid)
+@@ -1365,9 +1364,144 @@ static unsigned long __init bootmem_init(unsigned long phys_base)
+ static struct linux_prom64_registers pall[MAX_BANKS] __initdata;
+ static int pall_ents __initdata;
+
+-#ifdef CONFIG_DEBUG_PAGEALLOC
++static unsigned long max_phys_bits = 40;
++
++bool kern_addr_valid(unsigned long addr)
++{
++ pgd_t *pgd;
++ pud_t *pud;
++ pmd_t *pmd;
++ pte_t *pte;
++
++ if ((long)addr < 0L) {
++ unsigned long pa = __pa(addr);
++
++ if ((addr >> max_phys_bits) != 0UL)
++ return false;
++
++ return pfn_valid(pa >> PAGE_SHIFT);
++ }
++
++ if (addr >= (unsigned long) KERNBASE &&
++ addr < (unsigned long)&_end)
++ return true;
++
++ pgd = pgd_offset_k(addr);
++ if (pgd_none(*pgd))
++ return 0;
++
++ pud = pud_offset(pgd, addr);
++ if (pud_none(*pud))
++ return 0;
++
++ if (pud_large(*pud))
++ return pfn_valid(pud_pfn(*pud));
++
++ pmd = pmd_offset(pud, addr);
++ if (pmd_none(*pmd))
++ return 0;
++
++ if (pmd_large(*pmd))
++ return pfn_valid(pmd_pfn(*pmd));
++
++ pte = pte_offset_kernel(pmd, addr);
++ if (pte_none(*pte))
++ return 0;
++
++ return pfn_valid(pte_pfn(*pte));
++}
++EXPORT_SYMBOL(kern_addr_valid);
++
++static unsigned long __ref kernel_map_hugepud(unsigned long vstart,
++ unsigned long vend,
++ pud_t *pud)
++{
++ const unsigned long mask16gb = (1UL << 34) - 1UL;
++ u64 pte_val = vstart;
++
++ /* Each PUD is 8GB */
++ if ((vstart & mask16gb) ||
++ (vend - vstart <= mask16gb)) {
++ pte_val ^= kern_linear_pte_xor[2];
++ pud_val(*pud) = pte_val | _PAGE_PUD_HUGE;
++
++ return vstart + PUD_SIZE;
++ }
++
++ pte_val ^= kern_linear_pte_xor[3];
++ pte_val |= _PAGE_PUD_HUGE;
++
++ vend = vstart + mask16gb + 1UL;
++ while (vstart < vend) {
++ pud_val(*pud) = pte_val;
++
++ pte_val += PUD_SIZE;
++ vstart += PUD_SIZE;
++ pud++;
++ }
++ return vstart;
++}
++
++static bool kernel_can_map_hugepud(unsigned long vstart, unsigned long vend,
++ bool guard)
++{
++ if (guard && !(vstart & ~PUD_MASK) && (vend - vstart) >= PUD_SIZE)
++ return true;
++
++ return false;
++}
++
++static unsigned long __ref kernel_map_hugepmd(unsigned long vstart,
++ unsigned long vend,
++ pmd_t *pmd)
++{
++ const unsigned long mask256mb = (1UL << 28) - 1UL;
++ const unsigned long mask2gb = (1UL << 31) - 1UL;
++ u64 pte_val = vstart;
++
++ /* Each PMD is 8MB */
++ if ((vstart & mask256mb) ||
++ (vend - vstart <= mask256mb)) {
++ pte_val ^= kern_linear_pte_xor[0];
++ pmd_val(*pmd) = pte_val | _PAGE_PMD_HUGE;
++
++ return vstart + PMD_SIZE;
++ }
++
++ if ((vstart & mask2gb) ||
++ (vend - vstart <= mask2gb)) {
++ pte_val ^= kern_linear_pte_xor[1];
++ pte_val |= _PAGE_PMD_HUGE;
++ vend = vstart + mask256mb + 1UL;
++ } else {
++ pte_val ^= kern_linear_pte_xor[2];
++ pte_val |= _PAGE_PMD_HUGE;
++ vend = vstart + mask2gb + 1UL;
++ }
++
++ while (vstart < vend) {
++ pmd_val(*pmd) = pte_val;
++
++ pte_val += PMD_SIZE;
++ vstart += PMD_SIZE;
++ pmd++;
++ }
++
++ return vstart;
++}
++
++static bool kernel_can_map_hugepmd(unsigned long vstart, unsigned long vend,
++ bool guard)
++{
++ if (guard && !(vstart & ~PMD_MASK) && (vend - vstart) >= PMD_SIZE)
++ return true;
++
++ return false;
++}
++
+ static unsigned long __ref kernel_map_range(unsigned long pstart,
+- unsigned long pend, pgprot_t prot)
++ unsigned long pend, pgprot_t prot,
++ bool use_huge)
+ {
+ unsigned long vstart = PAGE_OFFSET + pstart;
+ unsigned long vend = PAGE_OFFSET + pend;
+@@ -1386,19 +1520,34 @@ static unsigned long __ref kernel_map_range(unsigned long pstart,
+ pmd_t *pmd;
+ pte_t *pte;
+
++ if (pgd_none(*pgd)) {
++ pud_t *new;
++
++ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
++ alloc_bytes += PAGE_SIZE;
++ pgd_populate(&init_mm, pgd, new);
++ }
+ pud = pud_offset(pgd, vstart);
+ if (pud_none(*pud)) {
+ pmd_t *new;
+
++ if (kernel_can_map_hugepud(vstart, vend, use_huge)) {
++ vstart = kernel_map_hugepud(vstart, vend, pud);
++ continue;
++ }
+ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
+ alloc_bytes += PAGE_SIZE;
+ pud_populate(&init_mm, pud, new);
+ }
+
+ pmd = pmd_offset(pud, vstart);
+- if (!pmd_present(*pmd)) {
++ if (pmd_none(*pmd)) {
+ pte_t *new;
+
++ if (kernel_can_map_hugepmd(vstart, vend, use_huge)) {
++ vstart = kernel_map_hugepmd(vstart, vend, pmd);
++ continue;
++ }
+ new = __alloc_bootmem(PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
+ alloc_bytes += PAGE_SIZE;
+ pmd_populate_kernel(&init_mm, pmd, new);
+@@ -1421,100 +1570,34 @@ static unsigned long __ref kernel_map_range(unsigned long pstart,
+ return alloc_bytes;
+ }
+
+-extern unsigned int kvmap_linear_patch[1];
+-#endif /* CONFIG_DEBUG_PAGEALLOC */
+-
+-static void __init kpte_set_val(unsigned long index, unsigned long val)
++static void __init flush_all_kernel_tsbs(void)
+ {
+- unsigned long *ptr = kpte_linear_bitmap;
+-
+- val <<= ((index % (BITS_PER_LONG / 2)) * 2);
+- ptr += (index / (BITS_PER_LONG / 2));
+-
+- *ptr |= val;
+-}
+-
+-static const unsigned long kpte_shift_min = 28; /* 256MB */
+-static const unsigned long kpte_shift_max = 34; /* 16GB */
+-static const unsigned long kpte_shift_incr = 3;
+-
+-static unsigned long kpte_mark_using_shift(unsigned long start, unsigned long end,
+- unsigned long shift)
+-{
+- unsigned long size = (1UL << shift);
+- unsigned long mask = (size - 1UL);
+- unsigned long remains = end - start;
+- unsigned long val;
+-
+- if (remains < size || (start & mask))
+- return start;
+-
+- /* VAL maps:
+- *
+- * shift 28 --> kern_linear_pte_xor index 1
+- * shift 31 --> kern_linear_pte_xor index 2
+- * shift 34 --> kern_linear_pte_xor index 3
+- */
+- val = ((shift - kpte_shift_min) / kpte_shift_incr) + 1;
+-
+- remains &= ~mask;
+- if (shift != kpte_shift_max)
+- remains = size;
+-
+- while (remains) {
+- unsigned long index = start >> kpte_shift_min;
++ int i;
+
+- kpte_set_val(index, val);
++ for (i = 0; i < KERNEL_TSB_NENTRIES; i++) {
++ struct tsb *ent = &swapper_tsb[i];
+
+- start += 1UL << kpte_shift_min;
+- remains -= 1UL << kpte_shift_min;
++ ent->tag = (1UL << TSB_TAG_INVALID_BIT);
+ }
++#ifndef CONFIG_DEBUG_PAGEALLOC
++ for (i = 0; i < KERNEL_TSB4M_NENTRIES; i++) {
++ struct tsb *ent = &swapper_4m_tsb[i];
+
+- return start;
+-}
+-
+-static void __init mark_kpte_bitmap(unsigned long start, unsigned long end)
+-{
+- unsigned long smallest_size, smallest_mask;
+- unsigned long s;
+-
+- smallest_size = (1UL << kpte_shift_min);
+- smallest_mask = (smallest_size - 1UL);
+-
+- while (start < end) {
+- unsigned long orig_start = start;
+-
+- for (s = kpte_shift_max; s >= kpte_shift_min; s -= kpte_shift_incr) {
+- start = kpte_mark_using_shift(start, end, s);
+-
+- if (start != orig_start)
+- break;
+- }
+-
+- if (start == orig_start)
+- start = (start + smallest_size) & ~smallest_mask;
++ ent->tag = (1UL << TSB_TAG_INVALID_BIT);
+ }
++#endif
+ }
+
+-static void __init init_kpte_bitmap(void)
+-{
+- unsigned long i;
+-
+- for (i = 0; i < pall_ents; i++) {
+- unsigned long phys_start, phys_end;
+-
+- phys_start = pall[i].phys_addr;
+- phys_end = phys_start + pall[i].reg_size;
+-
+- mark_kpte_bitmap(phys_start, phys_end);
+- }
+-}
++extern unsigned int kvmap_linear_patch[1];
+
+ static void __init kernel_physical_mapping_init(void)
+ {
+-#ifdef CONFIG_DEBUG_PAGEALLOC
+ unsigned long i, mem_alloced = 0UL;
++ bool use_huge = true;
+
++#ifdef CONFIG_DEBUG_PAGEALLOC
++ use_huge = false;
++#endif
+ for (i = 0; i < pall_ents; i++) {
+ unsigned long phys_start, phys_end;
+
+@@ -1522,7 +1605,7 @@ static void __init kernel_physical_mapping_init(void)
+ phys_end = phys_start + pall[i].reg_size;
+
+ mem_alloced += kernel_map_range(phys_start, phys_end,
+- PAGE_KERNEL);
++ PAGE_KERNEL, use_huge);
+ }
+
+ printk("Allocated %ld bytes for kernel page tables.\n",
+@@ -1531,8 +1614,9 @@ static void __init kernel_physical_mapping_init(void)
+ kvmap_linear_patch[0] = 0x01000000; /* nop */
+ flushi(&kvmap_linear_patch[0]);
+
++ flush_all_kernel_tsbs();
++
+ __flush_tlb_all();
+-#endif
+ }
+
+ #ifdef CONFIG_DEBUG_PAGEALLOC
+@@ -1542,7 +1626,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
+ unsigned long phys_end = phys_start + (numpages * PAGE_SIZE);
+
+ kernel_map_range(phys_start, phys_end,
+- (enable ? PAGE_KERNEL : __pgprot(0)));
++ (enable ? PAGE_KERNEL : __pgprot(0)), false);
+
+ flush_tsb_kernel_range(PAGE_OFFSET + phys_start,
+ PAGE_OFFSET + phys_end);
+@@ -1570,76 +1654,56 @@ unsigned long __init find_ecache_flush_span(unsigned long size)
+ unsigned long PAGE_OFFSET;
+ EXPORT_SYMBOL(PAGE_OFFSET);
+
+-static void __init page_offset_shift_patch_one(unsigned int *insn, unsigned long phys_bits)
+-{
+- unsigned long final_shift;
+- unsigned int val = *insn;
+- unsigned int cnt;
+-
+- /* We are patching in ilog2(max_supported_phys_address), and
+- * we are doing so in a manner similar to a relocation addend.
+- * That is, we are adding the shift value to whatever value
+- * is in the shift instruction count field already.
+- */
+- cnt = (val & 0x3f);
+- val &= ~0x3f;
+-
+- /* If we are trying to shift >= 64 bits, clear the destination
+- * register. This can happen when phys_bits ends up being equal
+- * to MAX_PHYS_ADDRESS_BITS.
+- */
+- final_shift = (cnt + (64 - phys_bits));
+- if (final_shift >= 64) {
+- unsigned int rd = (val >> 25) & 0x1f;
+-
+- val = 0x80100000 | (rd << 25);
+- } else {
+- val |= final_shift;
+- }
+- *insn = val;
+-
+- __asm__ __volatile__("flush %0"
+- : /* no outputs */
+- : "r" (insn));
+-}
+-
+-static void __init page_offset_shift_patch(unsigned long phys_bits)
+-{
+- extern unsigned int __page_offset_shift_patch;
+- extern unsigned int __page_offset_shift_patch_end;
+- unsigned int *p;
+-
+- p = &__page_offset_shift_patch;
+- while (p < &__page_offset_shift_patch_end) {
+- unsigned int *insn = (unsigned int *)(unsigned long)*p;
++unsigned long VMALLOC_END = 0x0000010000000000UL;
++EXPORT_SYMBOL(VMALLOC_END);
+
+- page_offset_shift_patch_one(insn, phys_bits);
+-
+- p++;
+- }
+-}
++unsigned long sparc64_va_hole_top = 0xfffff80000000000UL;
++unsigned long sparc64_va_hole_bottom = 0x0000080000000000UL;
+
+ static void __init setup_page_offset(void)
+ {
+- unsigned long max_phys_bits = 40;
+-
+ if (tlb_type == cheetah || tlb_type == cheetah_plus) {
++ /* Cheetah/Panther support a full 64-bit virtual
++ * address, so we can use all that our page tables
++ * support.
++ */
++ sparc64_va_hole_top = 0xfff0000000000000UL;
++ sparc64_va_hole_bottom = 0x0010000000000000UL;
++
+ max_phys_bits = 42;
+ } else if (tlb_type == hypervisor) {
+ switch (sun4v_chip_type) {
+ case SUN4V_CHIP_NIAGARA1:
+ case SUN4V_CHIP_NIAGARA2:
++ /* T1 and T2 support 48-bit virtual addresses. */
++ sparc64_va_hole_top = 0xffff800000000000UL;
++ sparc64_va_hole_bottom = 0x0000800000000000UL;
++
+ max_phys_bits = 39;
+ break;
+ case SUN4V_CHIP_NIAGARA3:
++ /* T3 supports 48-bit virtual addresses. */
++ sparc64_va_hole_top = 0xffff800000000000UL;
++ sparc64_va_hole_bottom = 0x0000800000000000UL;
++
+ max_phys_bits = 43;
+ break;
+ case SUN4V_CHIP_NIAGARA4:
+ case SUN4V_CHIP_NIAGARA5:
+ case SUN4V_CHIP_SPARC64X:
+- default:
++ case SUN4V_CHIP_SPARC_M6:
++ /* T4 and later support 52-bit virtual addresses. */
++ sparc64_va_hole_top = 0xfff8000000000000UL;
++ sparc64_va_hole_bottom = 0x0008000000000000UL;
+ max_phys_bits = 47;
+ break;
++ case SUN4V_CHIP_SPARC_M7:
++ default:
++ /* M7 and later support 52-bit virtual addresses. */
++ sparc64_va_hole_top = 0xfff8000000000000UL;
++ sparc64_va_hole_bottom = 0x0008000000000000UL;
++ max_phys_bits = 49;
++ break;
+ }
+ }
+
+@@ -1649,12 +1713,16 @@ static void __init setup_page_offset(void)
+ prom_halt();
+ }
+
+- PAGE_OFFSET = PAGE_OFFSET_BY_BITS(max_phys_bits);
++ PAGE_OFFSET = sparc64_va_hole_top;
++ VMALLOC_END = ((sparc64_va_hole_bottom >> 1) +
++ (sparc64_va_hole_bottom >> 2));
+
+- pr_info("PAGE_OFFSET is 0x%016lx (max_phys_bits == %lu)\n",
++ pr_info("MM: PAGE_OFFSET is 0x%016lx (max_phys_bits == %lu)\n",
+ PAGE_OFFSET, max_phys_bits);
+-
+- page_offset_shift_patch(max_phys_bits);
++ pr_info("MM: VMALLOC [0x%016lx --> 0x%016lx]\n",
++ VMALLOC_START, VMALLOC_END);
++ pr_info("MM: VMEMMAP [0x%016lx --> 0x%016lx]\n",
++ VMEMMAP_BASE, VMEMMAP_BASE << 1);
+ }
+
+ static void __init tsb_phys_patch(void)
+@@ -1699,21 +1767,42 @@ static void __init tsb_phys_patch(void)
+ #define NUM_KTSB_DESCR 1
+ #endif
+ static struct hv_tsb_descr ktsb_descr[NUM_KTSB_DESCR];
+-extern struct tsb swapper_tsb[KERNEL_TSB_NENTRIES];
++
++/* The swapper TSBs are loaded with a base sequence of:
++ *
++ * sethi %uhi(SYMBOL), REG1
++ * sethi %hi(SYMBOL), REG2
++ * or REG1, %ulo(SYMBOL), REG1
++ * or REG2, %lo(SYMBOL), REG2
++ * sllx REG1, 32, REG1
++ * or REG1, REG2, REG1
++ *
++ * When we use physical addressing for the TSB accesses, we patch the
++ * first four instructions in the above sequence.
++ */
+
+ static void patch_one_ktsb_phys(unsigned int *start, unsigned int *end, unsigned long pa)
+ {
+- pa >>= KTSB_PHYS_SHIFT;
++ unsigned long high_bits, low_bits;
++
++ high_bits = (pa >> 32) & 0xffffffff;
++ low_bits = (pa >> 0) & 0xffffffff;
+
+ while (start < end) {
+ unsigned int *ia = (unsigned int *)(unsigned long)*start;
+
+- ia[0] = (ia[0] & ~0x3fffff) | (pa >> 10);
++ ia[0] = (ia[0] & ~0x3fffff) | (high_bits >> 10);
+ __asm__ __volatile__("flush %0" : : "r" (ia));
+
+- ia[1] = (ia[1] & ~0x3ff) | (pa & 0x3ff);
++ ia[1] = (ia[1] & ~0x3fffff) | (low_bits >> 10);
+ __asm__ __volatile__("flush %0" : : "r" (ia + 1));
+
++ ia[2] = (ia[2] & ~0x1fff) | (high_bits & 0x3ff);
++ __asm__ __volatile__("flush %0" : : "r" (ia + 2));
++
++ ia[3] = (ia[3] & ~0x1fff) | (low_bits & 0x3ff);
++ __asm__ __volatile__("flush %0" : : "r" (ia + 3));
++
+ start++;
+ }
+ }
+@@ -1852,7 +1941,6 @@ static void __init sun4v_linear_pte_xor_finalize(void)
+ /* paging_init() sets up the page tables */
+
+ static unsigned long last_valid_pfn;
+-pgd_t swapper_pg_dir[PTRS_PER_PGD];
+
+ static void sun4u_pgprot_init(void);
+ static void sun4v_pgprot_init(void);
+@@ -1955,16 +2043,10 @@ void __init paging_init(void)
+ */
+ init_mm.pgd += ((shift) / (sizeof(pgd_t)));
+
+- memset(swapper_low_pmd_dir, 0, sizeof(swapper_low_pmd_dir));
++ memset(swapper_pg_dir, 0, sizeof(swapper_pg_dir));
+
+- /* Now can init the kernel/bad page tables. */
+- pud_set(pud_offset(&swapper_pg_dir[0], 0),
+- swapper_low_pmd_dir + (shift / sizeof(pgd_t)));
+-
+ inherit_prom_mappings();
+
+- init_kpte_bitmap();
+-
+ /* Ok, we can use our TLB miss and window trap handlers safely. */
+ setup_tba();
+
+@@ -2071,70 +2153,6 @@ int page_in_phys_avail(unsigned long paddr)
+ return 0;
+ }
+
+-static struct linux_prom64_registers pavail_rescan[MAX_BANKS] __initdata;
+-static int pavail_rescan_ents __initdata;
+-
+-/* Certain OBP calls, such as fetching "available" properties, can
+- * claim physical memory. So, along with initializing the valid
+- * address bitmap, what we do here is refetch the physical available
+- * memory list again, and make sure it provides at least as much
+- * memory as 'pavail' does.
+- */
+-static void __init setup_valid_addr_bitmap_from_pavail(unsigned long *bitmap)
+-{
+- int i;
+-
+- read_obp_memory("available", &pavail_rescan[0], &pavail_rescan_ents);
+-
+- for (i = 0; i < pavail_ents; i++) {
+- unsigned long old_start, old_end;
+-
+- old_start = pavail[i].phys_addr;
+- old_end = old_start + pavail[i].reg_size;
+- while (old_start < old_end) {
+- int n;
+-
+- for (n = 0; n < pavail_rescan_ents; n++) {
+- unsigned long new_start, new_end;
+-
+- new_start = pavail_rescan[n].phys_addr;
+- new_end = new_start +
+- pavail_rescan[n].reg_size;
+-
+- if (new_start <= old_start &&
+- new_end >= (old_start + PAGE_SIZE)) {
+- set_bit(old_start >> ILOG2_4MB, bitmap);
+- goto do_next_page;
+- }
+- }
+-
+- prom_printf("mem_init: Lost memory in pavail\n");
+- prom_printf("mem_init: OLD start[%lx] size[%lx]\n",
+- pavail[i].phys_addr,
+- pavail[i].reg_size);
+- prom_printf("mem_init: NEW start[%lx] size[%lx]\n",
+- pavail_rescan[i].phys_addr,
+- pavail_rescan[i].reg_size);
+- prom_printf("mem_init: Cannot continue, aborting.\n");
+- prom_halt();
+-
+- do_next_page:
+- old_start += PAGE_SIZE;
+- }
+- }
+-}
+-
+-static void __init patch_tlb_miss_handler_bitmap(void)
+-{
+- extern unsigned int valid_addr_bitmap_insn[];
+- extern unsigned int valid_addr_bitmap_patch[];
+-
+- valid_addr_bitmap_insn[1] = valid_addr_bitmap_patch[1];
+- mb();
+- valid_addr_bitmap_insn[0] = valid_addr_bitmap_patch[0];
+- flushi(&valid_addr_bitmap_insn[0]);
+-}
+-
+ static void __init register_page_bootmem_info(void)
+ {
+ #ifdef CONFIG_NEED_MULTIPLE_NODES
+@@ -2147,18 +2165,6 @@ static void __init register_page_bootmem_info(void)
+ }
+ void __init mem_init(void)
+ {
+- unsigned long addr, last;
+-
+- addr = PAGE_OFFSET + kern_base;
+- last = PAGE_ALIGN(kern_size) + addr;
+- while (addr < last) {
+- set_bit(__pa(addr) >> ILOG2_4MB, sparc64_valid_addr_bitmap);
+- addr += PAGE_SIZE;
+- }
+-
+- setup_valid_addr_bitmap_from_pavail(sparc64_valid_addr_bitmap);
+- patch_tlb_miss_handler_bitmap();
+-
+ high_memory = __va(last_valid_pfn << PAGE_SHIFT);
+
+ register_page_bootmem_info();
+@@ -2248,18 +2254,9 @@ unsigned long _PAGE_CACHE __read_mostly;
+ EXPORT_SYMBOL(_PAGE_CACHE);
+
+ #ifdef CONFIG_SPARSEMEM_VMEMMAP
+-unsigned long vmemmap_table[VMEMMAP_SIZE];
+-
+-static long __meminitdata addr_start, addr_end;
+-static int __meminitdata node_start;
+-
+ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
+ int node)
+ {
+- unsigned long phys_start = (vstart - VMEMMAP_BASE);
+- unsigned long phys_end = (vend - VMEMMAP_BASE);
+- unsigned long addr = phys_start & VMEMMAP_CHUNK_MASK;
+- unsigned long end = VMEMMAP_ALIGN(phys_end);
+ unsigned long pte_base;
+
+ pte_base = (_PAGE_VALID | _PAGE_SZ4MB_4U |
+@@ -2270,47 +2267,52 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
+ _PAGE_CP_4V | _PAGE_CV_4V |
+ _PAGE_P_4V | _PAGE_W_4V);
+
+- for (; addr < end; addr += VMEMMAP_CHUNK) {
+- unsigned long *vmem_pp =
+- vmemmap_table + (addr >> VMEMMAP_CHUNK_SHIFT);
+- void *block;
++ pte_base |= _PAGE_PMD_HUGE;
+
+- if (!(*vmem_pp & _PAGE_VALID)) {
+- block = vmemmap_alloc_block(1UL << ILOG2_4MB, node);
+- if (!block)
++ vstart = vstart & PMD_MASK;
++ vend = ALIGN(vend, PMD_SIZE);
++ for (; vstart < vend; vstart += PMD_SIZE) {
++ pgd_t *pgd = pgd_offset_k(vstart);
++ unsigned long pte;
++ pud_t *pud;
++ pmd_t *pmd;
++
++ if (pgd_none(*pgd)) {
++ pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
++
++ if (!new)
+ return -ENOMEM;
++ pgd_populate(&init_mm, pgd, new);
++ }
+
+- *vmem_pp = pte_base | __pa(block);
++ pud = pud_offset(pgd, vstart);
++ if (pud_none(*pud)) {
++ pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+
+- /* check to see if we have contiguous blocks */
+- if (addr_end != addr || node_start != node) {
+- if (addr_start)
+- printk(KERN_DEBUG " [%lx-%lx] on node %d\n",
+- addr_start, addr_end-1, node_start);
+- addr_start = addr;
+- node_start = node;
+- }
+- addr_end = addr + VMEMMAP_CHUNK;
++ if (!new)
++ return -ENOMEM;
++ pud_populate(&init_mm, pud, new);
+ }
+- }
+- return 0;
+-}
+
+-void __meminit vmemmap_populate_print_last(void)
+-{
+- if (addr_start) {
+- printk(KERN_DEBUG " [%lx-%lx] on node %d\n",
+- addr_start, addr_end-1, node_start);
+- addr_start = 0;
+- addr_end = 0;
+- node_start = 0;
++ pmd = pmd_offset(pud, vstart);
++
++ pte = pmd_val(*pmd);
++ if (!(pte & _PAGE_VALID)) {
++ void *block = vmemmap_alloc_block(PMD_SIZE, node);
++
++ if (!block)
++ return -ENOMEM;
++
++ pmd_val(*pmd) = pte_base | __pa(block);
++ }
+ }
++
++ return 0;
+ }
+
+ void vmemmap_free(unsigned long start, unsigned long end)
+ {
+ }
+-
+ #endif /* CONFIG_SPARSEMEM_VMEMMAP */
+
+ static void prot_init_common(unsigned long page_none,
+@@ -2722,8 +2724,8 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+ do_flush_tlb_kernel_range(start, LOW_OBP_ADDRESS);
+ }
+ if (end > HI_OBP_ADDRESS) {
+- flush_tsb_kernel_range(end, HI_OBP_ADDRESS);
+- do_flush_tlb_kernel_range(end, HI_OBP_ADDRESS);
++ flush_tsb_kernel_range(HI_OBP_ADDRESS, end);
++ do_flush_tlb_kernel_range(HI_OBP_ADDRESS, end);
+ }
+ } else {
+ flush_tsb_kernel_range(start, end);
+diff --git a/arch/sparc/mm/init_64.h b/arch/sparc/mm/init_64.h
+index 0668b364f44d..a4c09603b05c 100644
+--- a/arch/sparc/mm/init_64.h
++++ b/arch/sparc/mm/init_64.h
+@@ -8,15 +8,8 @@
+ */
+
+ #define MAX_PHYS_ADDRESS (1UL << MAX_PHYS_ADDRESS_BITS)
+-#define KPTE_BITMAP_CHUNK_SZ (256UL * 1024UL * 1024UL)
+-#define KPTE_BITMAP_BYTES \
+- ((MAX_PHYS_ADDRESS / KPTE_BITMAP_CHUNK_SZ) / 4)
+-#define VALID_ADDR_BITMAP_CHUNK_SZ (4UL * 1024UL * 1024UL)
+-#define VALID_ADDR_BITMAP_BYTES \
+- ((MAX_PHYS_ADDRESS / VALID_ADDR_BITMAP_CHUNK_SZ) / 8)
+
+ extern unsigned long kern_linear_pte_xor[4];
+-extern unsigned long kpte_linear_bitmap[KPTE_BITMAP_BYTES / sizeof(unsigned long)];
+ extern unsigned int sparc64_highest_unlocked_tlb_ent;
+ extern unsigned long sparc64_kern_pri_context;
+ extern unsigned long sparc64_kern_pri_nuc_bits;
+@@ -38,15 +31,4 @@ extern unsigned long kern_locked_tte_data;
+
+ void prom_world(int enter);
+
+-#ifdef CONFIG_SPARSEMEM_VMEMMAP
+-#define VMEMMAP_CHUNK_SHIFT 22
+-#define VMEMMAP_CHUNK (1UL << VMEMMAP_CHUNK_SHIFT)
+-#define VMEMMAP_CHUNK_MASK ~(VMEMMAP_CHUNK - 1UL)
+-#define VMEMMAP_ALIGN(x) (((x)+VMEMMAP_CHUNK-1UL)&VMEMMAP_CHUNK_MASK)
+-
+-#define VMEMMAP_SIZE ((((1UL << MAX_PHYSADDR_BITS) >> PAGE_SHIFT) * \
+- sizeof(struct page)) >> VMEMMAP_CHUNK_SHIFT)
+-extern unsigned long vmemmap_table[VMEMMAP_SIZE];
+-#endif
+-
+ #endif /* _SPARC64_MM_INIT_H */
+diff --git a/arch/sparc/net/bpf_jit_asm.S b/arch/sparc/net/bpf_jit_asm.S
+index 9d016c7017f7..8c83f4b8eb15 100644
+--- a/arch/sparc/net/bpf_jit_asm.S
++++ b/arch/sparc/net/bpf_jit_asm.S
+@@ -6,10 +6,12 @@
+ #define SAVE_SZ 176
+ #define SCRATCH_OFF STACK_BIAS + 128
+ #define BE_PTR(label) be,pn %xcc, label
++#define SIGN_EXTEND(reg) sra reg, 0, reg
+ #else
+ #define SAVE_SZ 96
+ #define SCRATCH_OFF 72
+ #define BE_PTR(label) be label
++#define SIGN_EXTEND(reg)
+ #endif
+
+ #define SKF_MAX_NEG_OFF (-0x200000) /* SKF_LL_OFF from filter.h */
+@@ -135,6 +137,7 @@ bpf_slow_path_byte_msh:
+ save %sp, -SAVE_SZ, %sp; \
+ mov %i0, %o0; \
+ mov r_OFF, %o1; \
++ SIGN_EXTEND(%o1); \
+ call bpf_internal_load_pointer_neg_helper; \
+ mov (LEN), %o2; \
+ mov %o0, r_TMP; \
+diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
+index 892a102671ad..8d4152f94c5a 100644
+--- a/arch/sparc/net/bpf_jit_comp.c
++++ b/arch/sparc/net/bpf_jit_comp.c
+@@ -184,7 +184,7 @@ do { \
+ */
+ #define emit_alu_K(OPCODE, K) \
+ do { \
+- if (K) { \
++ if (K || OPCODE == AND || OPCODE == MUL) { \
+ unsigned int _insn = OPCODE; \
+ _insn |= RS1(r_A) | RD(r_A); \
+ if (is_simm13(K)) { \
+@@ -234,12 +234,18 @@ do { BUILD_BUG_ON(FIELD_SIZEOF(STRUCT, FIELD) != sizeof(u8)); \
+ __emit_load8(BASE, STRUCT, FIELD, DEST); \
+ } while (0)
+
+-#define emit_ldmem(OFF, DEST) \
+-do { *prog++ = LD32I | RS1(FP) | S13(-(OFF)) | RD(DEST); \
++#ifdef CONFIG_SPARC64
++#define BIAS (STACK_BIAS - 4)
++#else
++#define BIAS (-4)
++#endif
++
++#define emit_ldmem(OFF, DEST) \
++do { *prog++ = LD32I | RS1(SP) | S13(BIAS - (OFF)) | RD(DEST); \
+ } while (0)
+
+-#define emit_stmem(OFF, SRC) \
+-do { *prog++ = LD32I | RS1(FP) | S13(-(OFF)) | RD(SRC); \
++#define emit_stmem(OFF, SRC) \
++do { *prog++ = ST32I | RS1(SP) | S13(BIAS - (OFF)) | RD(SRC); \
+ } while (0)
+
+ #ifdef CONFIG_SMP
+@@ -615,10 +621,11 @@ void bpf_jit_compile(struct sk_filter *fp)
+ case BPF_ANC | SKF_AD_VLAN_TAG:
+ case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
+ emit_skb_load16(vlan_tci, r_A);
+- if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) {
+- emit_andi(r_A, VLAN_VID_MASK, r_A);
++ if (code != (BPF_ANC | SKF_AD_VLAN_TAG)) {
++ emit_alu_K(SRL, 12);
++ emit_andi(r_A, 1, r_A);
+ } else {
+- emit_loadimm(VLAN_TAG_PRESENT, r_TMP);
++ emit_loadimm(~VLAN_TAG_PRESENT, r_TMP);
+ emit_and(r_A, r_TMP, r_A);
+ }
+ break;
+@@ -630,15 +637,19 @@ void bpf_jit_compile(struct sk_filter *fp)
+ emit_loadimm(K, r_X);
+ break;
+ case BPF_LD | BPF_MEM:
++ seen |= SEEN_MEM;
+ emit_ldmem(K * 4, r_A);
+ break;
+ case BPF_LDX | BPF_MEM:
++ seen |= SEEN_MEM | SEEN_XREG;
+ emit_ldmem(K * 4, r_X);
+ break;
+ case BPF_ST:
++ seen |= SEEN_MEM;
+ emit_stmem(K * 4, r_A);
+ break;
+ case BPF_STX:
++ seen |= SEEN_MEM | SEEN_XREG;
+ emit_stmem(K * 4, r_X);
+ break;
+
+diff --git a/arch/sparc/power/hibernate_asm.S b/arch/sparc/power/hibernate_asm.S
+index 79942166df84..d7d9017dcb15 100644
+--- a/arch/sparc/power/hibernate_asm.S
++++ b/arch/sparc/power/hibernate_asm.S
+@@ -54,8 +54,8 @@ ENTRY(swsusp_arch_resume)
+ nop
+
+ /* Write PAGE_OFFSET to %g7 */
+- sethi %uhi(PAGE_OFFSET), %g7
+- sllx %g7, 32, %g7
++ sethi %hi(PAGE_OFFSET), %g7
++ ldx [%g7 + %lo(PAGE_OFFSET)], %g7
+
+ setuw (PAGE_SIZE-8), %g3
+
+diff --git a/arch/sparc/prom/bootstr_64.c b/arch/sparc/prom/bootstr_64.c
+index ab9ccc63b388..7149e77714a4 100644
+--- a/arch/sparc/prom/bootstr_64.c
++++ b/arch/sparc/prom/bootstr_64.c
+@@ -14,7 +14,10 @@
+ * the .bss section or it will break things.
+ */
+
+-#define BARG_LEN 256
++/* We limit BARG_LEN to 1024 because this is the size of the
++ * 'barg_out' command line buffer in the SILO bootloader.
++ */
++#define BARG_LEN 1024
+ struct {
+ int bootstr_len;
+ int bootstr_valid;
+diff --git a/arch/sparc/prom/cif.S b/arch/sparc/prom/cif.S
+index 9c86b4b7d429..8050f381f518 100644
+--- a/arch/sparc/prom/cif.S
++++ b/arch/sparc/prom/cif.S
+@@ -11,11 +11,10 @@
+ .text
+ .globl prom_cif_direct
+ prom_cif_direct:
++ save %sp, -192, %sp
+ sethi %hi(p1275buf), %o1
+ or %o1, %lo(p1275buf), %o1
+- ldx [%o1 + 0x0010], %o2 ! prom_cif_stack
+- save %o2, -192, %sp
+- ldx [%i1 + 0x0008], %l2 ! prom_cif_handler
++ ldx [%o1 + 0x0008], %l2 ! prom_cif_handler
+ mov %g4, %l0
+ mov %g5, %l1
+ mov %g6, %l3
+diff --git a/arch/sparc/prom/init_64.c b/arch/sparc/prom/init_64.c
+index d95db755828f..110b0d78b864 100644
+--- a/arch/sparc/prom/init_64.c
++++ b/arch/sparc/prom/init_64.c
+@@ -26,13 +26,13 @@ phandle prom_chosen_node;
+ * It gets passed the pointer to the PROM vector.
+ */
+
+-extern void prom_cif_init(void *, void *);
++extern void prom_cif_init(void *);
+
+-void __init prom_init(void *cif_handler, void *cif_stack)
++void __init prom_init(void *cif_handler)
+ {
+ phandle node;
+
+- prom_cif_init(cif_handler, cif_stack);
++ prom_cif_init(cif_handler);
+
+ prom_chosen_node = prom_finddevice(prom_chosen_path);
+ if (!prom_chosen_node || (s32)prom_chosen_node == -1)
+diff --git a/arch/sparc/prom/p1275.c b/arch/sparc/prom/p1275.c
+index e58b81726319..545d8bb79b65 100644
+--- a/arch/sparc/prom/p1275.c
++++ b/arch/sparc/prom/p1275.c
+@@ -9,6 +9,7 @@
+ #include <linux/smp.h>
+ #include <linux/string.h>
+ #include <linux/spinlock.h>
++#include <linux/irqflags.h>
+
+ #include <asm/openprom.h>
+ #include <asm/oplib.h>
+@@ -19,7 +20,6 @@
+ struct {
+ long prom_callback; /* 0x00 */
+ void (*prom_cif_handler)(long *); /* 0x08 */
+- unsigned long prom_cif_stack; /* 0x10 */
+ } p1275buf;
+
+ extern void prom_world(int);
+@@ -36,8 +36,8 @@ void p1275_cmd_direct(unsigned long *args)
+ {
+ unsigned long flags;
+
+- raw_local_save_flags(flags);
+- raw_local_irq_restore((unsigned long)PIL_NMI);
++ local_save_flags(flags);
++ local_irq_restore((unsigned long)PIL_NMI);
+ raw_spin_lock(&prom_entry_lock);
+
+ prom_world(1);
+@@ -45,11 +45,10 @@ void p1275_cmd_direct(unsigned long *args)
+ prom_world(0);
+
+ raw_spin_unlock(&prom_entry_lock);
+- raw_local_irq_restore(flags);
++ local_irq_restore(flags);
+ }
+
+ void prom_cif_init(void *cif_handler, void *cif_stack)
+ {
+ p1275buf.prom_cif_handler = (void (*)(long *))cif_handler;
+- p1275buf.prom_cif_stack = (unsigned long)cif_stack;
+ }
+diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
+index 9f83c171ac18..db1ce1e90a5b 100644
+--- a/arch/x86/include/asm/kvm_host.h
++++ b/arch/x86/include/asm/kvm_host.h
+@@ -479,6 +479,7 @@ struct kvm_vcpu_arch {
+ u64 mmio_gva;
+ unsigned access;
+ gfn_t mmio_gfn;
++ u64 mmio_gen;
+
+ struct kvm_pmu pmu;
+
+diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
+index f9e4fdd3b877..21337cd58b6b 100644
+--- a/arch/x86/kernel/cpu/intel.c
++++ b/arch/x86/kernel/cpu/intel.c
+@@ -144,6 +144,21 @@ static void early_init_intel(struct cpuinfo_x86 *c)
+ setup_clear_cpu_cap(X86_FEATURE_ERMS);
+ }
+ }
++
++ /*
++ * Intel Quark Core DevMan_001.pdf section 6.4.11
++ * "The operating system also is required to invalidate (i.e., flush)
++ * the TLB when any changes are made to any of the page table entries.
++ * The operating system must reload CR3 to cause the TLB to be flushed"
++ *
++ * As a result cpu_has_pge() in arch/x86/include/asm/tlbflush.h should
++ * be false so that __flush_tlb_all() causes CR3 instead of CR4.PGE
++ * to be modified
++ */
++ if (c->x86 == 5 && c->x86_model == 9) {
++ pr_info("Disabling PGE capability bit\n");
++ setup_clear_cpu_cap(X86_FEATURE_PGE);
++ }
+ }
+
+ #ifdef CONFIG_X86_32
+diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
+index 931467881da7..1cd2a5fbde07 100644
+--- a/arch/x86/kvm/mmu.c
++++ b/arch/x86/kvm/mmu.c
+@@ -199,16 +199,20 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
+ EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
+
+ /*
+- * spte bits of bit 3 ~ bit 11 are used as low 9 bits of generation number,
+- * the bits of bits 52 ~ bit 61 are used as high 10 bits of generation
+- * number.
++ * the low bit of the generation number is always presumed to be zero.
++ * This disables mmio caching during memslot updates. The concept is
++ * similar to a seqcount but instead of retrying the access we just punt
++ * and ignore the cache.
++ *
++ * spte bits 3-11 are used as bits 1-9 of the generation number,
++ * the bits 52-61 are used as bits 10-19 of the generation number.
+ */
+-#define MMIO_SPTE_GEN_LOW_SHIFT 3
++#define MMIO_SPTE_GEN_LOW_SHIFT 2
+ #define MMIO_SPTE_GEN_HIGH_SHIFT 52
+
+-#define MMIO_GEN_SHIFT 19
+-#define MMIO_GEN_LOW_SHIFT 9
+-#define MMIO_GEN_LOW_MASK ((1 << MMIO_GEN_LOW_SHIFT) - 1)
++#define MMIO_GEN_SHIFT 20
++#define MMIO_GEN_LOW_SHIFT 10
++#define MMIO_GEN_LOW_MASK ((1 << MMIO_GEN_LOW_SHIFT) - 2)
+ #define MMIO_GEN_MASK ((1 << MMIO_GEN_SHIFT) - 1)
+ #define MMIO_MAX_GEN ((1 << MMIO_GEN_SHIFT) - 1)
+
+@@ -236,12 +240,7 @@ static unsigned int get_mmio_spte_generation(u64 spte)
+
+ static unsigned int kvm_current_mmio_generation(struct kvm *kvm)
+ {
+- /*
+- * Init kvm generation close to MMIO_MAX_GEN to easily test the
+- * code of handling generation number wrap-around.
+- */
+- return (kvm_memslots(kvm)->generation +
+- MMIO_MAX_GEN - 150) & MMIO_GEN_MASK;
++ return kvm_memslots(kvm)->generation & MMIO_GEN_MASK;
+ }
+
+ static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
+@@ -3163,7 +3162,7 @@ static void mmu_sync_roots(struct kvm_vcpu *vcpu)
+ if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+ return;
+
+- vcpu_clear_mmio_info(vcpu, ~0ul);
++ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+ kvm_mmu_audit(vcpu, AUDIT_PRE_SYNC);
+ if (vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL) {
+ hpa_t root = vcpu->arch.mmu.root_hpa;
+@@ -4433,7 +4432,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
+ * The very rare case: if the generation-number is round,
+ * zap all shadow pages.
+ */
+- if (unlikely(kvm_current_mmio_generation(kvm) >= MMIO_MAX_GEN)) {
++ if (unlikely(kvm_current_mmio_generation(kvm) == 0)) {
+ printk_ratelimited(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound\n");
+ kvm_mmu_invalidate_zap_all_pages(kvm);
+ }
+diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
+index 801332edefc3..6c437ed00dcf 100644
+--- a/arch/x86/kvm/vmx.c
++++ b/arch/x86/kvm/vmx.c
+@@ -450,6 +450,7 @@ struct vcpu_vmx {
+ int gs_ldt_reload_needed;
+ int fs_reload_needed;
+ u64 msr_host_bndcfgs;
++ unsigned long vmcs_host_cr4; /* May not match real cr4 */
+ } host_state;
+ struct {
+ int vm86_active;
+@@ -4218,11 +4219,16 @@ static void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
+ u32 low32, high32;
+ unsigned long tmpl;
+ struct desc_ptr dt;
++ unsigned long cr4;
+
+ vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS); /* 22.2.3 */
+- vmcs_writel(HOST_CR4, read_cr4()); /* 22.2.3, 22.2.5 */
+ vmcs_writel(HOST_CR3, read_cr3()); /* 22.2.3 FIXME: shadow tables */
+
++ /* Save the most likely value for this task's CR4 in the VMCS. */
++ cr4 = read_cr4();
++ vmcs_writel(HOST_CR4, cr4); /* 22.2.3, 22.2.5 */
++ vmx->host_state.vmcs_host_cr4 = cr4;
++
+ vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS); /* 22.2.4 */
+ #ifdef CONFIG_X86_64
+ /*
+@@ -7336,7 +7342,7 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
+ {
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+- unsigned long debugctlmsr;
++ unsigned long debugctlmsr, cr4;
+
+ /* Record the guest's net vcpu time for enforced NMI injections. */
+ if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
+@@ -7357,6 +7363,12 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
+ if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))
+ vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);
+
++ cr4 = read_cr4();
++ if (unlikely(cr4 != vmx->host_state.vmcs_host_cr4)) {
++ vmcs_writel(HOST_CR4, cr4);
++ vmx->host_state.vmcs_host_cr4 = cr4;
++ }
++
+ /* When single-stepping over STI and MOV SS, we must clear the
+ * corresponding interruptibility bits in the guest state. Otherwise
+ * vmentry fails as it then expects bit 14 (BS) in pending debug
+diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
+index 8c97bac9a895..b0b17e6f0431 100644
+--- a/arch/x86/kvm/x86.h
++++ b/arch/x86/kvm/x86.h
+@@ -78,15 +78,23 @@ static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
+ vcpu->arch.mmio_gva = gva & PAGE_MASK;
+ vcpu->arch.access = access;
+ vcpu->arch.mmio_gfn = gfn;
++ vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
++}
++
++static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
++{
++ return vcpu->arch.mmio_gen == kvm_memslots(vcpu->kvm)->generation;
+ }
+
+ /*
+- * Clear the mmio cache info for the given gva,
+- * specially, if gva is ~0ul, we clear all mmio cache info.
++ * Clear the mmio cache info for the given gva. If gva is MMIO_GVA_ANY, we
++ * clear all mmio cache info.
+ */
++#define MMIO_GVA_ANY (~(gva_t)0)
++
+ static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
+ {
+- if (gva != (~0ul) && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
++ if (gva != MMIO_GVA_ANY && vcpu->arch.mmio_gva != (gva & PAGE_MASK))
+ return;
+
+ vcpu->arch.mmio_gva = 0;
+@@ -94,7 +102,8 @@ static inline void vcpu_clear_mmio_info(struct kvm_vcpu *vcpu, gva_t gva)
+
+ static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long gva)
+ {
+- if (vcpu->arch.mmio_gva && vcpu->arch.mmio_gva == (gva & PAGE_MASK))
++ if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gva &&
++ vcpu->arch.mmio_gva == (gva & PAGE_MASK))
+ return true;
+
+ return false;
+@@ -102,7 +111,8 @@ static inline bool vcpu_match_mmio_gva(struct kvm_vcpu *vcpu, unsigned long gva)
+
+ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+ {
+- if (vcpu->arch.mmio_gfn && vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
++ if (vcpu_match_mmio_gen(vcpu) && vcpu->arch.mmio_gfn &&
++ vcpu->arch.mmio_gfn == gpa >> PAGE_SHIFT)
+ return true;
+
+ return false;
+diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
+index 3c562f5a60bb..e1bce26cd4f9 100644
+--- a/crypto/async_tx/async_xor.c
++++ b/crypto/async_tx/async_xor.c
+@@ -78,8 +78,6 @@ do_async_xor(struct dma_chan *chan, struct dmaengine_unmap_data *unmap,
+ tx = dma->device_prep_dma_xor(chan, dma_dest, src_list,
+ xor_src_cnt, unmap->len,
+ dma_flags);
+- src_list[0] = tmp;
+-
+
+ if (unlikely(!tx))
+ async_tx_quiesce(&submit->depend_tx);
+@@ -92,6 +90,7 @@ do_async_xor(struct dma_chan *chan, struct dmaengine_unmap_data *unmap,
+ xor_src_cnt, unmap->len,
+ dma_flags);
+ }
++ src_list[0] = tmp;
+
+ dma_set_unmap(tx, unmap);
+ async_tx_submit(chan, tx, submit);
+diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
+index d276e33880be..2a1d1ae5c11d 100644
+--- a/drivers/base/firmware_class.c
++++ b/drivers/base/firmware_class.c
+@@ -1086,6 +1086,9 @@ _request_firmware(const struct firmware **firmware_p, const char *name,
+ if (!firmware_p)
+ return -EINVAL;
+
++ if (!name || name[0] == '\0')
++ return -EINVAL;
++
+ ret = _request_firmware_prepare(&fw, name, device);
+ if (ret <= 0) /* error or already assigned */
+ goto out;
+diff --git a/drivers/base/regmap/regmap-debugfs.c b/drivers/base/regmap/regmap-debugfs.c
+index 65ea7b256b3e..a3530dadb163 100644
+--- a/drivers/base/regmap/regmap-debugfs.c
++++ b/drivers/base/regmap/regmap-debugfs.c
+@@ -473,6 +473,7 @@ void regmap_debugfs_init(struct regmap *map, const char *name)
+ {
+ struct rb_node *next;
+ struct regmap_range_node *range_node;
++ const char *devname = "dummy";
+
+ /* If we don't have the debugfs root yet, postpone init */
+ if (!regmap_debugfs_root) {
+@@ -491,12 +492,15 @@ void regmap_debugfs_init(struct regmap *map, const char *name)
+ INIT_LIST_HEAD(&map->debugfs_off_cache);
+ mutex_init(&map->cache_lock);
+
++ if (map->dev)
++ devname = dev_name(map->dev);
++
+ if (name) {
+ map->debugfs_name = kasprintf(GFP_KERNEL, "%s-%s",
+- dev_name(map->dev), name);
++ devname, name);
+ name = map->debugfs_name;
+ } else {
+- name = dev_name(map->dev);
++ name = devname;
+ }
+
+ map->debugfs = debugfs_create_dir(name, regmap_debugfs_root);
+diff --git a/drivers/base/regmap/regmap.c b/drivers/base/regmap/regmap.c
+index 283644e5d31f..8cda01590ed2 100644
+--- a/drivers/base/regmap/regmap.c
++++ b/drivers/base/regmap/regmap.c
+@@ -1395,7 +1395,7 @@ int _regmap_write(struct regmap *map, unsigned int reg,
+ }
+
+ #ifdef LOG_DEVICE
+- if (strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
++ if (map->dev && strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
+ dev_info(map->dev, "%x <= %x\n", reg, val);
+ #endif
+
+@@ -1646,6 +1646,9 @@ out:
+ } else {
+ void *wval;
+
++ if (!val_count)
++ return -EINVAL;
++
+ wval = kmemdup(val, val_count * val_bytes, GFP_KERNEL);
+ if (!wval) {
+ dev_err(map->dev, "Error in memory allocation\n");
+@@ -2045,7 +2048,7 @@ static int _regmap_read(struct regmap *map, unsigned int reg,
+ ret = map->reg_read(context, reg, val);
+ if (ret == 0) {
+ #ifdef LOG_DEVICE
+- if (strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
++ if (map->dev && strcmp(dev_name(map->dev), LOG_DEVICE) == 0)
+ dev_info(map->dev, "%x => %x\n", reg, *val);
+ #endif
+
+diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
+index 6250fc2fb93a..0489a946e68d 100644
+--- a/drivers/bluetooth/btusb.c
++++ b/drivers/bluetooth/btusb.c
+@@ -317,6 +317,9 @@ static void btusb_intr_complete(struct urb *urb)
+ BT_ERR("%s corrupted event packet", hdev->name);
+ hdev->stat.err_rx++;
+ }
++ } else if (urb->status == -ENOENT) {
++ /* Avoid suspend failed when usb_kill_urb */
++ return;
+ }
+
+ if (!test_bit(BTUSB_INTR_RUNNING, &data->flags))
+@@ -405,6 +408,9 @@ static void btusb_bulk_complete(struct urb *urb)
+ BT_ERR("%s corrupted ACL packet", hdev->name);
+ hdev->stat.err_rx++;
+ }
++ } else if (urb->status == -ENOENT) {
++ /* Avoid suspend failed when usb_kill_urb */
++ return;
+ }
+
+ if (!test_bit(BTUSB_BULK_RUNNING, &data->flags))
+@@ -499,6 +505,9 @@ static void btusb_isoc_complete(struct urb *urb)
+ hdev->stat.err_rx++;
+ }
+ }
++ } else if (urb->status == -ENOENT) {
++ /* Avoid suspend failed when usb_kill_urb */
++ return;
+ }
+
+ if (!test_bit(BTUSB_ISOC_RUNNING, &data->flags))
+diff --git a/drivers/bluetooth/hci_h5.c b/drivers/bluetooth/hci_h5.c
+index fede8ca7147c..5d9148f8a506 100644
+--- a/drivers/bluetooth/hci_h5.c
++++ b/drivers/bluetooth/hci_h5.c
+@@ -237,7 +237,7 @@ static void h5_pkt_cull(struct h5 *h5)
+ break;
+
+ to_remove--;
+- seq = (seq - 1) % 8;
++ seq = (seq - 1) & 0x07;
+ }
+
+ if (seq != h5->rx_ack)
+diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
+index f4aec2e6ef56..7d3742edbaa2 100644
+--- a/drivers/edac/mpc85xx_edac.c
++++ b/drivers/edac/mpc85xx_edac.c
+@@ -633,7 +633,7 @@ static int mpc85xx_l2_err_probe(struct platform_device *op)
+ if (edac_op_state == EDAC_OPSTATE_INT) {
+ pdata->irq = irq_of_parse_and_map(op->dev.of_node, 0);
+ res = devm_request_irq(&op->dev, pdata->irq,
+- mpc85xx_l2_isr, 0,
++ mpc85xx_l2_isr, IRQF_SHARED,
+ "[EDAC] L2 err", edac_dev);
+ if (res < 0) {
+ printk(KERN_ERR
+diff --git a/drivers/hid/hid-rmi.c b/drivers/hid/hid-rmi.c
+index 578bbe65902b..54966ca9e503 100644
+--- a/drivers/hid/hid-rmi.c
++++ b/drivers/hid/hid-rmi.c
+@@ -320,10 +320,7 @@ static int rmi_f11_input_event(struct hid_device *hdev, u8 irq, u8 *data,
+ int offset;
+ int i;
+
+- if (size < hdata->f11.report_size)
+- return 0;
+-
+- if (!(irq & hdata->f11.irq_mask))
++ if (!(irq & hdata->f11.irq_mask) || size <= 0)
+ return 0;
+
+ offset = (hdata->max_fingers >> 2) + 1;
+@@ -332,9 +329,19 @@ static int rmi_f11_input_event(struct hid_device *hdev, u8 irq, u8 *data,
+ int fs_bit_position = (i & 0x3) << 1;
+ int finger_state = (data[fs_byte_position] >> fs_bit_position) &
+ 0x03;
++ int position = offset + 5 * i;
++
++ if (position + 5 > size) {
++ /* partial report, go on with what we received */
++ printk_once(KERN_WARNING
++ "%s %s: Detected incomplete finger report. Finger reports may occasionally get dropped on this platform.\n",
++ dev_driver_string(&hdev->dev),
++ dev_name(&hdev->dev));
++ hid_dbg(hdev, "Incomplete finger report\n");
++ break;
++ }
+
+- rmi_f11_process_touch(hdata, i, finger_state,
+- &data[offset + 5 * i]);
++ rmi_f11_process_touch(hdata, i, finger_state, &data[position]);
+ }
+ input_mt_sync_frame(hdata->input);
+ input_sync(hdata->input);
+@@ -352,6 +359,11 @@ static int rmi_f30_input_event(struct hid_device *hdev, u8 irq, u8 *data,
+ if (!(irq & hdata->f30.irq_mask))
+ return 0;
+
++ if (size < (int)hdata->f30.report_size) {
++ hid_warn(hdev, "Click Button pressed, but the click data is missing\n");
++ return 0;
++ }
++
+ for (i = 0; i < hdata->gpio_led_count; i++) {
+ if (test_bit(i, &hdata->button_mask)) {
+ value = (data[i / 8] >> (i & 0x07)) & BIT(0);
+@@ -412,9 +424,29 @@ static int rmi_read_data_event(struct hid_device *hdev, u8 *data, int size)
+ return 1;
+ }
+
++static int rmi_check_sanity(struct hid_device *hdev, u8 *data, int size)
++{
++ int valid_size = size;
++ /*
++ * On the Dell XPS 13 9333, the bus sometimes get confused and fills
++ * the report with a sentinel value "ff". Synaptics told us that such
++	 * behavior does not come from the touchpad itself, so we filter out
++ * such reports here.
++ */
++
++ while ((data[valid_size - 1] == 0xff) && valid_size > 0)
++ valid_size--;
++
++ return valid_size;
++}
++
+ static int rmi_raw_event(struct hid_device *hdev,
+ struct hid_report *report, u8 *data, int size)
+ {
++ size = rmi_check_sanity(hdev, data, size);
++ if (size < 2)
++ return 0;
++
+ switch (data[0]) {
+ case RMI_READ_DATA_REPORT_ID:
+ return rmi_read_data_event(hdev, data, size);
+diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
+index 284cf66489f4..bec55ed2917a 100644
+--- a/drivers/hv/channel.c
++++ b/drivers/hv/channel.c
+@@ -165,8 +165,10 @@ int vmbus_open(struct vmbus_channel *newchannel, u32 send_ringbuffer_size,
+ ret = vmbus_post_msg(open_msg,
+ sizeof(struct vmbus_channel_open_channel));
+
+- if (ret != 0)
++ if (ret != 0) {
++ err = ret;
+ goto error1;
++ }
+
+ t = wait_for_completion_timeout(&open_info->waitevent, 5*HZ);
+ if (t == 0) {
+@@ -363,7 +365,6 @@ int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
+ u32 next_gpadl_handle;
+ unsigned long flags;
+ int ret = 0;
+- int t;
+
+ next_gpadl_handle = atomic_read(&vmbus_connection.next_gpadl_handle);
+ atomic_inc(&vmbus_connection.next_gpadl_handle);
+@@ -410,9 +411,7 @@ int vmbus_establish_gpadl(struct vmbus_channel *channel, void *kbuffer,
+
+ }
+ }
+- t = wait_for_completion_timeout(&msginfo->waitevent, 5*HZ);
+- BUG_ON(t == 0);
+-
++ wait_for_completion(&msginfo->waitevent);
+
+ /* At this point, we received the gpadl created msg */
+ *gpadl_handle = gpadlmsg->gpadl;
+@@ -435,7 +434,7 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
+ struct vmbus_channel_gpadl_teardown *msg;
+ struct vmbus_channel_msginfo *info;
+ unsigned long flags;
+- int ret, t;
++ int ret;
+
+ info = kmalloc(sizeof(*info) +
+ sizeof(struct vmbus_channel_gpadl_teardown), GFP_KERNEL);
+@@ -457,11 +456,12 @@ int vmbus_teardown_gpadl(struct vmbus_channel *channel, u32 gpadl_handle)
+ ret = vmbus_post_msg(msg,
+ sizeof(struct vmbus_channel_gpadl_teardown));
+
+- BUG_ON(ret != 0);
+- t = wait_for_completion_timeout(&info->waitevent, 5*HZ);
+- BUG_ON(t == 0);
++ if (ret)
++ goto post_msg_err;
++
++ wait_for_completion(&info->waitevent);
+
+- /* Received a torndown response */
++post_msg_err:
+ spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
+ list_del(&info->msglistentry);
+ spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
+@@ -478,7 +478,7 @@ static void reset_channel_cb(void *arg)
+ channel->onchannel_callback = NULL;
+ }
+
+-static void vmbus_close_internal(struct vmbus_channel *channel)
++static int vmbus_close_internal(struct vmbus_channel *channel)
+ {
+ struct vmbus_channel_close_channel *msg;
+ int ret;
+@@ -501,11 +501,28 @@ static void vmbus_close_internal(struct vmbus_channel *channel)
+
+ ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_close_channel));
+
+- BUG_ON(ret != 0);
++ if (ret) {
++ pr_err("Close failed: close post msg return is %d\n", ret);
++ /*
++ * If we failed to post the close msg,
++ * it is perhaps better to leak memory.
++ */
++ return ret;
++ }
++
+ /* Tear down the gpadl for the channel's ring buffer */
+- if (channel->ringbuffer_gpadlhandle)
+- vmbus_teardown_gpadl(channel,
+- channel->ringbuffer_gpadlhandle);
++ if (channel->ringbuffer_gpadlhandle) {
++ ret = vmbus_teardown_gpadl(channel,
++ channel->ringbuffer_gpadlhandle);
++ if (ret) {
++ pr_err("Close failed: teardown gpadl return %d\n", ret);
++ /*
++ * If we failed to teardown gpadl,
++ * it is perhaps better to leak memory.
++ */
++ return ret;
++ }
++ }
+
+ /* Cleanup the ring buffers for this channel */
+ hv_ringbuffer_cleanup(&channel->outbound);
+@@ -514,7 +531,7 @@ static void vmbus_close_internal(struct vmbus_channel *channel)
+ free_pages((unsigned long)channel->ringbuffer_pages,
+ get_order(channel->ringbuffer_pagecount * PAGE_SIZE));
+
+-
++ return ret;
+ }
+
+ /*
+diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
+index ae22e3c1fc4c..e206619b946e 100644
+--- a/drivers/hv/connection.c
++++ b/drivers/hv/connection.c
+@@ -427,10 +427,21 @@ int vmbus_post_msg(void *buffer, size_t buflen)
+ * insufficient resources. Retry the operation a couple of
+ * times before giving up.
+ */
+- while (retries < 3) {
+- ret = hv_post_message(conn_id, 1, buffer, buflen);
+- if (ret != HV_STATUS_INSUFFICIENT_BUFFERS)
++ while (retries < 10) {
++ ret = hv_post_message(conn_id, 1, buffer, buflen);
++
++ switch (ret) {
++ case HV_STATUS_INSUFFICIENT_BUFFERS:
++ ret = -ENOMEM;
++ case -ENOMEM:
++ break;
++ case HV_STATUS_SUCCESS:
+ return ret;
++ default:
++ pr_err("hv_post_msg() failed; error code:%d\n", ret);
++ return -EINVAL;
++ }
++
+ retries++;
+ msleep(100);
+ }
+diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
+index edfc8488cb03..3e4235c7a47f 100644
+--- a/drivers/hv/hv.c
++++ b/drivers/hv/hv.c
+@@ -138,6 +138,8 @@ int hv_init(void)
+ memset(hv_context.synic_event_page, 0, sizeof(void *) * NR_CPUS);
+ memset(hv_context.synic_message_page, 0,
+ sizeof(void *) * NR_CPUS);
++ memset(hv_context.post_msg_page, 0,
++ sizeof(void *) * NR_CPUS);
+ memset(hv_context.vp_index, 0,
+ sizeof(int) * NR_CPUS);
+ memset(hv_context.event_dpc, 0,
+@@ -217,26 +219,18 @@ int hv_post_message(union hv_connection_id connection_id,
+ enum hv_message_type message_type,
+ void *payload, size_t payload_size)
+ {
+- struct aligned_input {
+- u64 alignment8;
+- struct hv_input_post_message msg;
+- };
+
+ struct hv_input_post_message *aligned_msg;
+ u16 status;
+- unsigned long addr;
+
+ if (payload_size > HV_MESSAGE_PAYLOAD_BYTE_COUNT)
+ return -EMSGSIZE;
+
+- addr = (unsigned long)kmalloc(sizeof(struct aligned_input), GFP_ATOMIC);
+- if (!addr)
+- return -ENOMEM;
+-
+ aligned_msg = (struct hv_input_post_message *)
+- (ALIGN(addr, HV_HYPERCALL_PARAM_ALIGN));
++ hv_context.post_msg_page[get_cpu()];
+
+ aligned_msg->connectionid = connection_id;
++ aligned_msg->reserved = 0;
+ aligned_msg->message_type = message_type;
+ aligned_msg->payload_size = payload_size;
+ memcpy((void *)aligned_msg->payload, payload, payload_size);
+@@ -244,8 +238,7 @@ int hv_post_message(union hv_connection_id connection_id,
+ status = do_hypercall(HVCALL_POST_MESSAGE, aligned_msg, NULL)
+ & 0xFFFF;
+
+- kfree((void *)addr);
+-
++ put_cpu();
+ return status;
+ }
+
+@@ -294,6 +287,14 @@ int hv_synic_alloc(void)
+ pr_err("Unable to allocate SYNIC event page\n");
+ goto err;
+ }
++
++ hv_context.post_msg_page[cpu] =
++ (void *)get_zeroed_page(GFP_ATOMIC);
++
++ if (hv_context.post_msg_page[cpu] == NULL) {
++ pr_err("Unable to allocate post msg page\n");
++ goto err;
++ }
+ }
+
+ return 0;
+@@ -308,6 +309,8 @@ static void hv_synic_free_cpu(int cpu)
+ free_page((unsigned long)hv_context.synic_event_page[cpu]);
+ if (hv_context.synic_message_page[cpu])
+ free_page((unsigned long)hv_context.synic_message_page[cpu]);
++ if (hv_context.post_msg_page[cpu])
++ free_page((unsigned long)hv_context.post_msg_page[cpu]);
+ }
+
+ void hv_synic_free(void)
+diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
+index 22b750749a39..c386d8dc7223 100644
+--- a/drivers/hv/hyperv_vmbus.h
++++ b/drivers/hv/hyperv_vmbus.h
+@@ -515,6 +515,10 @@ struct hv_context {
+ * per-cpu list of the channels based on their CPU affinity.
+ */
+ struct list_head percpu_list[NR_CPUS];
++ /*
++ * buffer to post messages to the host.
++ */
++ void *post_msg_page[NR_CPUS];
+ };
+
+ extern struct hv_context hv_context;
+diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
+index 49d11338294b..2fb90e2825c3 100644
+--- a/drivers/message/fusion/mptspi.c
++++ b/drivers/message/fusion/mptspi.c
+@@ -1420,6 +1420,11 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+ goto out_mptspi_probe;
+ }
+
++ /* VMWare emulation doesn't properly implement WRITE_SAME
++ */
++ if (pdev->subsystem_vendor == 0x15AD)
++ sh->no_write_same = 1;
++
+ spin_lock_irqsave(&ioc->FreeQlock, flags);
+
+ /* Attach the SCSI Host to the IOC structure
+diff --git a/drivers/misc/mei/bus.c b/drivers/misc/mei/bus.c
+index 0e993ef28b94..8fd9466266b6 100644
+--- a/drivers/misc/mei/bus.c
++++ b/drivers/misc/mei/bus.c
+@@ -70,7 +70,7 @@ static int mei_cl_device_probe(struct device *dev)
+
+ dev_dbg(dev, "Device probe\n");
+
+- strncpy(id.name, dev_name(dev), sizeof(id.name));
++ strlcpy(id.name, dev_name(dev), sizeof(id.name));
+
+ return driver->probe(device, &id);
+ }
+diff --git a/drivers/net/wireless/ath/ath9k/ar5008_phy.c b/drivers/net/wireless/ath/ath9k/ar5008_phy.c
+index 00fb8badbacc..3b3e91057a4c 100644
+--- a/drivers/net/wireless/ath/ath9k/ar5008_phy.c
++++ b/drivers/net/wireless/ath/ath9k/ar5008_phy.c
+@@ -1004,9 +1004,11 @@ static bool ar5008_hw_ani_control_new(struct ath_hw *ah,
+ case ATH9K_ANI_FIRSTEP_LEVEL:{
+ u32 level = param;
+
+- value = level;
++ value = level * 2;
+ REG_RMW_FIELD(ah, AR_PHY_FIND_SIG,
+ AR_PHY_FIND_SIG_FIRSTEP, value);
++ REG_RMW_FIELD(ah, AR_PHY_FIND_SIG_LOW,
++ AR_PHY_FIND_SIG_FIRSTEP_LOW, value);
+
+ if (level != aniState->firstepLevel) {
+ ath_dbg(common, ANI,
+diff --git a/drivers/net/wireless/iwlwifi/mvm/constants.h b/drivers/net/wireless/iwlwifi/mvm/constants.h
+index 51685693af2e..cb4c06cead2d 100644
+--- a/drivers/net/wireless/iwlwifi/mvm/constants.h
++++ b/drivers/net/wireless/iwlwifi/mvm/constants.h
+@@ -80,7 +80,7 @@
+ #define IWL_MVM_WOWLAN_PS_SNOOZE_WINDOW 25
+ #define IWL_MVM_LOWLAT_QUOTA_MIN_PERCENT 64
+ #define IWL_MVM_BT_COEX_SYNC2SCO 1
+-#define IWL_MVM_BT_COEX_CORUNNING 1
++#define IWL_MVM_BT_COEX_CORUNNING 0
+ #define IWL_MVM_BT_COEX_MPLUT 1
+
+ #endif /* __MVM_CONSTANTS_H */
+diff --git a/drivers/net/wireless/iwlwifi/pcie/drv.c b/drivers/net/wireless/iwlwifi/pcie/drv.c
+index 98950e45c7b0..78eaa4875bd7 100644
+--- a/drivers/net/wireless/iwlwifi/pcie/drv.c
++++ b/drivers/net/wireless/iwlwifi/pcie/drv.c
+@@ -273,6 +273,8 @@ static DEFINE_PCI_DEVICE_TABLE(iwl_hw_card_ids) = {
+ {IWL_PCI_DEVICE(0x08B1, 0x4070, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4072, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4170, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0x4C60, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0x4C70, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4060, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x406A, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0x4160, iwl7260_2n_cfg)},
+@@ -316,6 +318,8 @@ static DEFINE_PCI_DEVICE_TABLE(iwl_hw_card_ids) = {
+ {IWL_PCI_DEVICE(0x08B1, 0xC770, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B1, 0xC760, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC270, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0xCC70, iwl7260_2ac_cfg)},
++ {IWL_PCI_DEVICE(0x08B1, 0xCC60, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC272, iwl7260_2ac_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC260, iwl7260_2n_cfg)},
+ {IWL_PCI_DEVICE(0x08B2, 0xC26A, iwl7260_n_cfg)},
+diff --git a/drivers/net/wireless/rt2x00/rt2800.h b/drivers/net/wireless/rt2x00/rt2800.h
+index a394a9a95919..7cf6081a05a1 100644
+--- a/drivers/net/wireless/rt2x00/rt2800.h
++++ b/drivers/net/wireless/rt2x00/rt2800.h
+@@ -2039,7 +2039,7 @@ struct mac_iveiv_entry {
+ * 2 - drop tx power by 12dBm,
+ * 3 - increase tx power by 6dBm
+ */
+-#define BBP1_TX_POWER_CTRL FIELD8(0x07)
++#define BBP1_TX_POWER_CTRL FIELD8(0x03)
+ #define BBP1_TX_ANTENNA FIELD8(0x18)
+
+ /*
+diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
+index ce23e0f076b6..db5abef6cec0 100644
+--- a/drivers/pci/host/pci-mvebu.c
++++ b/drivers/pci/host/pci-mvebu.c
+@@ -873,7 +873,7 @@ static int mvebu_get_tgt_attr(struct device_node *np, int devfn,
+ rangesz = pna + na + ns;
+ nranges = rlen / sizeof(__be32) / rangesz;
+
+- for (i = 0; i < nranges; i++) {
++ for (i = 0; i < nranges; i++, range += rangesz) {
+ u32 flags = of_read_number(range, 1);
+ u32 slot = of_read_number(range + 1, 1);
+ u64 cpuaddr = of_read_number(range + na, pna);
+@@ -883,14 +883,14 @@ static int mvebu_get_tgt_attr(struct device_node *np, int devfn,
+ rtype = IORESOURCE_IO;
+ else if (DT_FLAGS_TO_TYPE(flags) == DT_TYPE_MEM32)
+ rtype = IORESOURCE_MEM;
++ else
++ continue;
+
+ if (slot == PCI_SLOT(devfn) && type == rtype) {
+ *tgt = DT_CPUADDR_TO_TARGET(cpuaddr);
+ *attr = DT_CPUADDR_TO_ATTR(cpuaddr);
+ return 0;
+ }
+-
+- range += rangesz;
+ }
+
+ return -ENOENT;
+diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
+index 9ff0a901ecf7..76ef7914c9aa 100644
+--- a/drivers/pci/pci-sysfs.c
++++ b/drivers/pci/pci-sysfs.c
+@@ -177,7 +177,7 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
+ {
+ struct pci_dev *pci_dev = to_pci_dev(dev);
+
+- return sprintf(buf, "pci:v%08Xd%08Xsv%08Xsd%08Xbc%02Xsc%02Xi%02x\n",
++ return sprintf(buf, "pci:v%08Xd%08Xsv%08Xsd%08Xbc%02Xsc%02Xi%02X\n",
+ pci_dev->vendor, pci_dev->device,
+ pci_dev->subsystem_vendor, pci_dev->subsystem_device,
+ (u8)(pci_dev->class >> 16), (u8)(pci_dev->class >> 8),
+diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
+index d0f69269eb6c..cc09b14b8ac1 100644
+--- a/drivers/pci/quirks.c
++++ b/drivers/pci/quirks.c
+@@ -24,6 +24,7 @@
+ #include <linux/ioport.h>
+ #include <linux/sched.h>
+ #include <linux/ktime.h>
++#include <linux/mm.h>
+ #include <asm/dma.h> /* isa_dma_bridge_buggy */
+ #include "pci.h"
+
+@@ -287,6 +288,25 @@ static void quirk_citrine(struct pci_dev *dev)
+ }
+ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_IBM, PCI_DEVICE_ID_IBM_CITRINE, quirk_citrine);
+
++/* On IBM Crocodile ipr SAS adapters, expand BAR to system page size */
++static void quirk_extend_bar_to_page(struct pci_dev *dev)
++{
++ int i;
++
++ for (i = 0; i < PCI_STD_RESOURCE_END; i++) {
++ struct resource *r = &dev->resource[i];
++
++ if (r->flags & IORESOURCE_MEM && resource_size(r) < PAGE_SIZE) {
++ r->end = PAGE_SIZE - 1;
++ r->start = 0;
++ r->flags |= IORESOURCE_UNSET;
++ dev_info(&dev->dev, "expanded BAR %d to page size: %pR\n",
++ i, r);
++ }
++ }
++}
++DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_IBM, 0x034a, quirk_extend_bar_to_page);
++
+ /*
+ * S3 868 and 968 chips report region size equal to 32M, but they decode 64M.
+ * If it's needed, re-allocate the region.
+diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
+index a5a63ecfb628..a70b8715a315 100644
+--- a/drivers/pci/setup-bus.c
++++ b/drivers/pci/setup-bus.c
+@@ -1652,7 +1652,7 @@ void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
+ struct pci_dev_resource *fail_res;
+ int retval;
+ unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
+- IORESOURCE_PREFETCH;
++ IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
+
+ again:
+ __pci_bus_size_bridges(parent, &add_list);
+diff --git a/drivers/regulator/ltc3589.c b/drivers/regulator/ltc3589.c
+index c8105182b8b8..bef5842d0777 100644
+--- a/drivers/regulator/ltc3589.c
++++ b/drivers/regulator/ltc3589.c
+@@ -372,6 +372,7 @@ static bool ltc3589_volatile_reg(struct device *dev, unsigned int reg)
+ switch (reg) {
+ case LTC3589_IRQSTAT:
+ case LTC3589_PGSTAT:
++ case LTC3589_VCCR:
+ return true;
+ }
+ return false;
+diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
+index b0e4a3eb33c7..5b2e76159b41 100644
+--- a/drivers/rtc/rtc-cmos.c
++++ b/drivers/rtc/rtc-cmos.c
+@@ -856,7 +856,7 @@ static void __exit cmos_do_remove(struct device *dev)
+ cmos->dev = NULL;
+ }
+
+-#ifdef CONFIG_PM_SLEEP
++#ifdef CONFIG_PM
+
+ static int cmos_suspend(struct device *dev)
+ {
+@@ -907,6 +907,8 @@ static inline int cmos_poweroff(struct device *dev)
+ return cmos_suspend(dev);
+ }
+
++#ifdef CONFIG_PM_SLEEP
++
+ static int cmos_resume(struct device *dev)
+ {
+ struct cmos_rtc *cmos = dev_get_drvdata(dev);
+@@ -954,6 +956,7 @@ static int cmos_resume(struct device *dev)
+ return 0;
+ }
+
++#endif
+ #else
+
+ static inline int cmos_poweroff(struct device *dev)
+diff --git a/drivers/scsi/be2iscsi/be_mgmt.c b/drivers/scsi/be2iscsi/be_mgmt.c
+index 07934b0b9ee1..accceb57ddbc 100644
+--- a/drivers/scsi/be2iscsi/be_mgmt.c
++++ b/drivers/scsi/be2iscsi/be_mgmt.c
+@@ -944,17 +944,20 @@ mgmt_static_ip_modify(struct beiscsi_hba *phba,
+
+ if (ip_action == IP_ACTION_ADD) {
+ memcpy(req->ip_params.ip_record.ip_addr.addr, ip_param->value,
+- ip_param->len);
++ sizeof(req->ip_params.ip_record.ip_addr.addr));
+
+ if (subnet_param)
+ memcpy(req->ip_params.ip_record.ip_addr.subnet_mask,
+- subnet_param->value, subnet_param->len);
++ subnet_param->value,
++ sizeof(req->ip_params.ip_record.ip_addr.subnet_mask));
+ } else {
+ memcpy(req->ip_params.ip_record.ip_addr.addr,
+- if_info->ip_addr.addr, ip_param->len);
++ if_info->ip_addr.addr,
++ sizeof(req->ip_params.ip_record.ip_addr.addr));
+
+ memcpy(req->ip_params.ip_record.ip_addr.subnet_mask,
+- if_info->ip_addr.subnet_mask, ip_param->len);
++ if_info->ip_addr.subnet_mask,
++ sizeof(req->ip_params.ip_record.ip_addr.subnet_mask));
+ }
+
+ rc = mgmt_exec_nonemb_cmd(phba, &nonemb_cmd, NULL, 0);
+@@ -982,7 +985,7 @@ static int mgmt_modify_gateway(struct beiscsi_hba *phba, uint8_t *gt_addr,
+ req->action = gtway_action;
+ req->ip_addr.ip_type = BE2_IPV4;
+
+- memcpy(req->ip_addr.addr, gt_addr, param_len);
++ memcpy(req->ip_addr.addr, gt_addr, sizeof(req->ip_addr.addr));
+
+ return mgmt_exec_nonemb_cmd(phba, &nonemb_cmd, NULL, 0);
+ }
+diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
+index d96bfb55e57b..5072251cdb8b 100644
+--- a/drivers/scsi/qla2xxx/qla_os.c
++++ b/drivers/scsi/qla2xxx/qla_os.c
+@@ -3111,10 +3111,8 @@ qla2x00_unmap_iobases(struct qla_hw_data *ha)
+ }
+
+ static void
+-qla2x00_clear_drv_active(scsi_qla_host_t *vha)
++qla2x00_clear_drv_active(struct qla_hw_data *ha)
+ {
+- struct qla_hw_data *ha = vha->hw;
+-
+ if (IS_QLA8044(ha)) {
+ qla8044_idc_lock(ha);
+ qla8044_clear_drv_active(ha);
+@@ -3185,7 +3183,7 @@ qla2x00_remove_one(struct pci_dev *pdev)
+
+ scsi_host_put(base_vha->host);
+
+- qla2x00_clear_drv_active(base_vha);
++ qla2x00_clear_drv_active(ha);
+
+ qla2x00_unmap_iobases(ha);
+
+diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
+index e632e14180cf..bcc449a0c3a7 100644
+--- a/drivers/scsi/qla2xxx/qla_target.c
++++ b/drivers/scsi/qla2xxx/qla_target.c
+@@ -1431,12 +1431,10 @@ static inline void qlt_unmap_sg(struct scsi_qla_host *vha,
+ static int qlt_check_reserve_free_req(struct scsi_qla_host *vha,
+ uint32_t req_cnt)
+ {
+- struct qla_hw_data *ha = vha->hw;
+- device_reg_t __iomem *reg = ha->iobase;
+ uint32_t cnt;
+
+ if (vha->req->cnt < (req_cnt + 2)) {
+-		cnt = (uint16_t)RD_REG_DWORD(&reg->isp24.req_q_out);
++ cnt = (uint16_t)RD_REG_DWORD(vha->req->req_q_out);
+
+ ql_dbg(ql_dbg_tgt, vha, 0xe00a,
+ "Request ring circled: cnt=%d, vha->->ring_index=%d, "
+@@ -3277,6 +3275,7 @@ static int qlt_handle_cmd_for_atio(struct scsi_qla_host *vha,
+ return -ENOMEM;
+
+ memcpy(&op->atio, atio, sizeof(*atio));
++ op->vha = vha;
+ INIT_WORK(&op->work, qlt_create_sess_from_atio);
+ queue_work(qla_tgt_wq, &op->work);
+ return 0;
+diff --git a/drivers/spi/spi-dw-mid.c b/drivers/spi/spi-dw-mid.c
+index 6d207afec8cb..a4c45ea8f688 100644
+--- a/drivers/spi/spi-dw-mid.c
++++ b/drivers/spi/spi-dw-mid.c
+@@ -89,7 +89,13 @@ err_exit:
+
+ static void mid_spi_dma_exit(struct dw_spi *dws)
+ {
++ if (!dws->dma_inited)
++ return;
++
++ dmaengine_terminate_all(dws->txchan);
+ dma_release_channel(dws->txchan);
++
++ dmaengine_terminate_all(dws->rxchan);
+ dma_release_channel(dws->rxchan);
+ }
+
+@@ -136,7 +142,7 @@ static int mid_spi_dma_transfer(struct dw_spi *dws, int cs_change)
+ txconf.dst_addr = dws->dma_addr;
+ txconf.dst_maxburst = LNW_DMA_MSIZE_16;
+ txconf.src_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+- txconf.dst_addr_width = DMA_SLAVE_BUSWIDTH_2_BYTES;
++ txconf.dst_addr_width = dws->dma_width;
+ txconf.device_fc = false;
+
+ txchan->device->device_control(txchan, DMA_SLAVE_CONFIG,
+@@ -159,7 +165,7 @@ static int mid_spi_dma_transfer(struct dw_spi *dws, int cs_change)
+ rxconf.src_addr = dws->dma_addr;
+ rxconf.src_maxburst = LNW_DMA_MSIZE_16;
+ rxconf.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+- rxconf.src_addr_width = DMA_SLAVE_BUSWIDTH_2_BYTES;
++ rxconf.src_addr_width = dws->dma_width;
+ rxconf.device_fc = false;
+
+ rxchan->device->device_control(rxchan, DMA_SLAVE_CONFIG,
+diff --git a/drivers/tty/serial/omap-serial.c b/drivers/tty/serial/omap-serial.c
+index d017cec8a34a..e454b7c2ecd9 100644
+--- a/drivers/tty/serial/omap-serial.c
++++ b/drivers/tty/serial/omap-serial.c
+@@ -254,8 +254,16 @@ serial_omap_baud_is_mode16(struct uart_port *port, unsigned int baud)
+ {
+ unsigned int n13 = port->uartclk / (13 * baud);
+ unsigned int n16 = port->uartclk / (16 * baud);
+- int baudAbsDiff13 = baud - (port->uartclk / (13 * n13));
+- int baudAbsDiff16 = baud - (port->uartclk / (16 * n16));
++ int baudAbsDiff13;
++ int baudAbsDiff16;
++
++ if (n13 == 0)
++ n13 = 1;
++ if (n16 == 0)
++ n16 = 1;
++
++ baudAbsDiff13 = baud - (port->uartclk / (13 * n13));
++ baudAbsDiff16 = baud - (port->uartclk / (16 * n16));
+ if (baudAbsDiff13 < 0)
+ baudAbsDiff13 = -baudAbsDiff13;
+ if (baudAbsDiff16 < 0)
+diff --git a/drivers/usb/gadget/Kconfig b/drivers/usb/gadget/Kconfig
+index ba18e9c110cc..77ad6a944129 100644
+--- a/drivers/usb/gadget/Kconfig
++++ b/drivers/usb/gadget/Kconfig
+@@ -438,7 +438,7 @@ config USB_GOKU
+ gadget drivers to also be dynamically linked.
+
+ config USB_EG20T
+- tristate "Intel EG20T PCH/LAPIS Semiconductor IOH(ML7213/ML7831) UDC"
++ tristate "Intel QUARK X1000/EG20T PCH/LAPIS Semiconductor IOH(ML7213/ML7831) UDC"
+ depends on PCI
+ help
+ This is a USB device driver for EG20T PCH.
+@@ -459,6 +459,7 @@ config USB_EG20T
+ ML7213/ML7831 is companion chip for Intel Atom E6xx series.
+ ML7213/ML7831 is completely compatible for Intel EG20T PCH.
+
++ This driver can be used with Intel's Quark X1000 SOC platform
+ #
+ # LAST -- dummy/emulated controller
+ #
+diff --git a/drivers/usb/gadget/pch_udc.c b/drivers/usb/gadget/pch_udc.c
+index eb8c3bedb57a..460d953c91b6 100644
+--- a/drivers/usb/gadget/pch_udc.c
++++ b/drivers/usb/gadget/pch_udc.c
+@@ -343,6 +343,7 @@ struct pch_vbus_gpio_data {
+ * @setup_data: Received setup data
+ * @phys_addr: of device memory
+ * @base_addr: for mapped device memory
++ * @bar: Indicates which PCI BAR for USB regs
+ * @irq: IRQ line for the device
+ * @cfg_data: current cfg, intf, and alt in use
+ * @vbus_gpio: GPIO informaton for detecting VBUS
+@@ -370,14 +371,17 @@ struct pch_udc_dev {
+ struct usb_ctrlrequest setup_data;
+ unsigned long phys_addr;
+ void __iomem *base_addr;
++ unsigned bar;
+ unsigned irq;
+ struct pch_udc_cfg_data cfg_data;
+ struct pch_vbus_gpio_data vbus_gpio;
+ };
+ #define to_pch_udc(g) (container_of((g), struct pch_udc_dev, gadget))
+
++#define PCH_UDC_PCI_BAR_QUARK_X1000 0
+ #define PCH_UDC_PCI_BAR 1
+ #define PCI_DEVICE_ID_INTEL_EG20T_UDC 0x8808
++#define PCI_DEVICE_ID_INTEL_QUARK_X1000_UDC 0x0939
+ #define PCI_VENDOR_ID_ROHM 0x10DB
+ #define PCI_DEVICE_ID_ML7213_IOH_UDC 0x801D
+ #define PCI_DEVICE_ID_ML7831_IOH_UDC 0x8808
+@@ -3076,7 +3080,7 @@ static void pch_udc_remove(struct pci_dev *pdev)
+ iounmap(dev->base_addr);
+ if (dev->mem_region)
+ release_mem_region(dev->phys_addr,
+- pci_resource_len(pdev, PCH_UDC_PCI_BAR));
++ pci_resource_len(pdev, dev->bar));
+ if (dev->active)
+ pci_disable_device(pdev);
+ kfree(dev);
+@@ -3144,9 +3148,15 @@ static int pch_udc_probe(struct pci_dev *pdev,
+ dev->active = 1;
+ pci_set_drvdata(pdev, dev);
+
++ /* Determine BAR based on PCI ID */
++ if (id->device == PCI_DEVICE_ID_INTEL_QUARK_X1000_UDC)
++ dev->bar = PCH_UDC_PCI_BAR_QUARK_X1000;
++ else
++ dev->bar = PCH_UDC_PCI_BAR;
++
+ /* PCI resource allocation */
+- resource = pci_resource_start(pdev, 1);
+- len = pci_resource_len(pdev, 1);
++ resource = pci_resource_start(pdev, dev->bar);
++ len = pci_resource_len(pdev, dev->bar);
+
+ if (!request_mem_region(resource, len, KBUILD_MODNAME)) {
+ dev_err(&pdev->dev, "%s: pci device used already\n", __func__);
+@@ -3212,6 +3222,12 @@ finished:
+
+ static const struct pci_device_id pch_udc_pcidev_id[] = {
+ {
++ PCI_DEVICE(PCI_VENDOR_ID_INTEL,
++ PCI_DEVICE_ID_INTEL_QUARK_X1000_UDC),
++ .class = (PCI_CLASS_SERIAL_USB << 8) | 0xfe,
++ .class_mask = 0xffffffff,
++ },
++ {
+ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_EG20T_UDC),
+ .class = (PCI_CLASS_SERIAL_USB << 8) | 0xfe,
+ .class_mask = 0xffffffff,
+diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
+index eea26e1b2fda..d738ff8ab81c 100644
+--- a/fs/btrfs/dev-replace.c
++++ b/fs/btrfs/dev-replace.c
+@@ -567,6 +567,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
+ btrfs_kobj_rm_device(fs_info, src_device);
+ btrfs_kobj_add_device(fs_info, tgt_device);
+
++ btrfs_dev_replace_unlock(dev_replace);
++
+ btrfs_rm_dev_replace_blocked(fs_info);
+
+ btrfs_rm_dev_replace_srcdev(fs_info, src_device);
+@@ -580,7 +582,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
+ * superblock is scratched out so that it is no longer marked to
+ * belong to this filesystem.
+ */
+- btrfs_dev_replace_unlock(dev_replace);
+ mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+ mutex_unlock(&root->fs_info->chunk_mutex);
+
+diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
+index 8edb9fcc38d5..feff017a47d9 100644
+--- a/fs/btrfs/extent-tree.c
++++ b/fs/btrfs/extent-tree.c
+@@ -4508,7 +4508,13 @@ again:
+ space_info->flush = 1;
+ } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
+ used += orig_bytes;
+- if (need_do_async_reclaim(space_info, root->fs_info, used) &&
++ /*
++ * We will do the space reservation dance during log replay,
++ * which means we won't have fs_info->fs_root set, so don't do
++ * the async reclaim as we will panic.
++ */
++ if (!root->fs_info->log_root_recovering &&
++ need_do_async_reclaim(space_info, root->fs_info, used) &&
+ !work_busy(&root->fs_info->async_reclaim_work))
+ queue_work(system_unbound_wq,
+ &root->fs_info->async_reclaim_work);
+diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
+index ab1fd668020d..2a15294f1683 100644
+--- a/fs/btrfs/file.c
++++ b/fs/btrfs/file.c
+@@ -2622,23 +2622,28 @@ static int find_desired_extent(struct inode *inode, loff_t *offset, int whence)
+ struct btrfs_root *root = BTRFS_I(inode)->root;
+ struct extent_map *em = NULL;
+ struct extent_state *cached_state = NULL;
+- u64 lockstart = *offset;
+- u64 lockend = i_size_read(inode);
+- u64 start = *offset;
+- u64 len = i_size_read(inode);
++ u64 lockstart;
++ u64 lockend;
++ u64 start;
++ u64 len;
+ int ret = 0;
+
+- lockend = max_t(u64, root->sectorsize, lockend);
++ if (inode->i_size == 0)
++ return -ENXIO;
++
++ /*
++ * *offset can be negative, in this case we start finding DATA/HOLE from
++ * the very start of the file.
++ */
++ start = max_t(loff_t, 0, *offset);
++
++ lockstart = round_down(start, root->sectorsize);
++ lockend = round_up(i_size_read(inode), root->sectorsize);
+ if (lockend <= lockstart)
+ lockend = lockstart + root->sectorsize;
+-
+ lockend--;
+ len = lockend - lockstart + 1;
+
+- len = max_t(u64, len, root->sectorsize);
+- if (inode->i_size == 0)
+- return -ENXIO;
+-
+ lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, 0,
+ &cached_state);
+
+diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
+index c6cd34e699d0..a3a8dee8030f 100644
+--- a/fs/btrfs/inode.c
++++ b/fs/btrfs/inode.c
+@@ -3656,7 +3656,8 @@ noinline int btrfs_update_inode(struct btrfs_trans_handle *trans,
+ * without delay
+ */
+ if (!btrfs_is_free_space_inode(inode)
+- && root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID) {
++ && root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
++ && !root->fs_info->log_root_recovering) {
+ btrfs_update_root_times(trans, root);
+
+ ret = btrfs_delayed_update_inode(trans, root, inode);
+diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
+index 47aceb494d1d..4e395f3f251d 100644
+--- a/fs/btrfs/ioctl.c
++++ b/fs/btrfs/ioctl.c
+@@ -332,6 +332,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
+ goto out_drop;
+
+ } else {
++ ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
++ if (ret && ret != -ENODATA)
++ goto out_drop;
+ ip->flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
+ }
+
+@@ -5309,6 +5312,12 @@ long btrfs_ioctl(struct file *file, unsigned int
+ if (ret)
+ return ret;
+ ret = btrfs_sync_fs(file->f_dentry->d_sb, 1);
++ /*
++ * The transaction thread may want to do more work,
++		 * namely it pokes the cleaner kthread that will start
++ * processing uncleaned subvols.
++ */
++ wake_up_process(root->fs_info->transaction_kthread);
+ return ret;
+ }
+ case BTRFS_IOC_START_SYNC:
+diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
+index 65245a07275b..56fe6ec409ac 100644
+--- a/fs/btrfs/relocation.c
++++ b/fs/btrfs/relocation.c
+@@ -736,7 +736,8 @@ again:
+ err = ret;
+ goto out;
+ }
+- BUG_ON(!ret || !path1->slots[0]);
++ ASSERT(ret);
++ ASSERT(path1->slots[0]);
+
+ path1->slots[0]--;
+
+@@ -746,10 +747,10 @@ again:
+ * the backref was added previously when processing
+ * backref of type BTRFS_TREE_BLOCK_REF_KEY
+ */
+- BUG_ON(!list_is_singular(&cur->upper));
++ ASSERT(list_is_singular(&cur->upper));
+ edge = list_entry(cur->upper.next, struct backref_edge,
+ list[LOWER]);
+- BUG_ON(!list_empty(&edge->list[UPPER]));
++ ASSERT(list_empty(&edge->list[UPPER]));
+ exist = edge->node[UPPER];
+ /*
+ * add the upper level block to pending list if we need
+@@ -831,7 +832,7 @@ again:
+ cur->cowonly = 1;
+ }
+ #else
+- BUG_ON(key.type == BTRFS_EXTENT_REF_V0_KEY);
++ ASSERT(key.type != BTRFS_EXTENT_REF_V0_KEY);
+ if (key.type == BTRFS_SHARED_BLOCK_REF_KEY) {
+ #endif
+ if (key.objectid == key.offset) {
+@@ -840,7 +841,7 @@ again:
+ * backref of this type.
+ */
+ root = find_reloc_root(rc, cur->bytenr);
+- BUG_ON(!root);
++ ASSERT(root);
+ cur->root = root;
+ break;
+ }
+@@ -868,7 +869,7 @@ again:
+ } else {
+ upper = rb_entry(rb_node, struct backref_node,
+ rb_node);
+- BUG_ON(!upper->checked);
++ ASSERT(upper->checked);
+ INIT_LIST_HEAD(&edge->list[UPPER]);
+ }
+ list_add_tail(&edge->list[LOWER], &cur->upper);
+@@ -892,7 +893,7 @@ again:
+
+ if (btrfs_root_level(&root->root_item) == cur->level) {
+ /* tree root */
+- BUG_ON(btrfs_root_bytenr(&root->root_item) !=
++ ASSERT(btrfs_root_bytenr(&root->root_item) ==
+ cur->bytenr);
+ if (should_ignore_root(root))
+ list_add(&cur->list, &useless);
+@@ -927,7 +928,7 @@ again:
+ need_check = true;
+ for (; level < BTRFS_MAX_LEVEL; level++) {
+ if (!path2->nodes[level]) {
+- BUG_ON(btrfs_root_bytenr(&root->root_item) !=
++ ASSERT(btrfs_root_bytenr(&root->root_item) ==
+ lower->bytenr);
+ if (should_ignore_root(root))
+ list_add(&lower->list, &useless);
+@@ -977,12 +978,15 @@ again:
+ need_check = false;
+ list_add_tail(&edge->list[UPPER],
+ &list);
+- } else
++ } else {
++ if (upper->checked)
++ need_check = true;
+ INIT_LIST_HEAD(&edge->list[UPPER]);
++ }
+ } else {
+ upper = rb_entry(rb_node, struct backref_node,
+ rb_node);
+- BUG_ON(!upper->checked);
++ ASSERT(upper->checked);
+ INIT_LIST_HEAD(&edge->list[UPPER]);
+ if (!upper->owner)
+ upper->owner = btrfs_header_owner(eb);
+@@ -1026,7 +1030,7 @@ next:
+ * everything goes well, connect backref nodes and insert backref nodes
+ * into the cache.
+ */
+- BUG_ON(!node->checked);
++ ASSERT(node->checked);
+ cowonly = node->cowonly;
+ if (!cowonly) {
+ rb_node = tree_insert(&cache->rb_root, node->bytenr,
+@@ -1062,8 +1066,21 @@ next:
+ continue;
+ }
+
+- BUG_ON(!upper->checked);
+- BUG_ON(cowonly != upper->cowonly);
++ if (!upper->checked) {
++ /*
++ * Still want to blow up for developers since this is a
++ * logic bug.
++ */
++ ASSERT(0);
++ err = -EINVAL;
++ goto out;
++ }
++ if (cowonly != upper->cowonly) {
++ ASSERT(0);
++ err = -EINVAL;
++ goto out;
++ }
++
+ if (!cowonly) {
+ rb_node = tree_insert(&cache->rb_root, upper->bytenr,
+ &upper->rb_node);
+@@ -1086,7 +1103,7 @@ next:
+ while (!list_empty(&useless)) {
+ upper = list_entry(useless.next, struct backref_node, list);
+ list_del_init(&upper->list);
+- BUG_ON(!list_empty(&upper->upper));
++ ASSERT(list_empty(&upper->upper));
+ if (upper == node)
+ node = NULL;
+ if (upper->lowest) {
+@@ -1119,29 +1136,45 @@ out:
+ if (err) {
+ while (!list_empty(&useless)) {
+ lower = list_entry(useless.next,
+- struct backref_node, upper);
+- list_del_init(&lower->upper);
++ struct backref_node, list);
++ list_del_init(&lower->list);
+ }
+- upper = node;
+- INIT_LIST_HEAD(&list);
+- while (upper) {
+- if (RB_EMPTY_NODE(&upper->rb_node)) {
+- list_splice_tail(&upper->upper, &list);
+- free_backref_node(cache, upper);
+- }
+-
+- if (list_empty(&list))
+- break;
+-
+- edge = list_entry(list.next, struct backref_edge,
+- list[LOWER]);
++ while (!list_empty(&list)) {
++ edge = list_first_entry(&list, struct backref_edge,
++ list[UPPER]);
++ list_del(&edge->list[UPPER]);
+ list_del(&edge->list[LOWER]);
++ lower = edge->node[LOWER];
+ upper = edge->node[UPPER];
+ free_backref_edge(cache, edge);
++
++ /*
++ * Lower is no longer linked to any upper backref nodes
++ * and isn't in the cache, we can free it ourselves.
++ */
++ if (list_empty(&lower->upper) &&
++ RB_EMPTY_NODE(&lower->rb_node))
++ list_add(&lower->list, &useless);
++
++ if (!RB_EMPTY_NODE(&upper->rb_node))
++ continue;
++
++			/* Add this guy's upper edges to the list to process */
++ list_for_each_entry(edge, &upper->upper, list[LOWER])
++ list_add_tail(&edge->list[UPPER], &list);
++ if (list_empty(&upper->upper))
++ list_add(&upper->list, &useless);
++ }
++
++ while (!list_empty(&useless)) {
++ lower = list_entry(useless.next,
++ struct backref_node, list);
++ list_del_init(&lower->list);
++ free_backref_node(cache, lower);
+ }
+ return ERR_PTR(err);
+ }
+- BUG_ON(node && node->detached);
++ ASSERT(!node || !node->detached);
+ return node;
+ }
+
+diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
+index d89c6d3542ca..98a25df1c430 100644
+--- a/fs/btrfs/transaction.c
++++ b/fs/btrfs/transaction.c
+@@ -609,7 +609,6 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
+ if (transid <= root->fs_info->last_trans_committed)
+ goto out;
+
+- ret = -EINVAL;
+ /* find specified transaction */
+ spin_lock(&root->fs_info->trans_lock);
+ list_for_each_entry(t, &root->fs_info->trans_list, list) {
+@@ -625,9 +624,16 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
+ }
+ }
+ spin_unlock(&root->fs_info->trans_lock);
+- /* The specified transaction doesn't exist */
+- if (!cur_trans)
++
++ /*
++ * The specified transaction doesn't exist, or we
++ * raced with btrfs_commit_transaction
++ */
++ if (!cur_trans) {
++ if (transid > root->fs_info->last_trans_committed)
++ ret = -EINVAL;
+ goto out;
++ }
+ } else {
+ /* find newest transaction that is committing | committed */
+ spin_lock(&root->fs_info->trans_lock);
+diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
+index d4a9431ec73c..57ee4c53b4f8 100644
+--- a/fs/ecryptfs/inode.c
++++ b/fs/ecryptfs/inode.c
+@@ -1039,7 +1039,7 @@ ecryptfs_setxattr(struct dentry *dentry, const char *name, const void *value,
+ }
+
+ rc = vfs_setxattr(lower_dentry, name, value, size, flags);
+- if (!rc)
++ if (!rc && dentry->d_inode)
+ fsstack_copy_attr_all(dentry->d_inode, lower_dentry->d_inode);
+ out:
+ return rc;
+diff --git a/fs/namespace.c b/fs/namespace.c
+index 140d17705683..e544a0680a7c 100644
+--- a/fs/namespace.c
++++ b/fs/namespace.c
+@@ -1374,6 +1374,8 @@ static int do_umount(struct mount *mnt, int flags)
+ * Special case for "unmounting" root ...
+ * we just try to remount it readonly.
+ */
++ if (!capable(CAP_SYS_ADMIN))
++ return -EPERM;
+ down_write(&sb->s_umount);
+ if (!(sb->s_flags & MS_RDONLY))
+ retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
+diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
+index 3275e94538e7..43fd8c557fe9 100644
+--- a/fs/nfs/nfs4proc.c
++++ b/fs/nfs/nfs4proc.c
+@@ -7242,7 +7242,7 @@ static int nfs41_proc_async_sequence(struct nfs_client *clp, struct rpc_cred *cr
+ int ret = 0;
+
+ if ((renew_flags & NFS4_RENEW_TIMEOUT) == 0)
+- return 0;
++ return -EAGAIN;
+ task = _nfs41_proc_sequence(clp, cred, false);
+ if (IS_ERR(task))
+ ret = PTR_ERR(task);
+diff --git a/fs/nfs/nfs4renewd.c b/fs/nfs/nfs4renewd.c
+index 1720d32ffa54..e1ba58c3d1ad 100644
+--- a/fs/nfs/nfs4renewd.c
++++ b/fs/nfs/nfs4renewd.c
+@@ -88,10 +88,18 @@ nfs4_renew_state(struct work_struct *work)
+ }
+ nfs_expire_all_delegations(clp);
+ } else {
++ int ret;
++
+ /* Queue an asynchronous RENEW. */
+- ops->sched_state_renewal(clp, cred, renew_flags);
++ ret = ops->sched_state_renewal(clp, cred, renew_flags);
+ put_rpccred(cred);
+- goto out_exp;
++ switch (ret) {
++ default:
++ goto out_exp;
++ case -EAGAIN:
++ case -ENOMEM:
++ break;
++ }
+ }
+ } else {
+ dprintk("%s: failed to call renewd. Reason: lease not expired \n",
+diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
+index 848f6853c59e..db7792c30462 100644
+--- a/fs/nfs/nfs4state.c
++++ b/fs/nfs/nfs4state.c
+@@ -1732,7 +1732,8 @@ restart:
+ if (status < 0) {
+ set_bit(ops->owner_flag_bit, &sp->so_flags);
+ nfs4_put_state_owner(sp);
+- return nfs4_recovery_handle_error(clp, status);
++ status = nfs4_recovery_handle_error(clp, status);
++ return (status != 0) ? status : -EAGAIN;
+ }
+
+ nfs4_put_state_owner(sp);
+@@ -1741,7 +1742,7 @@ restart:
+ spin_unlock(&clp->cl_lock);
+ }
+ rcu_read_unlock();
+- return status;
++ return 0;
+ }
+
+ static int nfs4_check_lease(struct nfs_client *clp)
+@@ -1788,7 +1789,6 @@ static int nfs4_handle_reclaim_lease_error(struct nfs_client *clp, int status)
+ break;
+ case -NFS4ERR_STALE_CLIENTID:
+ clear_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);
+- nfs4_state_clear_reclaim_reboot(clp);
+ nfs4_state_start_reclaim_reboot(clp);
+ break;
+ case -NFS4ERR_CLID_INUSE:
+@@ -2372,6 +2372,7 @@ static void nfs4_state_manager(struct nfs_client *clp)
+ status = nfs4_check_lease(clp);
+ if (status < 0)
+ goto out_error;
++ continue;
+ }
+
+ if (test_and_clear_bit(NFS4CLNT_MOVED, &clp->cl_state)) {
+@@ -2393,14 +2394,11 @@ static void nfs4_state_manager(struct nfs_client *clp)
+ section = "reclaim reboot";
+ status = nfs4_do_reclaim(clp,
+ clp->cl_mvops->reboot_recovery_ops);
+- if (test_bit(NFS4CLNT_LEASE_EXPIRED, &clp->cl_state) ||
+- test_bit(NFS4CLNT_SESSION_RESET, &clp->cl_state))
+- continue;
+- nfs4_state_end_reclaim_reboot(clp);
+- if (test_bit(NFS4CLNT_RECLAIM_NOGRACE, &clp->cl_state))
++ if (status == -EAGAIN)
+ continue;
+ if (status < 0)
+ goto out_error;
++ nfs4_state_end_reclaim_reboot(clp);
+ }
+
+ /* Now recover expired state... */
+@@ -2408,9 +2406,7 @@ static void nfs4_state_manager(struct nfs_client *clp)
+ section = "reclaim nograce";
+ status = nfs4_do_reclaim(clp,
+ clp->cl_mvops->nograce_recovery_ops);
+- if (test_bit(NFS4CLNT_LEASE_EXPIRED, &clp->cl_state) ||
+- test_bit(NFS4CLNT_SESSION_RESET, &clp->cl_state) ||
+- test_bit(NFS4CLNT_RECLAIM_REBOOT, &clp->cl_state))
++ if (status == -EAGAIN)
+ continue;
+ if (status < 0)
+ goto out_error;
+diff --git a/fs/nfs/pagelist.c b/fs/nfs/pagelist.c
+index 34136ff5abf0..3a9c34a0f898 100644
+--- a/fs/nfs/pagelist.c
++++ b/fs/nfs/pagelist.c
+@@ -527,7 +527,8 @@ EXPORT_SYMBOL_GPL(nfs_pgio_header_free);
+ */
+ void nfs_pgio_data_destroy(struct nfs_pgio_header *hdr)
+ {
+- put_nfs_open_context(hdr->args.context);
++ if (hdr->args.context)
++ put_nfs_open_context(hdr->args.context);
+ if (hdr->page_array.pagevec != hdr->page_array.page_array)
+ kfree(hdr->page_array.pagevec);
+ }
+@@ -753,12 +754,11 @@ int nfs_generic_pgio(struct nfs_pageio_descriptor *desc,
+ nfs_list_remove_request(req);
+ nfs_list_add_request(req, &hdr->pages);
+
+- if (WARN_ON_ONCE(pageused >= pagecount))
+- return nfs_pgio_error(desc, hdr);
+-
+ if (!last_page || last_page != req->wb_page) {
+- *pages++ = last_page = req->wb_page;
+ pageused++;
++ if (pageused > pagecount)
++ break;
++ *pages++ = last_page = req->wb_page;
+ }
+ }
+ if (WARN_ON_ONCE(pageused != pagecount))
+diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
+index 1d5103dfc203..96338175a2fe 100644
+--- a/fs/nfsd/nfs4xdr.c
++++ b/fs/nfsd/nfs4xdr.c
+@@ -1675,6 +1675,14 @@ nfsd4_decode_compound(struct nfsd4_compoundargs *argp)
+ readbytes += nfsd4_max_reply(argp->rqstp, op);
+ } else
+ max_reply += nfsd4_max_reply(argp->rqstp, op);
++ /*
++ * OP_LOCK may return a conflicting lock. (Special case
++ * because it will just skip encoding this if it runs
++ * out of xdr buffer space, and it is the only operation
++ * that behaves this way.)
++ */
++ if (op->opnum == OP_LOCK)
++ max_reply += NFS4_OPAQUE_LIMIT;
+
+ if (op->status) {
+ argp->opcnt = i+1;
+diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
+index 2685bc9ea2c9..ec50a8385b13 100644
+--- a/fs/notify/fanotify/fanotify_user.c
++++ b/fs/notify/fanotify/fanotify_user.c
+@@ -78,7 +78,7 @@ static int create_fd(struct fsnotify_group *group,
+
+ pr_debug("%s: group=%p event=%p\n", __func__, group, event);
+
+- client_fd = get_unused_fd();
++ client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);
+ if (client_fd < 0)
+ return client_fd;
+
+diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
+index 02614349690d..4ff074bc2a7d 100644
+--- a/fs/xfs/xfs_aops.c
++++ b/fs/xfs/xfs_aops.c
+@@ -434,10 +434,22 @@ xfs_start_page_writeback(
+ {
+ ASSERT(PageLocked(page));
+ ASSERT(!PageWriteback(page));
+- if (clear_dirty)
++
++ /*
++ * if the page was not fully cleaned, we need to ensure that the higher
++ * layers come back to it correctly. That means we need to keep the page
++ * dirty, and for WB_SYNC_ALL writeback we need to ensure the
++ * PAGECACHE_TAG_TOWRITE index mark is not removed so another attempt to
++ * write this page in this writeback sweep will be made.
++ */
++ if (clear_dirty) {
+ clear_page_dirty_for_io(page);
+- set_page_writeback(page);
++ set_page_writeback(page);
++ } else
++ set_page_writeback_keepwrite(page);
++
+ unlock_page(page);
++
+ /* If no buffers on the page are to be written, finish it here */
+ if (!buffers)
+ end_page_writeback(page);
+diff --git a/include/linux/compiler-gcc5.h b/include/linux/compiler-gcc5.h
+new file mode 100644
+index 000000000000..cdd1cc202d51
+--- /dev/null
++++ b/include/linux/compiler-gcc5.h
+@@ -0,0 +1,66 @@
++#ifndef __LINUX_COMPILER_H
++#error "Please don't include <linux/compiler-gcc5.h> directly, include <linux/compiler.h> instead."
++#endif
++
++#define __used __attribute__((__used__))
++#define __must_check __attribute__((warn_unused_result))
++#define __compiler_offsetof(a, b) __builtin_offsetof(a, b)
++
++/* Mark functions as cold. gcc will assume any path leading to a call
++ to them will be unlikely. This means a lot of manual unlikely()s
++ are unnecessary now for any paths leading to the usual suspects
++ like BUG(), printk(), panic() etc. [but let's keep them for now for
++ older compilers]
++
++ Early snapshots of gcc 4.3 don't support this and we can't detect this
++ in the preprocessor, but we can live with this because they're unreleased.
++ Maketime probing would be overkill here.
++
++ gcc also has a __attribute__((__hot__)) to move hot functions into
++ a special section, but I don't see any sense in this right now in
++ the kernel context */
++#define __cold __attribute__((__cold__))
++
++#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
++
++#ifndef __CHECKER__
++# define __compiletime_warning(message) __attribute__((warning(message)))
++# define __compiletime_error(message) __attribute__((error(message)))
++#endif /* __CHECKER__ */
++
++/*
++ * Mark a position in code as unreachable. This can be used to
++ * suppress control flow warnings after asm blocks that transfer
++ * control elsewhere.
++ *
++ * Early snapshots of gcc 4.5 don't support this and we can't detect
++ * this in the preprocessor, but we can live with this because they're
++ * unreleased. Really, we need to have autoconf for the kernel.
++ */
++#define unreachable() __builtin_unreachable()
++
++/* Mark a function definition as prohibited from being cloned. */
++#define __noclone __attribute__((__noclone__))
++
++/*
++ * Tell the optimizer that something else uses this function or variable.
++ */
++#define __visible __attribute__((externally_visible))
++
++/*
++ * GCC 'asm goto' miscompiles certain code sequences:
++ *
++ * http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670
++ *
++ * Work it around via a compiler barrier quirk suggested by Jakub Jelinek.
++ * Fixed in GCC 4.8.2 and later versions.
++ *
++ * (asm goto is automatically volatile - the naming reflects this.)
++ */
++#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
++
++#ifdef CONFIG_ARCH_USE_BUILTIN_BSWAP
++#define __HAVE_BUILTIN_BSWAP32__
++#define __HAVE_BUILTIN_BSWAP64__
++#define __HAVE_BUILTIN_BSWAP16__
++#endif /* CONFIG_ARCH_USE_BUILTIN_BSWAP */
+diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
+index 7fa31731c854..83a76633c03e 100644
+--- a/include/linux/pci_ids.h
++++ b/include/linux/pci_ids.h
+@@ -2555,6 +2555,7 @@
+ #define PCI_DEVICE_ID_INTEL_MFD_EMMC0 0x0823
+ #define PCI_DEVICE_ID_INTEL_MFD_EMMC1 0x0824
+ #define PCI_DEVICE_ID_INTEL_MRST_SD2 0x084F
++#define PCI_DEVICE_ID_INTEL_QUARK_X1000_ILB 0x095E
+ #define PCI_DEVICE_ID_INTEL_I960 0x0960
+ #define PCI_DEVICE_ID_INTEL_I960RM 0x0962
+ #define PCI_DEVICE_ID_INTEL_CENTERTON_ILB 0x0c60
+diff --git a/include/linux/sched.h b/include/linux/sched.h
+index 0376b054a0d0..c5cc872b351d 100644
+--- a/include/linux/sched.h
++++ b/include/linux/sched.h
+@@ -1947,11 +1947,13 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
+ #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
+ #define used_math() tsk_used_math(current)
+
+-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags */
++/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
++ * __GFP_FS is also cleared as it implies __GFP_IO.
++ */
+ static inline gfp_t memalloc_noio_flags(gfp_t flags)
+ {
+ if (unlikely(current->flags & PF_MEMALLOC_NOIO))
+- flags &= ~__GFP_IO;
++ flags &= ~(__GFP_IO | __GFP_FS);
+ return flags;
+ }
+
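The sched.h hunk above widens the PF_MEMALLOC_NOIO mask: because `__GFP_FS` implies `__GFP_IO`, both bits must be stripped in a NOIO allocation context. A minimal standalone sketch of that masking follows; the flag values and the `noio_flags`/`in_noio_context` names are illustrative stand-ins, not the kernel's real GFP bit layout or API.

```c
#include <assert.h>

/* Illustrative flag values only -- NOT the kernel's real GFP bit layout. */
#define GFP_IO     0x40u
#define GFP_FS     0x80u
#define GFP_KERNEL (GFP_IO | GFP_FS)

/* Sketch of the fixed memalloc_noio_flags(): in a NOIO allocation
 * context both IO and FS are stripped, because FS implies IO. */
static unsigned int noio_flags(unsigned int flags, int in_noio_context)
{
	if (in_noio_context)
		flags &= ~(GFP_IO | GFP_FS);
	return flags;
}
```

Masking only `__GFP_IO`, as the old code did, would still let an `__GFP_FS` allocation recurse into the filesystem, which is exactly what the NOIO context is meant to prevent.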
+diff --git a/include/uapi/linux/hyperv.h b/include/uapi/linux/hyperv.h
+index 78e4a86030dd..0a8e6badb29b 100644
+--- a/include/uapi/linux/hyperv.h
++++ b/include/uapi/linux/hyperv.h
+@@ -137,7 +137,7 @@ struct hv_do_fcopy {
+ __u64 offset;
+ __u32 size;
+ __u8 data[DATA_FRAGMENT];
+-};
++} __attribute__((packed));
+
+ /*
+ * An implementation of HyperV key value pair (KVP) functionality for Linux.
+diff --git a/kernel/futex.c b/kernel/futex.c
+index c20fb395a672..c5909b46af98 100644
+--- a/kernel/futex.c
++++ b/kernel/futex.c
+@@ -343,6 +343,8 @@ static void get_futex_key_refs(union futex_key *key)
+ case FUT_OFF_MMSHARED:
+ futex_get_mm(key); /* implies MB (B) */
+ break;
++ default:
++ smp_mb(); /* explicit MB (B) */
+ }
+ }
+
+diff --git a/lib/lzo/lzo1x_decompress_safe.c b/lib/lzo/lzo1x_decompress_safe.c
+index 8563081e8da3..a1c387f6afba 100644
+--- a/lib/lzo/lzo1x_decompress_safe.c
++++ b/lib/lzo/lzo1x_decompress_safe.c
+@@ -19,31 +19,21 @@
+ #include <linux/lzo.h>
+ #include "lzodefs.h"
+
+-#define HAVE_IP(t, x) \
+- (((size_t)(ip_end - ip) >= (size_t)(t + x)) && \
+- (((t + x) >= t) && ((t + x) >= x)))
++#define HAVE_IP(x) ((size_t)(ip_end - ip) >= (size_t)(x))
++#define HAVE_OP(x) ((size_t)(op_end - op) >= (size_t)(x))
++#define NEED_IP(x) if (!HAVE_IP(x)) goto input_overrun
++#define NEED_OP(x) if (!HAVE_OP(x)) goto output_overrun
++#define TEST_LB(m_pos) if ((m_pos) < out) goto lookbehind_overrun
+
+-#define HAVE_OP(t, x) \
+- (((size_t)(op_end - op) >= (size_t)(t + x)) && \
+- (((t + x) >= t) && ((t + x) >= x)))
+-
+-#define NEED_IP(t, x) \
+- do { \
+- if (!HAVE_IP(t, x)) \
+- goto input_overrun; \
+- } while (0)
+-
+-#define NEED_OP(t, x) \
+- do { \
+- if (!HAVE_OP(t, x)) \
+- goto output_overrun; \
+- } while (0)
+-
+-#define TEST_LB(m_pos) \
+- do { \
+- if ((m_pos) < out) \
+- goto lookbehind_overrun; \
+- } while (0)
++/* This MAX_255_COUNT is the maximum number of times we can add 255 to a base
++ * count without overflowing an integer. The multiply will overflow when
++ * multiplying 255 by more than MAXINT/255. The sum will overflow earlier
++ * depending on the base count. Since the base count is taken from a u8
++ * and a few bits, it is safe to assume that it will always be lower than
++ * or equal to 2*255, thus we can always prevent any overflow by accepting
++ * two less 255 steps. See Documentation/lzo.txt for more information.
++ */
++#define MAX_255_COUNT ((((size_t)~0) / 255) - 2)
+
+ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
+ unsigned char *out, size_t *out_len)
+@@ -75,17 +65,24 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
+ if (t < 16) {
+ if (likely(state == 0)) {
+ if (unlikely(t == 0)) {
++ size_t offset;
++ const unsigned char *ip_last = ip;
++
+ while (unlikely(*ip == 0)) {
+- t += 255;
+ ip++;
+- NEED_IP(1, 0);
++ NEED_IP(1);
+ }
+- t += 15 + *ip++;
++ offset = ip - ip_last;
++ if (unlikely(offset > MAX_255_COUNT))
++ return LZO_E_ERROR;
++
++ offset = (offset << 8) - offset;
++ t += offset + 15 + *ip++;
+ }
+ t += 3;
+ copy_literal_run:
+ #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+- if (likely(HAVE_IP(t, 15) && HAVE_OP(t, 15))) {
++ if (likely(HAVE_IP(t + 15) && HAVE_OP(t + 15))) {
+ const unsigned char *ie = ip + t;
+ unsigned char *oe = op + t;
+ do {
+@@ -101,8 +98,8 @@ copy_literal_run:
+ } else
+ #endif
+ {
+- NEED_OP(t, 0);
+- NEED_IP(t, 3);
++ NEED_OP(t);
++ NEED_IP(t + 3);
+ do {
+ *op++ = *ip++;
+ } while (--t > 0);
+@@ -115,7 +112,7 @@ copy_literal_run:
+ m_pos -= t >> 2;
+ m_pos -= *ip++ << 2;
+ TEST_LB(m_pos);
+- NEED_OP(2, 0);
++ NEED_OP(2);
+ op[0] = m_pos[0];
+ op[1] = m_pos[1];
+ op += 2;
+@@ -136,13 +133,20 @@ copy_literal_run:
+ } else if (t >= 32) {
+ t = (t & 31) + (3 - 1);
+ if (unlikely(t == 2)) {
++ size_t offset;
++ const unsigned char *ip_last = ip;
++
+ while (unlikely(*ip == 0)) {
+- t += 255;
+ ip++;
+- NEED_IP(1, 0);
++ NEED_IP(1);
+ }
+- t += 31 + *ip++;
+- NEED_IP(2, 0);
++ offset = ip - ip_last;
++ if (unlikely(offset > MAX_255_COUNT))
++ return LZO_E_ERROR;
++
++ offset = (offset << 8) - offset;
++ t += offset + 31 + *ip++;
++ NEED_IP(2);
+ }
+ m_pos = op - 1;
+ next = get_unaligned_le16(ip);
+@@ -154,13 +158,20 @@ copy_literal_run:
+ m_pos -= (t & 8) << 11;
+ t = (t & 7) + (3 - 1);
+ if (unlikely(t == 2)) {
++ size_t offset;
++ const unsigned char *ip_last = ip;
++
+ while (unlikely(*ip == 0)) {
+- t += 255;
+ ip++;
+- NEED_IP(1, 0);
++ NEED_IP(1);
+ }
+- t += 7 + *ip++;
+- NEED_IP(2, 0);
++ offset = ip - ip_last;
++ if (unlikely(offset > MAX_255_COUNT))
++ return LZO_E_ERROR;
++
++ offset = (offset << 8) - offset;
++ t += offset + 7 + *ip++;
++ NEED_IP(2);
+ }
+ next = get_unaligned_le16(ip);
+ ip += 2;
+@@ -174,7 +185,7 @@ copy_literal_run:
+ #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+ if (op - m_pos >= 8) {
+ unsigned char *oe = op + t;
+- if (likely(HAVE_OP(t, 15))) {
++ if (likely(HAVE_OP(t + 15))) {
+ do {
+ COPY8(op, m_pos);
+ op += 8;
+@@ -184,7 +195,7 @@ copy_literal_run:
+ m_pos += 8;
+ } while (op < oe);
+ op = oe;
+- if (HAVE_IP(6, 0)) {
++ if (HAVE_IP(6)) {
+ state = next;
+ COPY4(op, ip);
+ op += next;
+@@ -192,7 +203,7 @@ copy_literal_run:
+ continue;
+ }
+ } else {
+- NEED_OP(t, 0);
++ NEED_OP(t);
+ do {
+ *op++ = *m_pos++;
+ } while (op < oe);
+@@ -201,7 +212,7 @@ copy_literal_run:
+ #endif
+ {
+ unsigned char *oe = op + t;
+- NEED_OP(t, 0);
++ NEED_OP(t);
+ op[0] = m_pos[0];
+ op[1] = m_pos[1];
+ op += 2;
+@@ -214,15 +225,15 @@ match_next:
+ state = next;
+ t = next;
+ #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
+- if (likely(HAVE_IP(6, 0) && HAVE_OP(4, 0))) {
++ if (likely(HAVE_IP(6) && HAVE_OP(4))) {
+ COPY4(op, ip);
+ op += t;
+ ip += t;
+ } else
+ #endif
+ {
+- NEED_IP(t, 3);
+- NEED_OP(t, 0);
++ NEED_IP(t + 3);
++ NEED_OP(t);
+ while (t > 0) {
+ *op++ = *ip++;
+ t--;
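The lzo1x hunks above replace the per-byte `t += 255` accumulation with counting the run of zero bytes and converting it to a length in one multiply, rejected early via `MAX_255_COUNT` so the multiply cannot overflow. The sketch below models that pattern outside the kernel; `run_length` and its return convention are invented for illustration and are not the decompressor's real interface.

```c
#include <assert.h>
#include <stddef.h>

/* Same guard as the patch: the largest number of 255-steps that fits in
 * a size_t without overflow, minus two to cover the base count. */
#define MAX_255_COUNT ((((size_t)~0) / 255) - 2)

/* Count leading zero bytes and fold them into a length the way the
 * patched decompressor does: offset * 255 == (offset << 8) - offset.
 * Returns bytes consumed (including the terminating non-zero byte),
 * or -1 on input overrun / would-be overflow. */
static int run_length(const unsigned char *ip, size_t avail, size_t *t)
{
	const unsigned char *ip_last = ip;
	size_t offset;

	while (avail && *ip == 0) {
		ip++;
		avail--;
	}
	if (!avail)
		return -1;               /* input overrun */
	offset = (size_t)(ip - ip_last);
	if (offset > MAX_255_COUNT)
		return -1;               /* reject before multiplying */
	*t = (offset << 8) - offset;     /* offset * 255, overflow-free */
	return (int)(offset + 1);
}
```

The pre-patch loop accumulated `t += 255` once per zero byte, so a crafted stream of zeros could wrap `t` and defeat the bounds checks; counting first and bounding the count closes that hole.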
+diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
+index 323f23cd2c37..84c0a21c1cda 100644
+--- a/net/bluetooth/l2cap_core.c
++++ b/net/bluetooth/l2cap_core.c
+@@ -2400,12 +2400,8 @@ static int l2cap_segment_le_sdu(struct l2cap_chan *chan,
+
+ BT_DBG("chan %p, msg %p, len %zu", chan, msg, len);
+
+- pdu_len = chan->conn->mtu - L2CAP_HDR_SIZE;
+-
+- pdu_len = min_t(size_t, pdu_len, chan->remote_mps);
+-
+ sdu_len = len;
+- pdu_len -= L2CAP_SDULEN_SIZE;
++ pdu_len = chan->remote_mps - L2CAP_SDULEN_SIZE;
+
+ while (len > 0) {
+ if (len <= pdu_len)
+diff --git a/net/bluetooth/smp.c b/net/bluetooth/smp.c
+index e33a982161c1..7b7f3de79db9 100644
+--- a/net/bluetooth/smp.c
++++ b/net/bluetooth/smp.c
+@@ -432,8 +432,11 @@ static int tk_request(struct l2cap_conn *conn, u8 remote_oob, u8 auth,
+ }
+
+ /* Not Just Works/Confirm results in MITM Authentication */
+- if (method != JUST_CFM)
++ if (method != JUST_CFM) {
+ set_bit(SMP_FLAG_MITM_AUTH, &smp->flags);
++ if (hcon->pending_sec_level < BT_SECURITY_HIGH)
++ hcon->pending_sec_level = BT_SECURITY_HIGH;
++ }
+
+ /* If both devices have Keyoard-Display I/O, the master
+ * Confirms and the slave Enters the passkey.
+diff --git a/security/integrity/ima/ima_appraise.c b/security/integrity/ima/ima_appraise.c
+index d3113d4aaa3c..bd8cef5b67e4 100644
+--- a/security/integrity/ima/ima_appraise.c
++++ b/security/integrity/ima/ima_appraise.c
+@@ -194,8 +194,11 @@ int ima_appraise_measurement(int func, struct integrity_iint_cache *iint,
+ goto out;
+
+ cause = "missing-hash";
+- status =
+- (inode->i_size == 0) ? INTEGRITY_PASS : INTEGRITY_NOLABEL;
++ status = INTEGRITY_NOLABEL;
++ if (inode->i_size == 0) {
++ iint->flags |= IMA_NEW_FILE;
++ status = INTEGRITY_PASS;
++ }
+ goto out;
+ }
+
+diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
+index ccd0ac8fa9a0..b126a78d5763 100644
+--- a/security/integrity/ima/ima_crypto.c
++++ b/security/integrity/ima/ima_crypto.c
+@@ -40,19 +40,19 @@ static int ima_kernel_read(struct file *file, loff_t offset,
+ {
+ mm_segment_t old_fs;
+ char __user *buf = addr;
+- ssize_t ret;
++ ssize_t ret = -EINVAL;
+
+ if (!(file->f_mode & FMODE_READ))
+ return -EBADF;
+- if (!file->f_op->read && !file->f_op->aio_read)
+- return -EINVAL;
+
+ old_fs = get_fs();
+ set_fs(get_ds());
+ if (file->f_op->read)
+ ret = file->f_op->read(file, buf, count, &offset);
+- else
++ else if (file->f_op->aio_read)
+ ret = do_sync_read(file, buf, count, &offset);
++ else if (file->f_op->read_iter)
++ ret = new_sync_read(file, buf, count, &offset);
+ set_fs(old_fs);
+ return ret;
+ }
+diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
+index 09baa335ebc7..e7745a07146d 100644
+--- a/security/integrity/ima/ima_main.c
++++ b/security/integrity/ima/ima_main.c
+@@ -128,11 +128,13 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
+ return;
+
+ mutex_lock(&inode->i_mutex);
+- if (atomic_read(&inode->i_writecount) == 1 &&
+- iint->version != inode->i_version) {
+- iint->flags &= ~IMA_DONE_MASK;
+- if (iint->flags & IMA_APPRAISE)
+- ima_update_xattr(iint, file);
++ if (atomic_read(&inode->i_writecount) == 1) {
++ if ((iint->version != inode->i_version) ||
++ (iint->flags & IMA_NEW_FILE)) {
++ iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
++ if (iint->flags & IMA_APPRAISE)
++ ima_update_xattr(iint, file);
++ }
+ }
+ mutex_unlock(&inode->i_mutex);
+ }
+diff --git a/security/integrity/integrity.h b/security/integrity/integrity.h
+index 33c0a70f6b15..2f8715d77a5a 100644
+--- a/security/integrity/integrity.h
++++ b/security/integrity/integrity.h
+@@ -31,6 +31,7 @@
+ #define IMA_DIGSIG 0x01000000
+ #define IMA_DIGSIG_REQUIRED 0x02000000
+ #define IMA_PERMIT_DIRECTIO 0x04000000
++#define IMA_NEW_FILE 0x08000000
+
+ #define IMA_DO_MASK (IMA_MEASURE | IMA_APPRAISE | IMA_AUDIT | \
+ IMA_APPRAISE_SUBMASK)
+diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
+index b653ab001fba..39c572806d0d 100644
+--- a/sound/core/pcm_native.c
++++ b/sound/core/pcm_native.c
+@@ -3190,7 +3190,7 @@ static const struct vm_operations_struct snd_pcm_vm_ops_data_fault = {
+
+ #ifndef ARCH_HAS_DMA_MMAP_COHERENT
+ /* This should be defined / handled globally! */
+-#ifdef CONFIG_ARM
++#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+ #define ARCH_HAS_DMA_MMAP_COHERENT
+ #endif
+ #endif
+diff --git a/sound/firewire/bebob/bebob_terratec.c b/sound/firewire/bebob/bebob_terratec.c
+index eef8ea7d9b97..0e4c0bfc463b 100644
+--- a/sound/firewire/bebob/bebob_terratec.c
++++ b/sound/firewire/bebob/bebob_terratec.c
+@@ -17,10 +17,10 @@ phase88_rack_clk_src_get(struct snd_bebob *bebob, unsigned int *id)
+ unsigned int enable_ext, enable_word;
+ int err;
+
+- err = avc_audio_get_selector(bebob->unit, 0, 0, &enable_ext);
++ err = avc_audio_get_selector(bebob->unit, 0, 9, &enable_ext);
+ if (err < 0)
+ goto end;
+- err = avc_audio_get_selector(bebob->unit, 0, 0, &enable_word);
++ err = avc_audio_get_selector(bebob->unit, 0, 8, &enable_word);
+ if (err < 0)
+ goto end;
+
+diff --git a/sound/pci/emu10k1/emu10k1_callback.c b/sound/pci/emu10k1/emu10k1_callback.c
+index 3f3ef38d9b6e..874cd76c7b7f 100644
+--- a/sound/pci/emu10k1/emu10k1_callback.c
++++ b/sound/pci/emu10k1/emu10k1_callback.c
+@@ -85,6 +85,8 @@ snd_emu10k1_ops_setup(struct snd_emux *emux)
+ * get more voice for pcm
+ *
+ * terminate most inactive voice and give it as a pcm voice.
++ *
++ * voice_lock is already held.
+ */
+ int
+ snd_emu10k1_synth_get_voice(struct snd_emu10k1 *hw)
+@@ -92,12 +94,10 @@ snd_emu10k1_synth_get_voice(struct snd_emu10k1 *hw)
+ struct snd_emux *emu;
+ struct snd_emux_voice *vp;
+ struct best_voice best[V_END];
+- unsigned long flags;
+ int i;
+
+ emu = hw->synth;
+
+- spin_lock_irqsave(&emu->voice_lock, flags);
+ lookup_voices(emu, hw, best, 1); /* no OFF voices */
+ for (i = 0; i < V_END; i++) {
+ if (best[i].voice >= 0) {
+@@ -113,11 +113,9 @@ snd_emu10k1_synth_get_voice(struct snd_emu10k1 *hw)
+ vp->emu->num_voices--;
+ vp->ch = -1;
+ vp->state = SNDRV_EMUX_ST_OFF;
+- spin_unlock_irqrestore(&emu->voice_lock, flags);
+ return ch;
+ }
+ }
+- spin_unlock_irqrestore(&emu->voice_lock, flags);
+
+ /* not found */
+ return -ENOMEM;
+diff --git a/sound/pci/hda/hda_local.h b/sound/pci/hda/hda_local.h
+index 4e2d4863daa1..cb06a553b9d9 100644
+--- a/sound/pci/hda/hda_local.h
++++ b/sound/pci/hda/hda_local.h
+@@ -424,7 +424,7 @@ struct snd_hda_pin_quirk {
+ .subvendor = _subvendor,\
+ .name = _name,\
+ .value = _value,\
+- .pins = (const struct hda_pintbl[]) { _pins } \
++ .pins = (const struct hda_pintbl[]) { _pins, {0, 0}} \
+ }
+ #else
+
+@@ -432,7 +432,7 @@ struct snd_hda_pin_quirk {
+ { .codec = _codec,\
+ .subvendor = _subvendor,\
+ .value = _value,\
+- .pins = (const struct hda_pintbl[]) { _pins } \
++ .pins = (const struct hda_pintbl[]) { _pins, {0, 0}} \
+ }
+
+ #endif
+diff --git a/sound/pci/hda/patch_hdmi.c b/sound/pci/hda/patch_hdmi.c
+index ba4ca52072ff..ddd825bce575 100644
+--- a/sound/pci/hda/patch_hdmi.c
++++ b/sound/pci/hda/patch_hdmi.c
+@@ -1574,19 +1574,22 @@ static bool hdmi_present_sense(struct hdmi_spec_per_pin *per_pin, int repoll)
+ }
+ }
+
+- if (pin_eld->eld_valid && !eld->eld_valid) {
+- update_eld = true;
++ if (pin_eld->eld_valid != eld->eld_valid)
+ eld_changed = true;
+- }
++
++ if (pin_eld->eld_valid && !eld->eld_valid)
++ update_eld = true;
++
+ if (update_eld) {
+ bool old_eld_valid = pin_eld->eld_valid;
+ pin_eld->eld_valid = eld->eld_valid;
+- eld_changed = pin_eld->eld_size != eld->eld_size ||
++ if (pin_eld->eld_size != eld->eld_size ||
+ memcmp(pin_eld->eld_buffer, eld->eld_buffer,
+- eld->eld_size) != 0;
+- if (eld_changed)
++ eld->eld_size) != 0) {
+ memcpy(pin_eld->eld_buffer, eld->eld_buffer,
+ eld->eld_size);
++ eld_changed = true;
++ }
+ pin_eld->eld_size = eld->eld_size;
+ pin_eld->info = eld->info;
+
+diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c
+index 88e4623d4f97..c8bf72832731 100644
+--- a/sound/pci/hda/patch_realtek.c
++++ b/sound/pci/hda/patch_realtek.c
+@@ -3103,6 +3103,9 @@ static void alc283_shutup(struct hda_codec *codec)
+
+ alc_write_coef_idx(codec, 0x43, 0x9004);
+
++ /*depop hp during suspend*/
++ alc_write_coef_idx(codec, 0x06, 0x2100);
++
+ snd_hda_codec_write(codec, hp_pin, 0,
+ AC_VERB_SET_AMP_GAIN_MUTE, AMP_OUT_MUTE);
+
+@@ -5575,9 +5578,9 @@ static void alc662_led_gpio1_mute_hook(void *private_data, int enabled)
+ unsigned int oldval = spec->gpio_led;
+
+ if (enabled)
+- spec->gpio_led &= ~0x01;
+- else
+ spec->gpio_led |= 0x01;
++ else
++ spec->gpio_led &= ~0x01;
+ if (spec->gpio_led != oldval)
+ snd_hda_codec_write(codec, 0x01, 0, AC_VERB_SET_GPIO_DATA,
+ spec->gpio_led);
+diff --git a/sound/usb/quirks-table.h b/sound/usb/quirks-table.h
+index 223c47b33ba3..c657752a420c 100644
+--- a/sound/usb/quirks-table.h
++++ b/sound/usb/quirks-table.h
+@@ -385,6 +385,36 @@ YAMAHA_DEVICE(0x105d, NULL),
+ }
+ },
+ {
++ USB_DEVICE(0x0499, 0x1509),
++ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
++ /* .vendor_name = "Yamaha", */
++ /* .product_name = "Steinberg UR22", */
++ .ifnum = QUIRK_ANY_INTERFACE,
++ .type = QUIRK_COMPOSITE,
++ .data = (const struct snd_usb_audio_quirk[]) {
++ {
++ .ifnum = 1,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 2,
++ .type = QUIRK_AUDIO_STANDARD_INTERFACE
++ },
++ {
++ .ifnum = 3,
++ .type = QUIRK_MIDI_YAMAHA
++ },
++ {
++ .ifnum = 4,
++ .type = QUIRK_IGNORE_INTERFACE
++ },
++ {
++ .ifnum = -1
++ }
++ }
++ }
++},
++{
+ USB_DEVICE(0x0499, 0x150a),
+ .driver_info = (unsigned long) & (const struct snd_usb_audio_quirk) {
+ /* .vendor_name = "Yamaha", */
+diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
+index 4b6c01b477f9..438851c2a797 100644
+--- a/virt/kvm/kvm_main.c
++++ b/virt/kvm/kvm_main.c
+@@ -52,6 +52,7 @@
+
+ #include <asm/processor.h>
+ #include <asm/io.h>
++#include <asm/ioctl.h>
+ #include <asm/uaccess.h>
+ #include <asm/pgtable.h>
+
+@@ -95,8 +96,6 @@ static int hardware_enable_all(void);
+ static void hardware_disable_all(void);
+
+ static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
+-static void update_memslots(struct kvm_memslots *slots,
+- struct kvm_memory_slot *new, u64 last_generation);
+
+ static void kvm_release_pfn_dirty(pfn_t pfn);
+ static void mark_page_dirty_in_slot(struct kvm *kvm,
+@@ -474,6 +473,13 @@ static struct kvm *kvm_create_vm(unsigned long type)
+ kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
+ if (!kvm->memslots)
+ goto out_err_no_srcu;
++
++ /*
++ * Init kvm generation close to the maximum to easily test the
++ * code of handling generation number wrap-around.
++ */
++ kvm->memslots->generation = -150;
++
+ kvm_init_memslots_id(kvm);
+ if (init_srcu_struct(&kvm->srcu))
+ goto out_err_no_srcu;
+@@ -685,8 +691,7 @@ static void sort_memslots(struct kvm_memslots *slots)
+ }
+
+ static void update_memslots(struct kvm_memslots *slots,
+- struct kvm_memory_slot *new,
+- u64 last_generation)
++ struct kvm_memory_slot *new)
+ {
+ if (new) {
+ int id = new->id;
+@@ -697,8 +702,6 @@ static void update_memslots(struct kvm_memslots *slots,
+ if (new->npages != npages)
+ sort_memslots(slots);
+ }
+-
+- slots->generation = last_generation + 1;
+ }
+
+ static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
+@@ -720,10 +723,24 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
+ {
+ struct kvm_memslots *old_memslots = kvm->memslots;
+
+- update_memslots(slots, new, kvm->memslots->generation);
++ /*
++ * Set the low bit in the generation, which disables SPTE caching
++ * until the end of synchronize_srcu_expedited.
++ */
++ WARN_ON(old_memslots->generation & 1);
++ slots->generation = old_memslots->generation + 1;
++
++ update_memslots(slots, new);
+ rcu_assign_pointer(kvm->memslots, slots);
+ synchronize_srcu_expedited(&kvm->srcu);
+
++ /*
++ * Increment the new memslot generation a second time. This prevents
++ * vm exits that race with memslot updates from caching a memslot
++ * generation that will (potentially) be valid forever.
++ */
++ slots->generation++;
++
+ kvm_arch_memslots_updated(kvm);
+
+ return old_memslots;
+@@ -1973,6 +1990,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
+ if (vcpu->kvm->mm != current->mm)
+ return -EIO;
+
++ if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
++ return -EINVAL;
++
+ #if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)
+ /*
+ * Special cases: vcpu ioctls that are asynchronous to vcpu execution,
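The `install_new_memslots` hunk above uses the low bit of the memslot generation as an "update in progress" marker: the generation is made odd before the SRCU grace period and bumped again afterwards, so a VM exit that races with the update can never cache a generation that stays valid. A toy model of that sequence, assuming nothing beyond the arithmetic shown in the hunk (the SRCU machinery is elided as a comment):

```c
#include <assert.h>

struct memslots { unsigned long long generation; };

/* Toy model of the patched install_new_memslots() generation dance. */
static void install(struct memslots *old, struct memslots *new_slots)
{
	assert((old->generation & 1) == 0);          /* WARN_ON in the patch */
	new_slots->generation = old->generation + 1; /* odd: caching disabled */
	/* ... rcu_assign_pointer() + synchronize_srcu_expedited() here ... */
	new_slots->generation++;                     /* even again, +2 overall */
}
```

Any lookup that observed the odd intermediate value sees a generation that no later memslots object will ever carry, which is what makes the stale-cache race benign.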
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:05 Mike Pagano
0 siblings, 0 replies; 26+ messages in thread
From: Mike Pagano @ 2014-11-29 18:05 UTC (permalink / raw
To: gentoo-commits
commit: 41cf3e1a269f2ff1d94992251fbc4e65e0c35417
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:03:46 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:03:46 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=41cf3e1a
Bump BFQ patchset to v7r6-3.16
---
...-cgroups-kconfig-build-bits-for-v7r6-3.16.patch | 6 +-
...ck-introduce-the-v7r6-I-O-sched-for-3.17.patch1 | 421 ++++++++++++++++++---
...add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch | 194 ++++++----
3 files changed, 474 insertions(+), 147 deletions(-)
diff --git a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
similarity index 97%
rename from 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
rename to 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
index 088bd05..7f6a5f4 100644
--- a/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+++ b/5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
@@ -1,7 +1,7 @@
-From 6519e5beef1063a86d3fc917cff2592cb599e824 Mon Sep 17 00:00:00 2001
+From 92ef290b97a50b9d60eb928166413140cd7a4802 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@unimore.it>
Date: Thu, 22 May 2014 11:59:35 +0200
-Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r5-3.16
+Subject: [PATCH 1/3] block: cgroups, kconfig, build bits for BFQ-v7r6-3.16
Update Kconfig.iosched and do the related Makefile changes to include
kernel configuration options for BFQ. Also add the bfqio controller
@@ -100,5 +100,5 @@ index 98c4f9b..13b010d 100644
SUBSYS(perf_event)
#endif
--
-2.0.3
+2.1.2
diff --git a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1 b/5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.17.patch1
similarity index 92%
rename from 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
rename to 5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.17.patch1
index 6f630ba..7ae3298 100644
--- a/5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+++ b/5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.17.patch1
@@ -1,9 +1,9 @@
-From c56e6c5db41f7137d3e0b38063ef0c944eec1898 Mon Sep 17 00:00:00 2001
+From e4fcd78909604194d930e38874a9313090b80348 Mon Sep 17 00:00:00 2001
From: Paolo Valente <paolo.valente@unimore.it>
Date: Thu, 9 May 2013 19:10:02 +0200
-Subject: [PATCH 2/3] block: introduce the BFQ-v7r5 I/O sched for 3.16
+Subject: [PATCH 2/3] block: introduce the BFQ-v7r6 I/O sched for 3.16
-Add the BFQ-v7r5 I/O scheduler to 3.16.
+Add the BFQ-v7r6 I/O scheduler to 3.16.
The general structure is borrowed from CFQ, as much of the code for
handling I/O contexts. Over time, several useful features have been
ported from CFQ as well (details in the changelog in README.BFQ). A
@@ -56,12 +56,12 @@ until it expires.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
---
- block/bfq-cgroup.c | 930 +++++++++++++
+ block/bfq-cgroup.c | 930 ++++++++++++
block/bfq-ioc.c | 36 +
- block/bfq-iosched.c | 3617 +++++++++++++++++++++++++++++++++++++++++++++++++++
- block/bfq-sched.c | 1207 +++++++++++++++++
- block/bfq.h | 742 +++++++++++
- 5 files changed, 6532 insertions(+)
+ block/bfq-iosched.c | 3887 +++++++++++++++++++++++++++++++++++++++++++++++++++
+ block/bfq-sched.c | 1207 ++++++++++++++++
+ block/bfq.h | 773 ++++++++++
+ 5 files changed, 6833 insertions(+)
create mode 100644 block/bfq-cgroup.c
create mode 100644 block/bfq-ioc.c
create mode 100644 block/bfq-iosched.c
@@ -1048,10 +1048,10 @@ index 0000000..7f6b000
+}
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
new file mode 100644
-index 0000000..0a0891b
+index 0000000..b919b03
--- /dev/null
+++ b/block/bfq-iosched.c
-@@ -0,0 +1,3617 @@
+@@ -0,0 +1,3887 @@
+/*
+ * Budget Fair Queueing (BFQ) disk scheduler.
+ *
@@ -1625,6 +1625,220 @@ index 0000000..0a0891b
+ return dur;
+}
+
++/* Empty burst list and add just bfqq (see comments to bfq_handle_burst) */
++static inline void bfq_reset_burst_list(struct bfq_data *bfqd,
++ struct bfq_queue *bfqq)
++{
++ struct bfq_queue *item;
++ struct hlist_node *n;
++
++ hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
++ hlist_del_init(&item->burst_list_node);
++ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
++ bfqd->burst_size = 1;
++}
++
++/* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */
++static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
++{
++ /* Increment burst size to take into account also bfqq */
++ bfqd->burst_size++;
++
++ if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) {
++ struct bfq_queue *pos, *bfqq_item;
++ struct hlist_node *n;
++
++ /*
++ * Enough queues have been activated shortly after each
++ * other to consider this burst as large.
++ */
++ bfqd->large_burst = true;
++
++ /*
++ * We can now mark all queues in the burst list as
++ * belonging to a large burst.
++ */
++ hlist_for_each_entry(bfqq_item, &bfqd->burst_list,
++ burst_list_node)
++ bfq_mark_bfqq_in_large_burst(bfqq_item);
++ bfq_mark_bfqq_in_large_burst(bfqq);
++
++ /*
++ * From now on, and until the current burst finishes, any
++ * new queue being activated shortly after the last queue
++ * was inserted in the burst can be immediately marked as
++ * belonging to a large burst. So the burst list is not
++ * needed any more. Remove it.
++ */
++ hlist_for_each_entry_safe(pos, n, &bfqd->burst_list,
++ burst_list_node)
++ hlist_del_init(&pos->burst_list_node);
++ } else /* burst not yet large: add bfqq to the burst list */
++ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
++}
++
++/*
++ * If many queues happen to become active shortly after each other, then,
++ * to help the processes associated to these queues get their job done as
++ * soon as possible, it is usually better to not grant either weight-raising
++ * or device idling to these queues. In this comment we describe, firstly,
++ * the reasons why this fact holds, and, secondly, the next function, which
++ * implements the main steps needed to properly mark these queues so that
++ * they can then be treated in a different way.
++ *
++ * As for the terminology, we say that a queue becomes active, i.e.,
++ * switches from idle to backlogged, either when it is created (as a
++ * consequence of the arrival of an I/O request), or, if already existing,
++ * when a new request for the queue arrives while the queue is idle.
++ * Bursts of activations, i.e., activations of different queues occurring
++ * shortly after each other, are typically caused by services or applications
++ * that spawn or reactivate many parallel threads/processes. Examples are
++ * systemd during boot or git grep.
++ *
++ * These services or applications benefit mostly from a high throughput:
++ * the quicker the requests of the activated queues are cumulatively served,
++ * the sooner the target job of these queues gets completed. As a consequence,
++ * weight-raising any of these queues, which also implies idling the device
++ * for it, is almost always counterproductive: in most cases it just lowers
++ * throughput.
++ *
++ * On the other hand, a burst of activations may be also caused by the start
++ * of an application that does not consist in a lot of parallel I/O-bound
++ * threads. In fact, with a complex application, the burst may be just a
++ * consequence of the fact that several processes need to be executed to
++ * start-up the application. To start an application as quickly as possible,
++ * the best thing to do is to privilege the I/O related to the application
++ * with respect to all other I/O. Therefore, the best strategy to start as
++ * quickly as possible an application that causes a burst of activations is
++ * to weight-raise all the queues activated during the burst. This is the
++ * exact opposite of the best strategy for the other type of bursts.
++ *
++ * In the end, to take the best action for each of the two cases, the two
++ * types of bursts need to be distinguished. Fortunately, this seems
++ * relatively easy to do, by looking at the sizes of the bursts. In
++ * particular, we found a threshold such that bursts with a larger size
++ * than that threshold are apparently caused only by services or commands
++ * such as systemd or git grep. For brevity, hereafter we call just 'large'
++ * these bursts. BFQ *does not* weight-raise queues whose activations occur
++ * in a large burst. In addition, for each of these queues BFQ performs or
++ * does not perform idling depending on which choice boosts the throughput
++ * most. The exact choice depends on the device and request pattern at
++ * hand.
++ *
++ * Turning back to the next function, it implements all the steps needed
++ * to detect the occurrence of a large burst and to properly mark all the
++ * queues belonging to it (so that they can then be treated in a different
++ * way). This goal is achieved by maintaining a special "burst list" that
++ * holds, temporarily, the queues that belong to the burst in progress. The
++ * list is then used to mark these queues as belonging to a large burst if
++ * the burst does become large. The main steps are the following.
++ *
++ * . when the very first queue is activated, the queue is inserted into the
++ * list (as it could be the first queue in a possible burst)
++ *
++ * . if the current burst has not yet become large, and a queue Q that does
++ * not yet belong to the burst is activated shortly after the last time
++ * at which a new queue entered the burst list, then the function appends
++ * Q to the burst list
++ *
++ * . if, as a consequence of the previous step, the burst size reaches
++ * the large-burst threshold, then
++ *
++ * . all the queues in the burst list are marked as belonging to a
++ * large burst
++ *
++ * . the burst list is deleted; in fact, the burst list already served
++ * its purpose (keeping temporarily track of the queues in a burst,
++ * so as to be able to mark them as belonging to a large burst in the
++ * previous sub-step), and now is not needed any more
++ *
++ * . the device enters a large-burst mode
++ *
++ * . if a queue Q that does not belong to the burst is activated while
++ * the device is in large-burst mode and shortly after the last time
++ * at which a queue either entered the burst list or was marked as
++ * belonging to the current large burst, then Q is immediately marked
++ * as belonging to a large burst.
++ *
++ * . if a queue Q that does not belong to the burst is activated a while
++ * later, i.e., not shortly after, than the last time at which a queue
++ * either entered the burst list or was marked as belonging to the
++ * current large burst, then the current burst is deemed as finished and:
++ *
++ * . the large-burst mode is reset if set
++ *
++ * . the burst list is emptied
++ *
++ * . Q is inserted in the burst list, as Q may be the first queue
++ * in a possible new burst (then the burst list contains just Q
++ * after this step).
++ */
++static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq,
++ bool idle_for_long_time)
++{
++ /*
++ * If bfqq happened to be activated in a burst, but has been idle
++ * for at least as long as an interactive queue, then we assume
++ * that, in the overall I/O initiated in the burst, the I/O
++ * associated to bfqq is finished. So bfqq does not need to be
++ * treated as a queue belonging to a burst anymore. Accordingly,
++ * we reset bfqq's in_large_burst flag if set, and remove bfqq
++ * from the burst list if it's there. We do not decrement instead
++ * burst_size, because the fact that bfqq does not need to belong
++ * to the burst list any more does not invalidate the fact that
++ * bfqq may have been activated during the current burst.
++ */
++ if (idle_for_long_time) {
++ hlist_del_init(&bfqq->burst_list_node);
++ bfq_clear_bfqq_in_large_burst(bfqq);
++ }
++
++ /*
++ * If bfqq is already in the burst list or is part of a large
++ * burst, then there is nothing else to do.
++ */
++ if (!hlist_unhashed(&bfqq->burst_list_node) ||
++ bfq_bfqq_in_large_burst(bfqq))
++ return;
++
++ /*
++ * If bfqq's activation happens late enough, then the current
++ * burst is finished, and related data structures must be reset.
++ *
++ * In this respect, consider the special case where bfqq is the very
++ * first queue being activated. In this case, last_ins_in_burst is
++ * not yet significant when we get here. But it is easy to verify
++ * that, whether or not the following condition is true, bfqq will
++ * end up being inserted into the burst list. In particular the
++ * list will happen to contain only bfqq. And this is exactly what
++ * has to happen, as bfqq may be the first queue in a possible
++ * burst.
++ */
++ if (time_is_before_jiffies(bfqd->last_ins_in_burst +
++ bfqd->bfq_burst_interval)) {
++ bfqd->large_burst = false;
++ bfq_reset_burst_list(bfqd, bfqq);
++ return;
++ }
++
++ /*
++ * If we get here, then bfqq is being activated shortly after the
++ * last queue. So, if the current burst is also large, we can mark
++ * bfqq as belonging to this large burst immediately.
++ */
++ if (bfqd->large_burst) {
++ bfq_mark_bfqq_in_large_burst(bfqq);
++ return;
++ }
++
++ /*
++ * If we get here, then a large-burst state has not yet been
++ * reached, but bfqq is being activated shortly after the last
++ * queue. Then we add bfqq to the burst.
++ */
++ bfq_add_to_burst(bfqd, bfqq);
++}
++
+static void bfq_add_request(struct request *rq)
+{
+ struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -1632,7 +1846,7 @@ index 0000000..0a0891b
+ struct bfq_data *bfqd = bfqq->bfqd;
+ struct request *next_rq, *prev;
+ unsigned long old_wr_coeff = bfqq->wr_coeff;
-+ int idle_for_long_time = 0;
++ bool interactive = false;
+
+ bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));
+ bfqq->queued[rq_is_sync(rq)]++;
@@ -1655,11 +1869,35 @@ index 0000000..0a0891b
+ bfq_rq_pos_tree_add(bfqd, bfqq);
+
+ if (!bfq_bfqq_busy(bfqq)) {
-+ int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ bool soft_rt,
++ idle_for_long_time = time_is_before_jiffies(
++ bfqq->budget_timeout +
++ bfqd->bfq_wr_min_idle_time);
++
++ if (bfq_bfqq_sync(bfqq)) {
++ bool already_in_burst =
++ !hlist_unhashed(&bfqq->burst_list_node) ||
++ bfq_bfqq_in_large_burst(bfqq);
++ bfq_handle_burst(bfqd, bfqq, idle_for_long_time);
++ /*
++ * If bfqq was not already in the current burst,
++ * then, at this point, bfqq either has been
++ * added to the current burst or has caused the
++ * current burst to terminate. In particular, in
++ * the second case, bfqq has become the first
++ * queue in a possible new burst.
++ * In both cases last_ins_in_burst needs to be
++ * moved forward.
++ */
++ if (!already_in_burst)
++ bfqd->last_ins_in_burst = jiffies;
++ }
++
++ soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
++ !bfq_bfqq_in_large_burst(bfqq) &&
+ time_is_before_jiffies(bfqq->soft_rt_next_start);
-+ idle_for_long_time = time_is_before_jiffies(
-+ bfqq->budget_timeout +
-+ bfqd->bfq_wr_min_idle_time);
++ interactive = !bfq_bfqq_in_large_burst(bfqq) &&
++ idle_for_long_time;
+ entity->budget = max_t(unsigned long, bfqq->max_budget,
+ bfq_serv_to_charge(next_rq, bfqq));
+
@@ -1682,9 +1920,9 @@ index 0000000..0a0891b
+ * If the queue is not being boosted and has been idle
+ * for enough time, start a weight-raising period
+ */
-+ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
++ if (old_wr_coeff == 1 && (interactive || soft_rt)) {
+ bfqq->wr_coeff = bfqd->bfq_wr_coeff;
-+ if (idle_for_long_time)
++ if (interactive)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
+ else
+ bfqq->wr_cur_max_time =
@@ -1694,11 +1932,12 @@ index 0000000..0a0891b
+ jiffies,
+ jiffies_to_msecs(bfqq->wr_cur_max_time));
+ } else if (old_wr_coeff > 1) {
-+ if (idle_for_long_time)
++ if (interactive)
+ bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-+ else if (bfqq->wr_cur_max_time ==
-+ bfqd->bfq_wr_rt_max_time &&
-+ !soft_rt) {
++ else if (bfq_bfqq_in_large_burst(bfqq) ||
++ (bfqq->wr_cur_max_time ==
++ bfqd->bfq_wr_rt_max_time &&
++ !soft_rt)) {
+ bfqq->wr_coeff = 1;
+ bfq_log_bfqq(bfqd, bfqq,
+ "wrais ending at %lu, rais_max_time %u",
@@ -1787,8 +2026,7 @@ index 0000000..0a0891b
+ }
+
+ if (bfqd->low_latency &&
-+ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 ||
-+ idle_for_long_time))
++ (old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))
+ bfqq->last_wr_start_finish = jiffies;
+}
+
@@ -2291,9 +2529,7 @@ index 0000000..0a0891b
+ return rq;
+}
+
-+/*
-+ * Must be called with the queue_lock held.
-+ */
++/* Must be called with the queue_lock held. */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+ int process_refs, io_refs;
@@ -2896,16 +3132,26 @@ index 0000000..0a0891b
+ * long comment, we try to briefly describe all the details and motivations
+ * behind the components of this logical expression.
+ *
-+ * First, the expression may be true only for sync queues. Besides, if
-+ * bfqq is also being weight-raised, then the expression always evaluates
-+ * to true, as device idling is instrumental for preserving low-latency
-+ * guarantees (see [1]). Otherwise, the expression evaluates to true only
-+ * if bfqq has a non-null idle window and at least one of the following
-+ * two conditions holds. The first condition is that the device is not
-+ * performing NCQ, because idling the device most certainly boosts the
-+ * throughput if this condition holds and bfqq has been granted a non-null
-+ * idle window. The second compound condition is made of the logical AND of
-+ * two components.
++ * First, the expression is false if bfqq is not sync, or if: bfqq happened
++ * to become active during a large burst of queue activations, and the
++ * pattern of requests bfqq contains boosts the throughput if bfqq is
++ * expired. In fact, queues that became active during a large burst benefit
++ * only from throughput, as discussed in the comments to bfq_handle_burst.
++ * In this respect, expiring bfqq certainly boosts the throughput on NCQ-
++ * capable flash-based devices, whereas, on rotational devices, it boosts
++ * the throughput only if bfqq contains random requests.
++ *
++ * On the opposite end, if (a) bfqq is sync, (b) the above burst-related
++ * condition does not hold, and (c) bfqq is being weight-raised, then the
++ * expression always evaluates to true, as device idling is instrumental
++ * for preserving low-latency guarantees (see [1]). If, instead, conditions
++ * (a) and (b) do hold, but (c) does not, then the expression evaluates to
++ * true only if: (1) bfqq is I/O-bound and has a non-null idle window, and
++ * (2) at least one of the following two conditions holds.
++ * The first condition is that the device is not performing NCQ, because
++ * idling the device most certainly boosts the throughput if this condition
++ * holds and bfqq is I/O-bound and has been granted a non-null idle window.
++ * The second compound condition is made of the logical AND of two components.
+ *
+ * The first component is true only if there is no weight-raised busy
+ * queue. This guarantees that the device is not idled for a sync non-
@@ -3022,6 +3268,12 @@ index 0000000..0a0891b
+#define cond_for_seeky_on_ncq_hdd (bfq_bfqq_constantly_seeky(bfqq) && \
+ bfqd->busy_in_flight_queues == \
+ bfqd->const_seeky_busy_in_flight_queues)
++
++#define cond_for_expiring_in_burst (bfq_bfqq_in_large_burst(bfqq) && \
++ bfqd->hw_tag && \
++ (blk_queue_nonrot(bfqd->queue) || \
++ bfq_bfqq_constantly_seeky(bfqq)))
++
+/*
+ * Condition for expiring a non-weight-raised queue (and hence not idling
+ * the device).
@@ -3033,9 +3285,9 @@ index 0000000..0a0891b
+ cond_for_seeky_on_ncq_hdd))))
+
+ return bfq_bfqq_sync(bfqq) &&
-+ (bfq_bfqq_IO_bound(bfqq) || bfqq->wr_coeff > 1) &&
++ !cond_for_expiring_in_burst &&
+ (bfqq->wr_coeff > 1 ||
-+ (bfq_bfqq_idle_window(bfqq) &&
++ (bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_idle_window(bfqq) &&
+ !cond_for_expiring_non_wr)
+ );
+}
@@ -3179,10 +3431,12 @@ index 0000000..0a0891b
+ if (entity->ioprio_changed)
+ bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
+ /*
-+ * If too much time has elapsed from the beginning
-+ * of this weight-raising, stop it.
++ * If the queue was activated in a burst, or
++ * too much time has elapsed from the beginning
++ * of this weight-raising, then end weight raising.
+ */
-+ if (time_is_before_jiffies(bfqq->last_wr_start_finish +
++ if (bfq_bfqq_in_large_burst(bfqq) ||
++ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ bfqq->wr_cur_max_time)) {
+ bfqq->last_wr_start_finish = jiffies;
+ bfq_log_bfqq(bfqd, bfqq,
@@ -3387,6 +3641,17 @@ index 0000000..0a0891b
+ BUG_ON(bfq_bfqq_busy(bfqq));
+ BUG_ON(bfqd->in_service_queue == bfqq);
+
++ if (bfq_bfqq_sync(bfqq))
++ /*
++ * The fact that this queue is being destroyed does not
++ * invalidate the fact that this queue may have been
++ * activated during the current burst. As a consequence,
++ * although the queue does not exist anymore, and hence
++ * needs to be removed from the burst list if there,
++ * the burst size has not to be decremented.
++ */
++ hlist_del_init(&bfqq->burst_list_node);
++
+ bfq_log_bfqq(bfqd, bfqq, "put_queue: %p freed", bfqq);
+
+ kmem_cache_free(bfq_pool, bfqq);
@@ -3540,6 +3805,7 @@ index 0000000..0a0891b
+{
+ RB_CLEAR_NODE(&bfqq->entity.rb_node);
+ INIT_LIST_HEAD(&bfqq->fifo);
++ INIT_HLIST_NODE(&bfqq->burst_list_node);
+
+ atomic_set(&bfqq->ref, 0);
+ bfqq->bfqd = bfqd;
@@ -4298,6 +4564,7 @@ index 0000000..0a0891b
+
+ INIT_LIST_HEAD(&bfqd->active_list);
+ INIT_LIST_HEAD(&bfqd->idle_list);
++ INIT_HLIST_HEAD(&bfqd->burst_list);
+
+ bfqd->hw_tag = -1;
+
@@ -4318,6 +4585,9 @@ index 0000000..0a0891b
+ bfqd->bfq_failed_cooperations = 7000;
+ bfqd->bfq_requests_within_timer = 120;
+
++ bfqd->bfq_large_burst_thresh = 11;
++ bfqd->bfq_burst_interval = msecs_to_jiffies(500);
++
+ bfqd->low_latency = true;
+
+ bfqd->bfq_wr_coeff = 20;
@@ -4653,7 +4923,7 @@ index 0000000..0a0891b
+ device_speed_thresh[1] = (R_fast[1] + R_slow[1]) / 2;
+
+ elv_register(&iosched_bfq);
-+ pr_info("BFQ I/O-scheduler version: v7r5");
++ pr_info("BFQ I/O-scheduler version: v7r6");
+
+ return 0;
+}
@@ -5884,12 +6154,12 @@ index 0000000..c4831b7
+}
diff --git a/block/bfq.h b/block/bfq.h
new file mode 100644
-index 0000000..a83e69d
+index 0000000..0378c86
--- /dev/null
+++ b/block/bfq.h
-@@ -0,0 +1,742 @@
+@@ -0,0 +1,773 @@
+/*
-+ * BFQ-v7r5 for 3.16.0: data structures and common functions prototypes.
++ * BFQ-v7r6 for 3.16.0: data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
@@ -6086,6 +6356,7 @@ index 0000000..a83e69d
+ * @dispatched: number of requests on the dispatch list or inside driver.
+ * @flags: status flags.
+ * @bfqq_list: node for active/idle bfqq list inside our bfqd.
++ * @burst_list_node: node for the device's burst list.
+ * @seek_samples: number of seeks sampled
+ * @seek_total: sum of the distances of the seeks sampled
+ * @seek_mean: mean seek distance
@@ -6146,6 +6417,8 @@ index 0000000..a83e69d
+
+ struct list_head bfqq_list;
+
++ struct hlist_node burst_list_node;
++
+ unsigned int seek_samples;
+ u64 seek_total;
+ sector_t seek_mean;
@@ -6298,22 +6571,38 @@ index 0000000..a83e69d
+ * again idling to a queue which was marked as
+ * non-I/O-bound (see the definition of the
+ * IO_bound flag for further details).
-+ * @bfq_wr_coeff: Maximum factor by which the weight of a weight-raised
-+ * queue is multiplied
-+ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies)
-+ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes
++ * @last_ins_in_burst: last time at which a queue entered the current
++ * burst of queues being activated shortly after
++ * each other; for more details about this and the
++ * following parameters related to a burst of
++ * activations, see the comments to the function
++ * @bfq_handle_burst.
++ * @bfq_burst_interval: reference time interval used to decide whether a
++ * queue has been activated shortly after
++ * @last_ins_in_burst.
++ * @burst_size: number of queues in the current burst of queue activations.
++ * @bfq_large_burst_thresh: maximum burst size above which the current
++ * queue-activation burst is deemed as 'large'.
++ * @large_burst: true if a large queue-activation burst is in progress.
++ * @burst_list: head of the burst list (as for the above fields, more details
++ * in the comments to the function bfq_handle_burst).
++ * @low_latency: if set to true, low-latency heuristics are enabled.
++ * @bfq_wr_coeff: maximum factor by which the weight of a weight-raised
++ * queue is multiplied.
++ * @bfq_wr_max_time: maximum duration of a weight-raising period (jiffies).
++ * @bfq_wr_rt_max_time: maximum duration for soft real-time processes.
+ * @bfq_wr_min_idle_time: minimum idle period after which weight-raising
-+ * may be reactivated for a queue (in jiffies)
++ * may be reactivated for a queue (in jiffies).
+ * @bfq_wr_min_inter_arr_async: minimum period between request arrivals
+ * after which weight-raising may be
+ * reactivated for an already busy queue
-+ * (in jiffies)
++ * (in jiffies).
+ * @bfq_wr_max_softrt_rate: max service-rate for a soft real-time queue,
-+ * sectors per seconds
++ * sectors per seconds.
+ * @RT_prod: cached value of the product R*T used for computing the maximum
-+ * duration of the weight raising automatically
-+ * @device_speed: device-speed class for the low-latency heuristic
-+ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions
++ * duration of the weight raising automatically.
++ * @device_speed: device-speed class for the low-latency heuristic.
++ * @oom_bfqq: fallback dummy bfqq for extreme OOM conditions.
+ *
+ * All the fields are protected by the @queue lock.
+ */
@@ -6377,6 +6666,13 @@ index 0000000..a83e69d
+ unsigned int bfq_failed_cooperations;
+ unsigned int bfq_requests_within_timer;
+
++ unsigned long last_ins_in_burst;
++ unsigned long bfq_burst_interval;
++ int burst_size;
++ unsigned long bfq_large_burst_thresh;
++ bool large_burst;
++ struct hlist_head burst_list;
++
+ bool low_latency;
+
+ /* parameters of the low_latency heuristics */
@@ -6406,6 +6702,10 @@ index 0000000..a83e69d
+ * having consumed at most 2/10 of
+ * its budget
+ */
++ BFQ_BFQQ_FLAG_in_large_burst, /*
++ * bfqq activated in a large burst,
++ * see comments to bfq_handle_burst.
++ */
+ BFQ_BFQQ_FLAG_constantly_seeky, /*
+ * bfqq has proved to be slow and
+ * seeky until budget timeout
@@ -6441,6 +6741,7 @@ index 0000000..a83e69d
+BFQ_BFQQ_FNS(sync);
+BFQ_BFQQ_FNS(budget_new);
+BFQ_BFQQ_FNS(IO_bound);
++BFQ_BFQQ_FNS(in_large_burst);
+BFQ_BFQQ_FNS(constantly_seeky);
+BFQ_BFQQ_FNS(coop);
+BFQ_BFQQ_FNS(split_coop);
@@ -6561,15 +6862,15 @@ index 0000000..a83e69d
+}
+
+static inline struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic,
-+ int is_sync)
++ bool is_sync)
+{
-+ return bic->bfqq[!!is_sync];
++ return bic->bfqq[is_sync];
+}
+
+static inline void bic_set_bfqq(struct bfq_io_cq *bic,
-+ struct bfq_queue *bfqq, int is_sync)
++ struct bfq_queue *bfqq, bool is_sync)
+{
-+ bic->bfqq[!!is_sync] = bfqq;
++ bic->bfqq[is_sync] = bfqq;
+}
+
+static inline struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
@@ -6631,5 +6932,5 @@ index 0000000..a83e69d
+
+#endif /* _BFQ_H */
--
-2.0.3
+2.1.2
diff --git a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
similarity index 87%
rename from 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
rename to 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
index e606f5d..53e7c76 100644
--- a/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+++ b/5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
@@ -1,7 +1,7 @@
-From 5b290be286aa74051b4b77a216032b771ceadd23 Mon Sep 17 00:00:00 2001
+From 5428334e0390ccad40fa21dd046eb163025a4f74 Mon Sep 17 00:00:00 2001
From: Mauro Andreolini <mauro.andreolini@unimore.it>
-Date: Wed, 18 Jun 2014 17:38:07 +0200
-Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r5 for
+Date: Sun, 19 Oct 2014 01:15:59 +0200
+Subject: [PATCH 3/3] block, bfq: add Early Queue Merge (EQM) to BFQ-v7r6 for
3.16.0
A set of processes may happen to perform interleaved reads, i.e.,requests
@@ -34,13 +34,13 @@ Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
---
- block/bfq-iosched.c | 736 ++++++++++++++++++++++++++++++++++++----------------
+ block/bfq-iosched.c | 743 +++++++++++++++++++++++++++++++++++++---------------
block/bfq-sched.c | 28 --
- block/bfq.h | 46 +++-
- 3 files changed, 556 insertions(+), 254 deletions(-)
+ block/bfq.h | 54 +++-
+ 3 files changed, 573 insertions(+), 252 deletions(-)
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
-index 0a0891b..d1d8e67 100644
+index b919b03..bbfb4e1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -571,6 +571,57 @@ static inline unsigned int bfq_wr_duration(struct bfq_data *bfqd)
@@ -64,7 +64,9 @@ index 0a0891b..d1d8e67 100644
+ bfq_mark_bfqq_IO_bound(bfqq);
+ else
+ bfq_clear_bfqq_IO_bound(bfqq);
++ /* Assuming that the flag in_large_burst is already correctly set */
+ if (bic->wr_time_left && bfqq->bfqd->low_latency &&
++ !bfq_bfqq_in_large_burst(bfqq) &&
+ bic->cooperations < bfqq->bfqd->bfq_coop_thresh) {
+ /*
+ * Start a weight raising period with the duration given by
@@ -85,9 +87,7 @@ index 0a0891b..d1d8e67 100644
+ bic->wr_time_left = 0;
+}
+
-+/*
-+ * Must be called with the queue_lock held.
-+ */
++/* Must be called with the queue_lock held. */
+static int bfqq_process_refs(struct bfq_queue *bfqq)
+{
+ int process_refs, io_refs;
@@ -98,23 +98,35 @@ index 0a0891b..d1d8e67 100644
+ return process_refs;
+}
+
- static void bfq_add_request(struct request *rq)
- {
- struct bfq_queue *bfqq = RQ_BFQQ(rq);
-@@ -602,8 +653,11 @@ static void bfq_add_request(struct request *rq)
+ /* Empty burst list and add just bfqq (see comments to bfq_handle_burst) */
+ static inline void bfq_reset_burst_list(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+@@ -815,7 +866,7 @@ static void bfq_add_request(struct request *rq)
+ bfq_rq_pos_tree_add(bfqd, bfqq);
if (!bfq_bfqq_busy(bfqq)) {
- int soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
-+ bfq_bfqq_cooperations(bfqq) < bfqd->bfq_coop_thresh &&
+- bool soft_rt,
++ bool soft_rt, coop_or_in_burst,
+ idle_for_long_time = time_is_before_jiffies(
+ bfqq->budget_timeout +
+ bfqd->bfq_wr_min_idle_time);
+@@ -839,11 +890,12 @@ static void bfq_add_request(struct request *rq)
+ bfqd->last_ins_in_burst = jiffies;
+ }
+
++ coop_or_in_burst = bfq_bfqq_in_large_burst(bfqq) ||
++ bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh;
+ soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+- !bfq_bfqq_in_large_burst(bfqq) &&
++ !coop_or_in_burst &&
time_is_before_jiffies(bfqq->soft_rt_next_start);
-- idle_for_long_time = time_is_before_jiffies(
-+ idle_for_long_time = bfq_bfqq_cooperations(bfqq) <
-+ bfqd->bfq_coop_thresh &&
-+ time_is_before_jiffies(
- bfqq->budget_timeout +
- bfqd->bfq_wr_min_idle_time);
+- interactive = !bfq_bfqq_in_large_burst(bfqq) &&
+- idle_for_long_time;
++ interactive = !coop_or_in_burst && idle_for_long_time;
entity->budget = max_t(unsigned long, bfqq->max_budget,
-@@ -624,11 +678,20 @@ static void bfq_add_request(struct request *rq)
+ bfq_serv_to_charge(next_rq, bfqq));
+
+@@ -862,11 +914,20 @@ static void bfq_add_request(struct request *rq)
if (!bfqd->low_latency)
goto add_bfqq_busy;
@@ -132,28 +144,22 @@ index 0a0891b..d1d8e67 100644
+ * requests have not been redirected to a shared queue)
+ * start a weight-raising period.
*/
-- if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt)) {
-+ if (old_wr_coeff == 1 && (idle_for_long_time || soft_rt) &&
+- if (old_wr_coeff == 1 && (interactive || soft_rt)) {
++ if (old_wr_coeff == 1 && (interactive || soft_rt) &&
+ (!bfq_bfqq_sync(bfqq) || bfqq->bic != NULL)) {
bfqq->wr_coeff = bfqd->bfq_wr_coeff;
- if (idle_for_long_time)
+ if (interactive)
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-@@ -642,9 +705,11 @@ static void bfq_add_request(struct request *rq)
+@@ -880,7 +941,7 @@ static void bfq_add_request(struct request *rq)
} else if (old_wr_coeff > 1) {
- if (idle_for_long_time)
+ if (interactive)
bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
-- else if (bfqq->wr_cur_max_time ==
-- bfqd->bfq_wr_rt_max_time &&
-- !soft_rt) {
-+ else if (bfq_bfqq_cooperations(bfqq) >=
-+ bfqd->bfq_coop_thresh ||
-+ (bfqq->wr_cur_max_time ==
-+ bfqd->bfq_wr_rt_max_time &&
-+ !soft_rt)) {
- bfqq->wr_coeff = 1;
- bfq_log_bfqq(bfqd, bfqq,
- "wrais ending at %lu, rais_max_time %u",
-@@ -660,18 +725,18 @@ static void bfq_add_request(struct request *rq)
+- else if (bfq_bfqq_in_large_burst(bfqq) ||
++ else if (coop_or_in_burst ||
+ (bfqq->wr_cur_max_time ==
+ bfqd->bfq_wr_rt_max_time &&
+ !soft_rt)) {
+@@ -899,18 +960,18 @@ static void bfq_add_request(struct request *rq)
/*
*
* The remaining weight-raising time is lower
@@ -184,7 +190,7 @@ index 0a0891b..d1d8e67 100644
*
* In addition, the application is now meeting
* the requirements for being deemed soft rt.
-@@ -706,6 +771,7 @@ static void bfq_add_request(struct request *rq)
+@@ -945,6 +1006,7 @@ static void bfq_add_request(struct request *rq)
bfqd->bfq_wr_rt_max_time;
}
}
@@ -192,7 +198,7 @@ index 0a0891b..d1d8e67 100644
if (old_wr_coeff != bfqq->wr_coeff)
entity->ioprio_changed = 1;
add_bfqq_busy:
-@@ -918,90 +984,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
+@@ -1156,90 +1218,35 @@ static void bfq_end_wr(struct bfq_data *bfqd)
spin_unlock_irq(bfqd->queue->queue_lock);
}
@@ -297,7 +303,7 @@ index 0a0891b..d1d8e67 100644
if (RB_EMPTY_ROOT(root))
return NULL;
-@@ -1020,7 +1031,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+@@ -1258,7 +1265,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
* next_request position).
*/
__bfqq = rb_entry(parent, struct bfq_queue, pos_node);
@@ -306,7 +312,7 @@ index 0a0891b..d1d8e67 100644
return __bfqq;
if (blk_rq_pos(__bfqq->next_rq) < sector)
-@@ -1031,7 +1042,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+@@ -1269,7 +1276,7 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
return NULL;
__bfqq = rb_entry(node, struct bfq_queue, pos_node);
@@ -315,7 +321,7 @@ index 0a0891b..d1d8e67 100644
return __bfqq;
return NULL;
-@@ -1040,14 +1051,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
+@@ -1278,14 +1285,12 @@ static struct bfq_queue *bfqq_close(struct bfq_data *bfqd)
/*
* bfqd - obvious
* cur_bfqq - passed in so that we don't decide that the current queue
@@ -334,7 +340,7 @@ index 0a0891b..d1d8e67 100644
{
struct bfq_queue *bfqq;
-@@ -1067,7 +1076,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+@@ -1305,7 +1310,7 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
* working closely on the same area of the disk. In that case,
* we can group them together and don't waste time idling.
*/
@@ -343,7 +349,7 @@ index 0a0891b..d1d8e67 100644
if (bfqq == NULL || bfqq == cur_bfqq)
return NULL;
-@@ -1094,6 +1103,305 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
+@@ -1332,6 +1337,307 @@ static struct bfq_queue *bfq_close_cooperator(struct bfq_data *bfqd,
return bfqq;
}
@@ -508,6 +514,8 @@ index 0a0891b..d1d8e67 100644
+ bfqq->bic->wr_time_left = 0;
+ bfqq->bic->saved_idle_window = bfq_bfqq_idle_window(bfqq);
+ bfqq->bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
++ bfqq->bic->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);
++ bfqq->bic->was_in_burst_list = !hlist_unhashed(&bfqq->burst_list_node);
+ bfqq->bic->cooperations++;
+ bfqq->bic->failed_cooperations = 0;
+}
@@ -649,13 +657,11 @@ index 0a0891b..d1d8e67 100644
/*
* If enough samples have been computed, return the current max budget
* stored in bfqd, which is dynamically updated according to the
-@@ -1237,63 +1545,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
+@@ -1475,61 +1781,6 @@ static struct request *bfq_check_fifo(struct bfq_queue *bfqq)
return rq;
}
--/*
-- * Must be called with the queue_lock held.
-- */
+-/* Must be called with the queue_lock held. */
-static int bfqq_process_refs(struct bfq_queue *bfqq)
-{
- int process_refs, io_refs;
@@ -713,7 +719,7 @@ index 0a0891b..d1d8e67 100644
static inline unsigned long bfq_bfqq_budget_left(struct bfq_queue *bfqq)
{
struct bfq_entity *entity = &bfqq->entity;
-@@ -2011,7 +2262,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
+@@ -2263,7 +2514,7 @@ static inline bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)
*/
static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
{
@@ -722,7 +728,7 @@ index 0a0891b..d1d8e67 100644
struct request *next_rq;
enum bfqq_expiration reason = BFQ_BFQQ_BUDGET_TIMEOUT;
-@@ -2021,17 +2272,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+@@ -2273,17 +2524,6 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");
@@ -740,7 +746,7 @@ index 0a0891b..d1d8e67 100644
if (bfq_may_expire_for_budg_timeout(bfqq) &&
!timer_pending(&bfqd->idle_slice_timer) &&
!bfq_bfqq_must_idle(bfqq))
-@@ -2070,10 +2310,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+@@ -2322,10 +2562,7 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
bfq_clear_bfqq_wait_request(bfqq);
del_timer(&bfqd->idle_slice_timer);
}
@@ -752,7 +758,7 @@ index 0a0891b..d1d8e67 100644
}
}
-@@ -2082,40 +2319,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
+@@ -2334,40 +2571,30 @@ static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)
* in flight (possibly waiting for a completion) or is idling for a
* new request, then keep it.
*/
@@ -800,25 +806,25 @@ index 0a0891b..d1d8e67 100644
jiffies_to_msecs(bfqq->wr_cur_max_time),
bfqq->wr_coeff,
bfqq->entity.weight, bfqq->entity.orig_weight);
-@@ -2124,11 +2351,15 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+@@ -2376,12 +2603,16 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
entity->orig_weight * bfqq->wr_coeff);
if (entity->ioprio_changed)
bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");
+
/*
- * If too much time has elapsed from the beginning
-- * of this weight-raising, stop it.
+ * If the queue was activated in a burst, or
+ * too much time has elapsed from the beginning
+- * of this weight-raising, then end weight raising.
+ * of this weight-raising period, or the queue has
+ * exceeded the acceptable number of cooperations,
-+ * stop it.
++ * then end weight raising.
*/
-- if (time_is_before_jiffies(bfqq->last_wr_start_finish +
-+ if (bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
-+ time_is_before_jiffies(bfqq->last_wr_start_finish +
+ if (bfq_bfqq_in_large_burst(bfqq) ||
++ bfq_bfqq_cooperations(bfqq) >= bfqd->bfq_coop_thresh ||
+ time_is_before_jiffies(bfqq->last_wr_start_finish +
bfqq->wr_cur_max_time)) {
bfqq->last_wr_start_finish = jiffies;
- bfq_log_bfqq(bfqd, bfqq,
-@@ -2136,11 +2367,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
+@@ -2390,11 +2621,13 @@ static void bfq_update_wr_data(struct bfq_data *bfqd,
bfqq->last_wr_start_finish,
jiffies_to_msecs(bfqq->wr_cur_max_time));
bfq_bfqq_end_wr(bfqq);
@@ -835,7 +841,7 @@ index 0a0891b..d1d8e67 100644
}
/*
-@@ -2377,6 +2610,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
+@@ -2642,6 +2875,25 @@ static inline void bfq_init_icq(struct io_cq *icq)
struct bfq_io_cq *bic = icq_to_bic(icq);
bic->ttime.last_end_request = jiffies;
@@ -861,7 +867,7 @@ index 0a0891b..d1d8e67 100644
}
static void bfq_exit_icq(struct io_cq *icq)
-@@ -2390,6 +2642,13 @@ static void bfq_exit_icq(struct io_cq *icq)
+@@ -2655,6 +2907,13 @@ static void bfq_exit_icq(struct io_cq *icq)
}
if (bic->bfqq[BLK_RW_SYNC]) {
@@ -875,7 +881,7 @@ index 0a0891b..d1d8e67 100644
bfq_exit_bfqq(bfqd, bic->bfqq[BLK_RW_SYNC]);
bic->bfqq[BLK_RW_SYNC] = NULL;
}
-@@ -2678,6 +2937,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
+@@ -2944,6 +3203,10 @@ static void bfq_update_idle_window(struct bfq_data *bfqd,
if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq))
return;
@@ -886,7 +892,7 @@ index 0a0891b..d1d8e67 100644
enable_idle = bfq_bfqq_idle_window(bfqq);
if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||
-@@ -2725,6 +2988,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+@@ -2991,6 +3254,7 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
if (bfqq->entity.service > bfq_max_budget(bfqd) / 8 ||
!BFQQ_SEEKY(bfqq))
bfq_update_idle_window(bfqd, bfqq, bic);
@@ -894,7 +900,7 @@ index 0a0891b..d1d8e67 100644
bfq_log_bfqq(bfqd, bfqq,
"rq_enqueued: idle_window=%d (seeky %d, mean %llu)",
-@@ -2785,13 +3049,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+@@ -3051,13 +3315,49 @@ static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
static void bfq_insert_request(struct request_queue *q, struct request *rq)
{
struct bfq_data *bfqd = q->elevator->elevator_data;
@@ -945,7 +951,7 @@ index 0a0891b..d1d8e67 100644
rq->fifo_time = jiffies + bfqd->bfq_fifo_expire[rq_is_sync(rq)];
list_add_tail(&rq->queuelist, &bfqq->fifo);
-@@ -2956,18 +3256,6 @@ static void bfq_put_request(struct request *rq)
+@@ -3222,18 +3522,6 @@ static void bfq_put_request(struct request *rq)
}
}
@@ -964,7 +970,7 @@ index 0a0891b..d1d8e67 100644
/*
* Returns NULL if a new bfqq should be allocated, or the old bfqq if this
* was the last process referring to said bfqq.
-@@ -2976,6 +3264,9 @@ static struct bfq_queue *
+@@ -3242,6 +3530,9 @@ static struct bfq_queue *
bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
{
bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");
@@ -974,7 +980,7 @@ index 0a0891b..d1d8e67 100644
if (bfqq_process_refs(bfqq) == 1) {
bfqq->pid = current->pid;
bfq_clear_bfqq_coop(bfqq);
-@@ -3004,6 +3295,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
+@@ -3270,6 +3561,7 @@ static int bfq_set_request(struct request_queue *q, struct request *rq,
struct bfq_queue *bfqq;
struct bfq_group *bfqg;
unsigned long flags;
@@ -982,9 +988,21 @@ index 0a0891b..d1d8e67 100644
might_sleep_if(gfp_mask & __GFP_WAIT);
-@@ -3022,24 +3314,14 @@ new_queue:
+@@ -3287,25 +3579,26 @@ new_queue:
+ if (bfqq == NULL || bfqq == &bfqd->oom_bfqq) {
bfqq = bfq_get_queue(bfqd, bfqg, is_sync, bic, gfp_mask);
bic_set_bfqq(bic, bfqq, is_sync);
++ if (split && is_sync) {
++ if ((bic->was_in_burst_list && bfqd->large_burst) ||
++ bic->saved_in_large_burst)
++ bfq_mark_bfqq_in_large_burst(bfqq);
++ else {
++ bfq_clear_bfqq_in_large_burst(bfqq);
++ if (bic->was_in_burst_list)
++ hlist_add_head(&bfqq->burst_list_node,
++ &bfqd->burst_list);
++ }
++ }
} else {
- /*
- * If the queue was seeky for too long, break it apart.
@@ -1009,7 +1027,7 @@ index 0a0891b..d1d8e67 100644
}
bfqq->allocated[rw]++;
-@@ -3050,6 +3332,26 @@ new_queue:
+@@ -3316,6 +3609,26 @@ new_queue:
rq->elv.priv[0] = bic;
rq->elv.priv[1] = bfqq;
@@ -1076,10 +1094,10 @@ index c4831b7..546a254 100644
{
if (bfqd->in_service_bic != NULL) {
diff --git a/block/bfq.h b/block/bfq.h
-index a83e69d..ebbd040 100644
+index 0378c86..93a2d24 100644
--- a/block/bfq.h
+++ b/block/bfq.h
-@@ -215,18 +215,21 @@ struct bfq_group;
+@@ -216,18 +216,21 @@ struct bfq_group;
* idle @bfq_queue with no outstanding requests, then
* the task associated with the queue it is deemed as
* soft real-time (see the comments to the function
@@ -1107,7 +1125,7 @@ index a83e69d..ebbd040 100644
* All the fields are protected by the queue lock of the containing bfqd.
*/
struct bfq_queue {
-@@ -264,6 +267,7 @@ struct bfq_queue {
+@@ -267,6 +270,7 @@ struct bfq_queue {
unsigned int requests_within_timer;
pid_t pid;
@@ -1115,7 +1133,7 @@ index a83e69d..ebbd040 100644
/* weight-raising fields */
unsigned long wr_cur_max_time;
-@@ -293,12 +297,34 @@ struct bfq_ttime {
+@@ -296,12 +300,42 @@ struct bfq_ttime {
* @icq: associated io_cq structure
* @bfqq: array of two process queues, the sync and the async
* @ttime: associated @bfq_ttime struct
@@ -1130,6 +1148,11 @@ index a83e69d..ebbd040 100644
+ * window
+ * @saved_IO_bound: same purpose as the previous two fields for the I/O
+ * bound classification of a queue
++ * @saved_in_large_burst: same purpose as the previous fields for the
++ * value of the field keeping the queue's belonging
++ * to a large burst
++ * @was_in_burst_list: true if the queue belonged to a burst list
++ * before its merge with another cooperating queue
+ * @cooperations: counter of consecutive successful queue merges underwent
+ * by any of the process' @bfq_queues
+ * @failed_cooperations: counter of consecutive failed queue merges of any
@@ -1142,15 +1165,18 @@ index a83e69d..ebbd040 100644
int ioprio;
+
+ unsigned int wr_time_left;
-+ unsigned int saved_idle_window;
-+ unsigned int saved_IO_bound;
++ bool saved_idle_window;
++ bool saved_IO_bound;
++
++ bool saved_in_large_burst;
++ bool was_in_burst_list;
+
+ unsigned int cooperations;
+ unsigned int failed_cooperations;
};
enum bfq_device_speed {
-@@ -511,7 +537,7 @@ enum bfqq_state_flags {
+@@ -537,7 +571,7 @@ enum bfqq_state_flags {
BFQ_BFQQ_FLAG_prio_changed, /* task priority has changed */
BFQ_BFQQ_FLAG_sync, /* synchronous queue */
BFQ_BFQQ_FLAG_budget_new, /* no completion with this budget */
@@ -1159,7 +1185,7 @@ index a83e69d..ebbd040 100644
* bfqq has timed-out at least once
* having consumed at most 2/10 of
* its budget
-@@ -520,12 +546,13 @@ enum bfqq_state_flags {
+@@ -550,12 +584,13 @@ enum bfqq_state_flags {
* bfqq has proved to be slow and
* seeky until budget timeout
*/
@@ -1175,7 +1201,7 @@ index a83e69d..ebbd040 100644
};
#define BFQ_BFQQ_FNS(name) \
-@@ -554,6 +581,7 @@ BFQ_BFQQ_FNS(IO_bound);
+@@ -585,6 +620,7 @@ BFQ_BFQQ_FNS(in_large_burst);
BFQ_BFQQ_FNS(constantly_seeky);
BFQ_BFQQ_FNS(coop);
BFQ_BFQQ_FNS(split_coop);
@@ -1184,5 +1210,5 @@ index a83e69d..ebbd040 100644
#undef BFQ_BFQQ_FNS
--
-2.0.3
+2.1.2
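For readers tracing the EQM changes above, the new rule that ends weight raising in bfq_update_wr_data() can be condensed into a stand-alone predicate. This is an illustrative sketch only, not kernel code: the helper name and the plain jiffies comparison stand in for bfq_bfqq_in_large_burst(), bfq_bfqq_cooperations() and time_is_before_jiffies().

```c
#include <stdbool.h>

/* Illustrative model of the weight-raising termination rule added by the
 * v7r6 EQM patch: raising ends if the queue was activated in a large
 * burst, has been merged too many times, or its raising period expired.
 * Argument names mirror the bfq_queue/bfq_data fields they stand for. */
static bool bfq_should_end_wr(bool in_large_burst,
                              unsigned int cooperations,
                              unsigned int coop_thresh,
                              unsigned long now,
                              unsigned long last_wr_start_finish,
                              unsigned long wr_cur_max_time)
{
	return in_large_burst ||
	       cooperations >= coop_thresh ||
	       now > last_wr_start_finish + wr_cur_max_time; /* time_is_before_jiffies() analogue */
}
```

A similar combined burst/cooperation test also appears in bfq_add_request() when choosing the next wr_cur_max_time.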
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:05 Mike Pagano
From: Mike Pagano @ 2014-11-29 18:05 UTC
To: gentoo-commits
commit: fece5ecf1633709a681cc9b0bca7897a3ec477e1
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:04:53 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:04:53 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=fece5ecf
Merge branch '3.16' of git+ssh://git.overlays.gentoo.org/proj/linux-patches into 3.16
update readme file
0000_README | 4 +
1006_linux-3.16.7.patch | 6873 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 6877 insertions(+)
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:05 Mike Pagano
From: Mike Pagano @ 2014-11-29 18:05 UTC
To: gentoo-commits
commit: 962dfa012d1b748e4df287f9ba85609a57d18345
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:04:32 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:04:32 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=962dfa01
Update readme
---
0000_README | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/0000_README b/0000_README
index a7526a7..b532df4 100644
--- a/0000_README
+++ b/0000_README
@@ -102,17 +102,17 @@ Patch: 5000_enable-additional-cpu-optimizations-for-gcc.patch
From: https://github.com/graysky2/kernel_gcc_patch/
Desc: Kernel patch enables gcc optimizations for additional CPUs.
-Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r5-3.16.patch
+Patch: 5001_BFQ-1-block-cgroups-kconfig-build-bits-for-v7r6-3.16.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
-Desc: BFQ v7r5 patch 1 for 3.16: Build, cgroups and kconfig bits
+Desc: BFQ v7r6 patch 1 for 3.16: Build, cgroups and kconfig bits
-Patch: 5002_BFQ-2-block-introduce-the-v7r5-I-O-sched-for-3.16.patch1
+Patch: 5002_BFQ-2-block-introduce-the-v7r6-I-O-sched-for-3.16.patch1
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
-Desc: BFQ v7r5 patch 2 for 3.16: BFQ Scheduler
+Desc: BFQ v7r6 patch 2 for 3.16: BFQ Scheduler
-Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r5-for-3.16.0.patch
+Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
-Desc: BFQ v7r5 patch 3 for 3.16: Early Queue Merge (EQM)
+Desc: BFQ v7r6 patch 3 for 3.16: Early Queue Merge (EQM)
Patch: 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
From: http://multipath-tcp.org/
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-11-29 18:11 Mike Pagano
From: Mike Pagano @ 2014-11-29 18:11 UTC
To: gentoo-commits
commit: 3c8127d4ebd36a23547beb8064cbedc12447d782
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Sat Nov 29 18:11:33 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Sat Nov 29 18:11:33 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=3c8127d4
Update multipath patch
---
0000_README | 2 +-
... => 5010_multipath-tcp-v3.16-075df3a63833.patch | 328 +++++++++++++++++++--
2 files changed, 312 insertions(+), 18 deletions(-)
diff --git a/0000_README b/0000_README
index 0ab3968..8719a11 100644
--- a/0000_README
+++ b/0000_README
@@ -118,7 +118,7 @@ Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
Desc: BFQ v7r6 patch 3 for 3.16: Early Queue Merge (EQM)
-Patch: 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+Patch: 5010_multipath-tcp-v3.16-075df3a63833.patch
From: http://multipath-tcp.org/
Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
diff --git a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch b/5010_multipath-tcp-v3.16-075df3a63833.patch
similarity index 98%
rename from 5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
rename to 5010_multipath-tcp-v3.16-075df3a63833.patch
index 3000da3..7520b4a 100644
--- a/5010_multipath-tcp-v3.16-872d7f6c6f4e.patch
+++ b/5010_multipath-tcp-v3.16-075df3a63833.patch
@@ -2572,10 +2572,10 @@ index 4db3c2a1679c..04cb17d4b0ce 100644
goto drop;
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..630434db0085 100644
+index 05c57f0fcabe..811286a6aa9c 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
-@@ -556,6 +556,30 @@ config TCP_CONG_ILLINOIS
+@@ -556,6 +556,38 @@ config TCP_CONG_ILLINOIS
For further details see:
http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
@@ -2603,10 +2603,18 @@ index 05c57f0fcabe..630434db0085 100644
+ wVegas congestion control for MPTCP
+ To enable it, just put 'wvegas' in tcp_congestion_control
+
++config TCP_CONG_BALIA
++ tristate "MPTCP BALIA CONGESTION CONTROL"
++ depends on MPTCP
++ default n
++ ---help---
++ Multipath TCP Balanced Linked Adaptation Congestion Control
++ To enable it, just put 'balia' in tcp_congestion_control
++
choice
prompt "Default TCP congestion control"
default DEFAULT_CUBIC
-@@ -584,6 +608,15 @@ choice
+@@ -584,6 +616,18 @@ choice
config DEFAULT_WESTWOOD
bool "Westwood" if TCP_CONG_WESTWOOD=y
@@ -2619,15 +2627,19 @@ index 05c57f0fcabe..630434db0085 100644
+ config DEFAULT_WVEGAS
+ bool "Wvegas" if TCP_CONG_WVEGAS=y
+
++ config DEFAULT_BALIA
++ bool "Balia" if TCP_CONG_BALIA=y
++
config DEFAULT_RENO
bool "Reno"
-@@ -605,6 +638,8 @@ config DEFAULT_TCP_CONG
+@@ -605,6 +649,9 @@ config DEFAULT_TCP_CONG
default "vegas" if DEFAULT_VEGAS
default "westwood" if DEFAULT_WESTWOOD
default "veno" if DEFAULT_VENO
+ default "coupled" if DEFAULT_COUPLED
+ default "wvegas" if DEFAULT_WVEGAS
++ default "balia" if DEFAULT_BALIA
default "reno" if DEFAULT_RENO
default "cubic"
@@ -7087,10 +7099,10 @@ index 000000000000..cdfc03adabf8
+
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
new file mode 100644
-index 000000000000..35561a7012e3
+index 000000000000..2feb3e873206
--- /dev/null
+++ b/net/mptcp/Makefile
-@@ -0,0 +1,20 @@
+@@ -0,0 +1,21 @@
+#
+## Makefile for MultiPath TCP support code.
+#
@@ -7104,6 +7116,7 @@ index 000000000000..35561a7012e3
+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
++obj-$(CONFIG_TCP_CONG_BALIA) += mptcp_balia.o
+obj-$(CONFIG_MPTCP_FULLMESH) += mptcp_fullmesh.o
+obj-$(CONFIG_MPTCP_NDIFFPORTS) += mptcp_ndiffports.o
+obj-$(CONFIG_MPTCP_BINDER) += mptcp_binder.o
@@ -7111,6 +7124,279 @@ index 000000000000..35561a7012e3
+
+mptcp-$(subst m,y,$(CONFIG_IPV6)) += mptcp_ipv6.o
+
+diff --git a/net/mptcp/mptcp_balia.c b/net/mptcp/mptcp_balia.c
+new file mode 100644
+index 000000000000..5cc224d80b01
+--- /dev/null
++++ b/net/mptcp/mptcp_balia.c
+@@ -0,0 +1,267 @@
++/*
++ * MPTCP implementation - Balia Congestion Control
++ * (Balanced Linked Adaptation Algorithm)
++ *
++ * Analysis, Design and Implementation:
++ * Qiuyu Peng <qpeng@caltech.edu>
++ * Anwar Walid <anwar@research.bell-labs.com>
++ * Jaehyun Hwang <jh.hwang@alcatel-lucent.com>
++ * Steven H. Low <slow@caltech.edu>
++ *
++ * This program is free software; you can redistribute it and/or
++ * modify it under the terms of the GNU General Public License
++ * as published by the Free Software Foundation; either version
++ * 2 of the License, or (at your option) any later version.
++ */
++
++#include <net/tcp.h>
++#include <net/mptcp.h>
++
++#include <linux/module.h>
++
++/* The variable 'rate' (i.e., x_r) will be scaled down
++ * e.g., from B/s to KB/s, MB/s, or GB/s
++ * if max_rate > 2^rate_scale_limit
++ */
++
++static int rate_scale_limit = 30;
++static int scale_num = 10;
++
++struct mptcp_balia {
++ u64 ai;
++ u64 md;
++ bool forced_update;
++};
++
++static inline int mptcp_balia_sk_can_send(const struct sock *sk)
++{
++ return mptcp_sk_can_send(sk) && tcp_sk(sk)->srtt_us;
++}
++
++static inline u64 mptcp_get_ai(const struct sock *meta_sk)
++{
++ return ((struct mptcp_balia *)inet_csk_ca(meta_sk))->ai;
++}
++
++static inline void mptcp_set_ai(const struct sock *meta_sk, u64 ai)
++{
++ ((struct mptcp_balia *)inet_csk_ca(meta_sk))->ai = ai;
++}
++
++static inline u64 mptcp_get_md(const struct sock *meta_sk)
++{
++ return ((struct mptcp_balia *)inet_csk_ca(meta_sk))->md;
++}
++
++static inline void mptcp_set_md(const struct sock *meta_sk, u64 md)
++{
++ ((struct mptcp_balia *)inet_csk_ca(meta_sk))->md = md;
++}
++
++static inline u64 mptcp_balia_scale(u64 val, int scale)
++{
++ return (u64) val << scale;
++}
++
++static inline bool mptcp_get_forced(const struct sock *meta_sk)
++{
++ return ((struct mptcp_balia *)inet_csk_ca(meta_sk))->forced_update;
++}
++
++static inline void mptcp_set_forced(const struct sock *meta_sk, bool force)
++{
++ ((struct mptcp_balia *)inet_csk_ca(meta_sk))->forced_update = force;
++}
++
++static void mptcp_balia_recalc_ai(const struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ const struct sock *sub_sk;
++ int can_send = 0;
++ u64 max_rate = 0, rate = 0, sum_rate = 0;
++ u64 alpha = 0, ai = 0, md = 0;
++ int num_scale_down = 0;
++
++ if (!mpcb)
++ return;
++
++ /* Only one subflow left - fall back to normal reno-behavior */
++ if (mpcb->cnt_established <= 1)
++ goto exit;
++
++ /* Find max_rate first */
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ if (!mptcp_balia_sk_can_send(sub_sk))
++ continue;
++
++ can_send++;
++
++ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
++ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
++ sum_rate += tmp;
++
++ if (tmp >= max_rate)
++ max_rate = tmp;
++ }
++
++ /* No subflow is able to send - we don't care anymore */
++ if (unlikely(!can_send))
++ goto exit;
++
++ rate = div_u64((u64)tp->mss_cache * tp->snd_cwnd *
++ (USEC_PER_SEC << 3), tp->srtt_us);
++ alpha = div64_u64(max_rate, rate);
++
++ /* Scale down max_rate from B/s to KB/s, MB/s, or GB/s
++ * if max_rate is too high (i.e., >2^30)
++ */
++ while (max_rate > mptcp_balia_scale(1, rate_scale_limit)) {
++ max_rate >>= scale_num;
++ num_scale_down++;
++ }
++
++ if (num_scale_down) {
++ sum_rate = 0;
++ mptcp_for_each_sk(mpcb, sub_sk) {
++ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
++ u64 tmp;
++
++ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
++ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
++ tmp >>= (scale_num * num_scale_down);
++
++ sum_rate += tmp;
++ }
++ rate >>= (scale_num * num_scale_down);
++ }
++
++ /* (sum_rate)^2 * 10 * w_r
++ * ai = ------------------------------------
++ * (x_r + max_rate) * (4x_r + max_rate)
++ */
++ sum_rate *= sum_rate;
++
++ ai = div64_u64(sum_rate * 10, rate + max_rate);
++ ai = div64_u64(ai * tp->snd_cwnd, (rate << 2) + max_rate);
++
++ if (unlikely(!ai))
++ ai = tp->snd_cwnd;
++
++ md = ((tp->snd_cwnd >> 1) * min(mptcp_balia_scale(alpha, scale_num),
++ mptcp_balia_scale(3, scale_num) >> 1))
++ >> scale_num;
++
++exit:
++ mptcp_set_ai(sk, ai);
++ mptcp_set_md(sk, md);
++}
++
++static void mptcp_balia_init(struct sock *sk)
++{
++ if (mptcp(tcp_sk(sk))) {
++ mptcp_set_forced(sk, 0);
++ mptcp_set_ai(sk, 0);
++ mptcp_set_md(sk, 0);
++ }
++}
++
++static void mptcp_balia_cwnd_event(struct sock *sk, enum tcp_ca_event event)
++{
++ if (event == CA_EVENT_COMPLETE_CWR || event == CA_EVENT_LOSS)
++ mptcp_balia_recalc_ai(sk);
++}
++
++static void mptcp_balia_set_state(struct sock *sk, u8 ca_state)
++{
++ if (!mptcp(tcp_sk(sk)))
++ return;
++
++ mptcp_set_forced(sk, 1);
++}
++
++static void mptcp_balia_cong_avoid(struct sock *sk, u32 ack, u32 acked)
++{
++ struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++ int snd_cwnd;
++
++ if (!mptcp(tp)) {
++ tcp_reno_cong_avoid(sk, ack, acked);
++ return;
++ }
++
++ if (!tcp_is_cwnd_limited(sk))
++ return;
++
++ if (tp->snd_cwnd <= tp->snd_ssthresh) {
++ /* In "safe" area, increase. */
++ tcp_slow_start(tp, acked);
++ mptcp_balia_recalc_ai(sk);
++ return;
++ }
++
++ if (mptcp_get_forced(mptcp_meta_sk(sk))) {
++ mptcp_balia_recalc_ai(sk);
++ mptcp_set_forced(sk, 0);
++ }
++
++ if (mpcb->cnt_established > 1)
++ snd_cwnd = (int) mptcp_get_ai(sk);
++ else
++ snd_cwnd = tp->snd_cwnd;
++
++ if (tp->snd_cwnd_cnt >= snd_cwnd) {
++ if (tp->snd_cwnd < tp->snd_cwnd_clamp) {
++ tp->snd_cwnd++;
++ mptcp_balia_recalc_ai(sk);
++ }
++
++ tp->snd_cwnd_cnt = 0;
++ } else {
++ tp->snd_cwnd_cnt++;
++ }
++}
++
++static u32 mptcp_balia_ssthresh(struct sock *sk)
++{
++ const struct tcp_sock *tp = tcp_sk(sk);
++ const struct mptcp_cb *mpcb = tp->mpcb;
++
++ if (unlikely(!mptcp(tp) || mpcb->cnt_established <= 1))
++ return tcp_reno_ssthresh(sk);
++ else
++ return max((u32)(tp->snd_cwnd - mptcp_get_md(sk)), 1U);
++}
++
++static struct tcp_congestion_ops mptcp_balia = {
++ .init = mptcp_balia_init,
++ .ssthresh = mptcp_balia_ssthresh,
++ .cong_avoid = mptcp_balia_cong_avoid,
++ .cwnd_event = mptcp_balia_cwnd_event,
++ .set_state = mptcp_balia_set_state,
++ .owner = THIS_MODULE,
++ .name = "balia",
++};
++
++static int __init mptcp_balia_register(void)
++{
++ BUILD_BUG_ON(sizeof(struct mptcp_balia) > ICSK_CA_PRIV_SIZE);
++ return tcp_register_congestion_control(&mptcp_balia);
++}
++
++static void __exit mptcp_balia_unregister(void)
++{
++ tcp_unregister_congestion_control(&mptcp_balia);
++}
++
++module_init(mptcp_balia_register);
++module_exit(mptcp_balia_unregister);
++
++MODULE_AUTHOR("Jaehyun Hwang, Anwar Walid, Qiuyu Peng, Steven H. Low");
++MODULE_LICENSE("GPL");
++MODULE_DESCRIPTION("MPTCP BALIA CONGESTION CONTROL ALGORITHM");
++MODULE_VERSION("0.1");
diff --git a/net/mptcp/mptcp_binder.c b/net/mptcp/mptcp_binder.c
new file mode 100644
index 000000000000..95d8da560715
@@ -10289,10 +10575,10 @@ index 000000000000..28dfa0479f5e
+}
diff --git a/net/mptcp/mptcp_fullmesh.c b/net/mptcp/mptcp_fullmesh.c
new file mode 100644
-index 000000000000..3a54413ce25b
+index 000000000000..2e4895c9e49c
--- /dev/null
+++ b/net/mptcp/mptcp_fullmesh.c
-@@ -0,0 +1,1722 @@
+@@ -0,0 +1,1730 @@
+#include <linux/module.h>
+
+#include <net/mptcp.h>
@@ -11282,10 +11568,10 @@ index 000000000000..3a54413ce25b
+static int inet6_addr_event(struct notifier_block *this,
+ unsigned long event, void *ptr);
+
-+static int ipv6_is_in_dad_state(const struct inet6_ifaddr *ifa)
++static bool ipv6_dad_finished(const struct inet6_ifaddr *ifa)
+{
-+ return (ifa->flags & IFA_F_TENTATIVE) &&
-+ ifa->state == INET6_IFADDR_STATE_DAD;
++ return !(ifa->flags & IFA_F_TENTATIVE) ||
++ ifa->state > INET6_IFADDR_STATE_DAD;
+}
+
+static void dad_init_timer(struct mptcp_dad_data *data,
@@ -11304,14 +11590,22 @@ index 000000000000..3a54413ce25b
+{
+ struct mptcp_dad_data *data = (struct mptcp_dad_data *)arg;
+
-+ if (ipv6_is_in_dad_state(data->ifa)) {
++ /* DAD failed or IP brought down? */
++ if (data->ifa->state == INET6_IFADDR_STATE_ERRDAD ||
++ data->ifa->state == INET6_IFADDR_STATE_DEAD)
++ goto exit;
++
++ if (!ipv6_dad_finished(data->ifa)) {
+ dad_init_timer(data, data->ifa);
+ add_timer(&data->timer);
-+ } else {
-+ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
-+ in6_ifa_put(data->ifa);
-+ kfree(data);
++ return;
+ }
++
++ inet6_addr_event(NULL, NETDEV_UP, data->ifa);
++
++exit:
++ in6_ifa_put(data->ifa);
++ kfree(data);
+}
+
+static inline void dad_setup_timer(struct inet6_ifaddr *ifa)
@@ -11376,7 +11670,7 @@ index 000000000000..3a54413ce25b
+ event == NETDEV_CHANGE))
+ return NOTIFY_DONE;
+
-+ if (ipv6_is_in_dad_state(ifa6))
++ if (!ipv6_dad_finished(ifa6))
+ dad_setup_timer(ifa6);
+ else
+ addr6_event_handler(ifa6, event, net);
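Since this commit pulls in the new Balia congestion control, the core additive-increase term from mptcp_balia_recalc_ai() is worth restating outside the kernel's fixed-point helpers. The sketch below is illustrative only: it drops the rate-scaling loop, replaces div_u64()/div64_u64() with plain C division in the same order, and the function signature is made up for the example.

```c
/* Illustrative reduction of the Balia increase term from the patch:
 *
 *            (sum_rate)^2 * 10 * w_r
 *  ai = ------------------------------------
 *       (x_r + max_rate) * (4*x_r + max_rate)
 *
 * where x_r is this subflow's sending rate, max_rate and sum_rate are
 * taken over all established subflows, and w_r is the subflow's
 * congestion window.  Divisions happen in the same order as the kernel
 * code, so rounding matches. */
static unsigned long long balia_ai(unsigned long long rate,
                                   unsigned long long max_rate,
                                   unsigned long long sum_rate,
                                   unsigned long long cwnd)
{
	unsigned long long ai;

	ai = (sum_rate * sum_rate * 10ULL) / (rate + max_rate);
	ai = (ai * cwnd) / ((rate << 2) + max_rate);

	return ai ? ai : cwnd; /* kernel falls back to snd_cwnd when ai == 0 */
}
```

With two subflows of equal rate r (so max_rate = r, sum_rate = 2r) this collapses to 4 * cwnd; as the rates diverge, the faster subflow keeps the larger increase.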
* [gentoo-commits] proj/linux-patches:3.16 commit in: /
@ 2014-12-16 17:29 Mike Pagano
From: Mike Pagano @ 2014-12-16 17:29 UTC
To: gentoo-commits
commit: b40e4b7205dd73330cf29bf39590327f973a473b
Author: Mike Pagano <mpagano <AT> gentoo <DOT> org>
AuthorDate: Tue Dec 16 17:29:50 2014 +0000
Commit: Mike Pagano <mpagano <AT> gentoo <DOT> org>
CommitDate: Tue Dec 16 17:29:50 2014 +0000
URL: http://sources.gentoo.org/gitweb/?p=proj/linux-patches.git;a=commit;h=b40e4b72
Updating multipath tcp patch
---
0000_README | 2 +-
... => 5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch | 250 ++++++++++++---------
2 files changed, 139 insertions(+), 113 deletions(-)
diff --git a/0000_README b/0000_README
index 8719a11..7122ab1 100644
--- a/0000_README
+++ b/0000_README
@@ -118,7 +118,7 @@ Patch: 5003_BFQ-3-block-add-Early-Queue-Merge-EQM-v7r6-for-3.16.0.patch
From: http://algo.ing.unimo.it/people/paolo/disk_sched/
Desc: BFQ v7r6 patch 3 for 3.16: Early Queue Merge (EQM)
-Patch: 5010_multipath-tcp-v3.16-075df3a63833.patch
+Patch: 5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
From: http://multipath-tcp.org/
Desc: Patch for simultaneous use of several IP-addresses/interfaces in TCP for better resource utilization, better throughput and smoother reaction to failures.
diff --git a/5010_multipath-tcp-v3.16-075df3a63833.patch b/5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
similarity index 98%
rename from 5010_multipath-tcp-v3.16-075df3a63833.patch
rename to 5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
index 7520b4a..2858f5b 100644
--- a/5010_multipath-tcp-v3.16-075df3a63833.patch
+++ b/5010_multipath-tcp-v3.16-ac0ec67aa8bb.patch
@@ -1991,7 +1991,7 @@ index 156350745700..0e23cae8861f 100644
struct timewait_sock_ops;
struct inet_hashinfo;
diff --git a/include/net/tcp.h b/include/net/tcp.h
-index 7286db80e8b8..ff92e74cd684 100644
+index 7286db80e8b8..2130c1c7fe6e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -177,6 +177,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
@@ -2030,7 +2030,7 @@ index 7286db80e8b8..ff92e74cd684 100644
extern struct inet_timewait_death_row tcp_death_row;
/* sysctl variables for tcp */
-@@ -344,6 +366,107 @@ extern struct proto tcp_prot;
+@@ -344,6 +366,108 @@ extern struct proto tcp_prot;
#define TCP_ADD_STATS_USER(net, field, val) SNMP_ADD_STATS_USER((net)->mib.tcp_statistics, field, val)
#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
@@ -2040,6 +2040,7 @@ index 7286db80e8b8..ff92e74cd684 100644
+
+struct mptcp_options_received;
+
++void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited);
+void tcp_enter_quickack_mode(struct sock *sk);
+int tcp_close_state(struct sock *sk);
+void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
@@ -2138,7 +2139,7 @@ index 7286db80e8b8..ff92e74cd684 100644
void tcp_tasklet_init(void);
void tcp_v4_err(struct sk_buff *skb, u32);
-@@ -440,6 +563,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -440,6 +564,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
size_t len, int nonblock, int flags, int *addr_len);
void tcp_parse_options(const struct sk_buff *skb,
struct tcp_options_received *opt_rx,
@@ -2146,7 +2147,7 @@ index 7286db80e8b8..ff92e74cd684 100644
int estab, struct tcp_fastopen_cookie *foc);
const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
-@@ -493,14 +617,8 @@ static inline u32 tcp_cookie_time(void)
+@@ -493,14 +618,8 @@ static inline u32 tcp_cookie_time(void)
u32 __cookie_v4_init_sequence(const struct iphdr *iph, const struct tcphdr *th,
u16 *mssp);
@@ -2163,7 +2164,7 @@ index 7286db80e8b8..ff92e74cd684 100644
#endif
__u32 cookie_init_timestamp(struct request_sock *req);
-@@ -516,13 +634,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
+@@ -516,13 +635,6 @@ u32 __cookie_v6_init_sequence(const struct ipv6hdr *iph,
const struct tcphdr *th, u16 *mssp);
__u32 cookie_v6_init_sequence(struct sock *sk, const struct sk_buff *skb,
__u16 *mss);
@@ -2177,7 +2178,7 @@ index 7286db80e8b8..ff92e74cd684 100644
#endif
/* tcp_output.c */
-@@ -551,10 +662,17 @@ void tcp_send_delayed_ack(struct sock *sk);
+@@ -551,10 +663,17 @@ void tcp_send_delayed_ack(struct sock *sk);
void tcp_send_loss_probe(struct sock *sk);
bool tcp_schedule_loss_probe(struct sock *sk);
@@ -2195,7 +2196,7 @@ index 7286db80e8b8..ff92e74cd684 100644
/* tcp_timer.c */
void tcp_init_xmit_timers(struct sock *);
-@@ -703,14 +821,27 @@ void tcp_send_window_probe(struct sock *sk);
+@@ -703,14 +822,27 @@ void tcp_send_window_probe(struct sock *sk);
*/
struct tcp_skb_cb {
union {
@@ -2226,7 +2227,7 @@ index 7286db80e8b8..ff92e74cd684 100644
__u8 tcp_flags; /* TCP header flags. (tcp[13]) */
__u8 sacked; /* State flags for SACK/FACK. */
-@@ -1075,7 +1206,8 @@ u32 tcp_default_init_rwnd(u32 mss);
+@@ -1075,7 +1207,8 @@ u32 tcp_default_init_rwnd(u32 mss);
/* Determine a window scaling and initial window to offer. */
void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
__u32 *window_clamp, int wscale_ok,
@@ -2236,7 +2237,7 @@ index 7286db80e8b8..ff92e74cd684 100644
static inline int tcp_win_from_space(int space)
{
-@@ -1084,15 +1216,34 @@ static inline int tcp_win_from_space(int space)
+@@ -1084,6 +1217,19 @@ static inline int tcp_win_from_space(int space)
space - (space>>sysctl_tcp_adv_win_scale);
}
@@ -2256,22 +2257,7 @@ index 7286db80e8b8..ff92e74cd684 100644
/* Note: caller must be prepared to deal with negative returns */
static inline int tcp_space(const struct sock *sk)
{
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf -
- atomic_read(&sk->sk_rmem_alloc));
- }
-
- static inline int tcp_full_space(const struct sock *sk)
- {
-+ if (mptcp(tcp_sk(sk)))
-+ sk = tcp_sk(sk)->meta_sk;
-+
- return tcp_win_from_space(sk->sk_rcvbuf);
- }
-
-@@ -1115,6 +1266,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
+@@ -1115,6 +1261,8 @@ static inline void tcp_openreq_init(struct request_sock *req,
ireq->wscale_ok = rx_opt->wscale_ok;
ireq->acked = 0;
ireq->ecn_ok = 0;
@@ -2280,7 +2266,7 @@ index 7286db80e8b8..ff92e74cd684 100644
ireq->ir_rmt_port = tcp_hdr(skb)->source;
ireq->ir_num = ntohs(tcp_hdr(skb)->dest);
}
-@@ -1585,6 +1738,11 @@ int tcp4_proc_init(void);
+@@ -1585,6 +1733,11 @@ int tcp4_proc_init(void);
void tcp4_proc_exit(void);
#endif
@@ -2292,7 +2278,7 @@ index 7286db80e8b8..ff92e74cd684 100644
/* TCP af-specific functions */
struct tcp_sock_af_ops {
#ifdef CONFIG_TCP_MD5SIG
-@@ -1601,7 +1759,32 @@ struct tcp_sock_af_ops {
+@@ -1601,7 +1754,33 @@ struct tcp_sock_af_ops {
#endif
};
@@ -2317,6 +2303,7 @@ index 7286db80e8b8..ff92e74cd684 100644
+ void (*time_wait)(struct sock *sk, int state, int timeo);
+ void (*cleanup_rbuf)(struct sock *sk, int copied);
+ void (*init_congestion_control)(struct sock *sk);
++ void (*cwnd_validate)(struct sock *sk, bool is_cwnd_limited);
+};
+extern const struct tcp_sock_ops tcp_specific;
+
@@ -2325,7 +2312,7 @@ index 7286db80e8b8..ff92e74cd684 100644
#ifdef CONFIG_TCP_MD5SIG
struct tcp_md5sig_key *(*md5_lookup) (struct sock *sk,
struct request_sock *req);
-@@ -1611,8 +1794,39 @@ struct tcp_request_sock_ops {
+@@ -1611,8 +1790,39 @@ struct tcp_request_sock_ops {
const struct request_sock *req,
const struct sk_buff *skb);
#endif
@@ -2572,20 +2559,20 @@ index 4db3c2a1679c..04cb17d4b0ce 100644
goto drop;
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
-index 05c57f0fcabe..811286a6aa9c 100644
+index 05c57f0fcabe..a1ba825c6acd 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -556,6 +556,38 @@ config TCP_CONG_ILLINOIS
For further details see:
http://www.ews.uiuc.edu/~shaoliu/tcpillinois/index.html
-+config TCP_CONG_COUPLED
-+ tristate "MPTCP COUPLED CONGESTION CONTROL"
++config TCP_CONG_LIA
++ tristate "MPTCP Linked Increase"
+ depends on MPTCP
+ default n
+ ---help---
-+ MultiPath TCP Coupled Congestion Control
-+ To enable it, just put 'coupled' in tcp_congestion_control
++ MultiPath TCP Linked Increase Congestion Control
++ To enable it, just put 'lia' in tcp_congestion_control
+
+config TCP_CONG_OLIA
+ tristate "MPTCP Opportunistic Linked Increase"
@@ -2618,8 +2605,8 @@ index 05c57f0fcabe..811286a6aa9c 100644
config DEFAULT_WESTWOOD
bool "Westwood" if TCP_CONG_WESTWOOD=y
-+ config DEFAULT_COUPLED
-+ bool "Coupled" if TCP_CONG_COUPLED=y
++ config DEFAULT_LIA
++ bool "Lia" if TCP_CONG_LIA=y
+
+ config DEFAULT_OLIA
+ bool "Olia" if TCP_CONG_OLIA=y
@@ -2637,7 +2624,7 @@ index 05c57f0fcabe..811286a6aa9c 100644
default "vegas" if DEFAULT_VEGAS
default "westwood" if DEFAULT_WESTWOOD
default "veno" if DEFAULT_VENO
-+ default "coupled" if DEFAULT_COUPLED
++ default "lia" if DEFAULT_LIA
+ default "wvegas" if DEFAULT_WVEGAS
+ default "balia" if DEFAULT_BALIA
default "reno" if DEFAULT_RENO
@@ -2815,7 +2802,7 @@ index c86624b36a62..0ff3fe004d62 100644
ireq->rcv_wscale = rcv_wscale;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
-index 9d2118e5fbc7..2cb89f886d45 100644
+index 9d2118e5fbc7..cb59aef70d26 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -271,6 +271,7 @@
@@ -2826,7 +2813,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
#include <net/tcp.h>
#include <net/xfrm.h>
#include <net/ip.h>
-@@ -371,6 +372,24 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
+@@ -371,6 +372,25 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
return period;
}
@@ -2846,12 +2833,13 @@ index 9d2118e5fbc7..2cb89f886d45 100644
+ .retransmit_timer = tcp_retransmit_timer,
+ .time_wait = tcp_time_wait,
+ .cleanup_rbuf = tcp_cleanup_rbuf,
++ .cwnd_validate = tcp_cwnd_validate,
+};
+
/* Address-family independent initialization for a tcp_sock.
*
* NOTE: A lot of things set to zero explicitly by call to
-@@ -419,6 +438,8 @@ void tcp_init_sock(struct sock *sk)
+@@ -419,6 +439,8 @@ void tcp_init_sock(struct sock *sk)
sk->sk_sndbuf = sysctl_tcp_wmem[1];
sk->sk_rcvbuf = sysctl_tcp_rmem[1];
@@ -2860,7 +2848,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
local_bh_disable();
sock_update_memcg(sk);
sk_sockets_allocated_inc(sk);
-@@ -726,6 +747,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+@@ -726,6 +748,14 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
int ret;
sock_rps_record_flow(sk);
@@ -2875,7 +2863,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
/*
* We can't seek on a socket input
*/
-@@ -821,8 +850,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
+@@ -821,8 +851,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
return NULL;
}
@@ -2885,7 +2873,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
{
struct tcp_sock *tp = tcp_sk(sk);
u32 xmit_size_goal, old_size_goal;
-@@ -872,8 +900,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
+@@ -872,8 +901,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
{
int mss_now;
@@ -2901,7 +2889,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
return mss_now;
}
-@@ -892,11 +925,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
+@@ -892,11 +926,32 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
* is fully established.
*/
if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
@@ -2935,7 +2923,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
mss_now = tcp_send_mss(sk, &size_goal, flags);
-@@ -1001,8 +1055,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
+@@ -1001,8 +1056,9 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
{
ssize_t res;
@@ -2947,7 +2935,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
return sock_no_sendpage(sk->sk_socket, page, offset, size,
flags);
-@@ -1018,6 +1073,9 @@ static inline int select_size(const struct sock *sk, bool sg)
+@@ -1018,6 +1074,9 @@ static inline int select_size(const struct sock *sk, bool sg)
const struct tcp_sock *tp = tcp_sk(sk);
int tmp = tp->mss_cache;
@@ -2957,7 +2945,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (sg) {
if (sk_can_gso(sk)) {
/* Small frames wont use a full page:
-@@ -1100,11 +1158,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1100,11 +1159,18 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
* is fully established.
*/
if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
@@ -2977,7 +2965,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (unlikely(tp->repair)) {
if (tp->repair_queue == TCP_RECV_QUEUE) {
copied = tcp_send_rcvq(sk, msg, size);
-@@ -1132,7 +1197,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1132,7 +1198,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;
@@ -2989,7 +2977,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
while (--iovlen >= 0) {
size_t seglen = iov->iov_len;
-@@ -1183,8 +1251,15 @@ new_segment:
+@@ -1183,8 +1252,15 @@ new_segment:
/*
* Check whether we can use HW checksum.
@@ -3006,7 +2994,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
skb->ip_summed = CHECKSUM_PARTIAL;
skb_entail(sk, skb);
-@@ -1422,7 +1497,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
+@@ -1422,7 +1498,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
/* Optimize, __tcp_select_window() is not cheap. */
if (2*rcv_window_now <= tp->window_clamp) {
@@ -3015,7 +3003,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
/* Send ACK now, if this read freed lots of space
* in our buffer. Certainly, new_window is new window.
-@@ -1587,7 +1662,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
+@@ -1587,7 +1663,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
/* Clean up data we have read: This will do ACK frames. */
if (copied > 0) {
tcp_recv_skb(sk, seq, &offset);
@@ -3024,7 +3012,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
}
return copied;
}
-@@ -1623,6 +1698,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1623,6 +1699,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
lock_sock(sk);
@@ -3039,7 +3027,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
err = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
goto out;
-@@ -1761,7 +1844,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1761,7 +1845,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
}
}
@@ -3048,7 +3036,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
/* Install new reader */
-@@ -1813,7 +1896,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
+@@ -1813,7 +1897,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
if (tp->rcv_wnd == 0 &&
!skb_queue_empty(&sk->sk_async_wait_queue)) {
tcp_service_net_dma(sk, true);
@@ -3057,7 +3045,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
} else
dma_async_issue_pending(tp->ucopy.dma_chan);
}
-@@ -1993,7 +2076,7 @@ skip_copy:
+@@ -1993,7 +2077,7 @@ skip_copy:
*/
/* Clean up data we have read: This will do ACK frames. */
@@ -3066,7 +3054,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
release_sock(sk);
return copied;
-@@ -2070,7 +2153,7 @@ static const unsigned char new_state[16] = {
+@@ -2070,7 +2154,7 @@ static const unsigned char new_state[16] = {
/* TCP_CLOSING */ TCP_CLOSING,
};
@@ -3075,7 +3063,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
{
int next = (int)new_state[sk->sk_state];
int ns = next & TCP_STATE_MASK;
-@@ -2100,7 +2183,7 @@ void tcp_shutdown(struct sock *sk, int how)
+@@ -2100,7 +2184,7 @@ void tcp_shutdown(struct sock *sk, int how)
TCPF_SYN_RECV | TCPF_CLOSE_WAIT)) {
/* Clear out any half completed packets. FIN if needed. */
if (tcp_close_state(sk))
@@ -3084,7 +3072,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
}
}
EXPORT_SYMBOL(tcp_shutdown);
-@@ -2125,6 +2208,11 @@ void tcp_close(struct sock *sk, long timeout)
+@@ -2125,6 +2209,11 @@ void tcp_close(struct sock *sk, long timeout)
int data_was_unread = 0;
int state;
@@ -3096,7 +3084,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
lock_sock(sk);
sk->sk_shutdown = SHUTDOWN_MASK;
-@@ -2167,7 +2255,7 @@ void tcp_close(struct sock *sk, long timeout)
+@@ -2167,7 +2256,7 @@ void tcp_close(struct sock *sk, long timeout)
/* Unread data was tossed, zap the connection. */
NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
tcp_set_state(sk, TCP_CLOSE);
@@ -3105,7 +3093,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
-@@ -2247,7 +2335,7 @@ adjudge_to_death:
+@@ -2247,7 +2336,7 @@ adjudge_to_death:
struct tcp_sock *tp = tcp_sk(sk);
if (tp->linger2 < 0) {
tcp_set_state(sk, TCP_CLOSE);
@@ -3114,7 +3102,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
NET_INC_STATS_BH(sock_net(sk),
LINUX_MIB_TCPABORTONLINGER);
} else {
-@@ -2257,7 +2345,8 @@ adjudge_to_death:
+@@ -2257,7 +2346,8 @@ adjudge_to_death:
inet_csk_reset_keepalive_timer(sk,
tmo - TCP_TIMEWAIT_LEN);
} else {
@@ -3124,7 +3112,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
goto out;
}
}
-@@ -2266,7 +2355,7 @@ adjudge_to_death:
+@@ -2266,7 +2356,7 @@ adjudge_to_death:
sk_mem_reclaim(sk);
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
@@ -3133,7 +3121,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
NET_INC_STATS_BH(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
}
-@@ -2291,15 +2380,6 @@ out:
+@@ -2291,15 +2381,6 @@ out:
}
EXPORT_SYMBOL(tcp_close);
@@ -3149,7 +3137,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
int tcp_disconnect(struct sock *sk, int flags)
{
struct inet_sock *inet = inet_sk(sk);
-@@ -2322,7 +2402,7 @@ int tcp_disconnect(struct sock *sk, int flags)
+@@ -2322,7 +2403,7 @@ int tcp_disconnect(struct sock *sk, int flags)
/* The last check adjusts for discrepancy of Linux wrt. RFC
* states
*/
@@ -3158,7 +3146,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
sk->sk_err = ECONNRESET;
} else if (old_state == TCP_SYN_SENT)
sk->sk_err = ECONNRESET;
-@@ -2340,6 +2420,13 @@ int tcp_disconnect(struct sock *sk, int flags)
+@@ -2340,6 +2421,13 @@ int tcp_disconnect(struct sock *sk, int flags)
if (!(sk->sk_userlocks & SOCK_BINDADDR_LOCK))
inet_reset_saddr(sk);
@@ -3172,7 +3160,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
sk->sk_shutdown = 0;
sock_reset_flag(sk, SOCK_DONE);
tp->srtt_us = 0;
-@@ -2632,6 +2719,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+@@ -2632,6 +2720,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;
case TCP_DEFER_ACCEPT:
@@ -3185,7 +3173,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
/* Translate value in seconds to number of retransmits */
icsk->icsk_accept_queue.rskq_defer_accept =
secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
-@@ -2659,7 +2752,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+@@ -2659,7 +2753,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
inet_csk_ack_scheduled(sk)) {
icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
@@ -3194,7 +3182,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
if (!(val & 1))
icsk->icsk_ack.pingpong = 1;
}
-@@ -2699,6 +2792,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
+@@ -2699,6 +2793,18 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
tp->notsent_lowat = val;
sk->sk_write_space(sk);
break;
@@ -3213,7 +3201,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
default:
err = -ENOPROTOOPT;
break;
-@@ -2931,6 +3036,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
+@@ -2931,6 +3037,11 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
case TCP_NOTSENT_LOWAT:
val = tp->notsent_lowat;
break;
@@ -3225,7 +3213,7 @@ index 9d2118e5fbc7..2cb89f886d45 100644
default:
return -ENOPROTOOPT;
}
-@@ -3120,8 +3230,11 @@ void tcp_done(struct sock *sk)
+@@ -3120,8 +3231,11 @@ void tcp_done(struct sock *sk)
if (sk->sk_state == TCP_SYN_SENT || sk->sk_state == TCP_SYN_RECV)
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
@@ -3299,7 +3287,7 @@ index 9771563ab564..5c230d96c4c1 100644
WARN_ON(req->sk == NULL);
return true;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
-index 40639c288dc2..3273bb69f387 100644
+index 40639c288dc2..71033189797d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -74,6 +74,9 @@
@@ -3391,7 +3379,7 @@ index 40639c288dc2..3273bb69f387 100644
- if (tp->rcv_ssthresh < tp->window_clamp &&
- (int)tp->rcv_ssthresh < tcp_space(sk) &&
+ if (meta_tp->rcv_ssthresh < meta_tp->window_clamp &&
-+ (int)meta_tp->rcv_ssthresh < tcp_space(sk) &&
++ (int)meta_tp->rcv_ssthresh < tcp_space(meta_sk) &&
!sk_under_memory_pressure(sk)) {
int incr;
@@ -5203,7 +5191,7 @@ index e68e0d4af6c9..ae6946857dff 100644
return ret;
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
-index 179b51e6bda3..efd31b6c5784 100644
+index 179b51e6bda3..267d5f7eb303 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -36,6 +36,12 @@
@@ -5559,6 +5547,15 @@ index 179b51e6bda3..efd31b6c5784 100644
/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
* As additional protections, we do not touch cwnd in retransmission phases,
+@@ -1402,7 +1448,7 @@ static void tcp_cwnd_application_limited(struct sock *sk)
+ tp->snd_cwnd_stamp = tcp_time_stamp;
+ }
+
+-static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
++void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
+ {
+ struct tcp_sock *tp = tcp_sk(sk);
+
@@ -1446,8 +1492,8 @@ static bool tcp_minshall_check(const struct tcp_sock *tp)
* But we can avoid doing the divide again given we already have
* skb_pcount = skb->len / mss_now
@@ -5680,7 +5677,17 @@ index 179b51e6bda3..efd31b6c5784 100644
/* Do MTU probing. */
result = tcp_mtu_probe(sk);
if (!result) {
-@@ -2099,7 +2150,8 @@ void tcp_send_loss_probe(struct sock *sk)
+@@ -2004,7 +2055,8 @@ repair:
+ /* Send one loss probe per tail loss episode. */
+ if (push_one != 2)
+ tcp_schedule_loss_probe(sk);
+- tcp_cwnd_validate(sk, is_cwnd_limited);
++ if (tp->ops->cwnd_validate)
++ tp->ops->cwnd_validate(sk, is_cwnd_limited);
+ return false;
+ }
+ return (push_one == 2) || (!tp->packets_out && tcp_send_head(sk));
+@@ -2099,7 +2151,8 @@ void tcp_send_loss_probe(struct sock *sk)
int err = -1;
if (tcp_send_head(sk) != NULL) {
@@ -5690,7 +5697,7 @@ index 179b51e6bda3..efd31b6c5784 100644
goto rearm_timer;
}
-@@ -2159,8 +2211,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
+@@ -2159,8 +2212,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
if (unlikely(sk->sk_state == TCP_CLOSE))
return;
@@ -5701,7 +5708,7 @@ index 179b51e6bda3..efd31b6c5784 100644
tcp_check_probe_timer(sk);
}
-@@ -2173,7 +2225,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
+@@ -2173,7 +2226,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now)
BUG_ON(!skb || skb->len < mss_now);
@@ -5711,7 +5718,7 @@ index 179b51e6bda3..efd31b6c5784 100644
}
/* This function returns the amount that we can raise the
-@@ -2386,6 +2439,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+@@ -2386,6 +2440,10 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
return;
@@ -5722,7 +5729,7 @@ index 179b51e6bda3..efd31b6c5784 100644
tcp_for_write_queue_from_safe(skb, tmp, sk) {
if (!tcp_can_collapse(sk, skb))
break;
-@@ -2843,7 +2900,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
+@@ -2843,7 +2901,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
/* RFC1323: The window in SYN & SYN/ACK segments is never scaled. */
th->window = htons(min(req->rcv_wnd, 65535U));
@@ -5731,7 +5738,7 @@ index 179b51e6bda3..efd31b6c5784 100644
th->doff = (tcp_header_size >> 2);
TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_OUTSEGS);
-@@ -2897,13 +2954,13 @@ static void tcp_connect_init(struct sock *sk)
+@@ -2897,13 +2955,13 @@ static void tcp_connect_init(struct sock *sk)
(tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
tp->window_clamp = tcp_full_space(sk);
@@ -5752,7 +5759,7 @@ index 179b51e6bda3..efd31b6c5784 100644
tp->rx_opt.rcv_wscale = rcv_wscale;
tp->rcv_ssthresh = tp->rcv_wnd;
-@@ -2927,6 +2984,36 @@ static void tcp_connect_init(struct sock *sk)
+@@ -2927,6 +2985,36 @@ static void tcp_connect_init(struct sock *sk)
inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
inet_csk(sk)->icsk_retransmits = 0;
tcp_clear_retrans(tp);
@@ -5789,7 +5796,7 @@ index 179b51e6bda3..efd31b6c5784 100644
}
static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
-@@ -3176,6 +3263,7 @@ void tcp_send_ack(struct sock *sk)
+@@ -3176,6 +3264,7 @@ void tcp_send_ack(struct sock *sk)
TCP_SKB_CB(buff)->when = tcp_time_stamp;
tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
}
@@ -5797,7 +5804,7 @@ index 179b51e6bda3..efd31b6c5784 100644
/* This routine sends a packet with an out of date sequence
* number. It assumes the other end will try to ack it.
-@@ -3188,7 +3276,7 @@ void tcp_send_ack(struct sock *sk)
+@@ -3188,7 +3277,7 @@ void tcp_send_ack(struct sock *sk)
* one is with SEG.SEQ=SND.UNA to deliver urgent pointer, another is
* out-of-date with SND.UNA-1 to probe window.
*/
@@ -5806,7 +5813,7 @@ index 179b51e6bda3..efd31b6c5784 100644
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
-@@ -3270,7 +3358,7 @@ void tcp_send_probe0(struct sock *sk)
+@@ -3270,7 +3359,7 @@ void tcp_send_probe0(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
int err;
@@ -5815,7 +5822,7 @@ index 179b51e6bda3..efd31b6c5784 100644
if (tp->packets_out || !tcp_send_head(sk)) {
/* Cancel probe timer, if it is not required. */
-@@ -3301,3 +3389,18 @@ void tcp_send_probe0(struct sock *sk)
+@@ -3301,3 +3390,18 @@ void tcp_send_probe0(struct sock *sk)
TCP_RTO_MAX);
}
}
@@ -7099,7 +7106,7 @@ index 000000000000..cdfc03adabf8
+
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
new file mode 100644
-index 000000000000..2feb3e873206
+index 000000000000..5c70e7cca3b3
--- /dev/null
+++ b/net/mptcp/Makefile
@@ -0,0 +1,21 @@
@@ -7113,7 +7120,7 @@ index 000000000000..2feb3e873206
+mptcp-y := mptcp_ctrl.o mptcp_ipv4.o mptcp_ofo_queue.o mptcp_pm.o \
+ mptcp_output.o mptcp_input.o mptcp_sched.o
+
-+obj-$(CONFIG_TCP_CONG_COUPLED) += mptcp_coupled.o
++obj-$(CONFIG_TCP_CONG_LIA) += mptcp_coupled.o
+obj-$(CONFIG_TCP_CONG_OLIA) += mptcp_olia.o
+obj-$(CONFIG_TCP_CONG_WVEGAS) += mptcp_wvegas.o
+obj-$(CONFIG_TCP_CONG_BALIA) += mptcp_balia.o
@@ -7126,7 +7133,7 @@ index 000000000000..2feb3e873206
+
diff --git a/net/mptcp/mptcp_balia.c b/net/mptcp/mptcp_balia.c
new file mode 100644
-index 000000000000..5cc224d80b01
+index 000000000000..565cb75e2cea
--- /dev/null
+++ b/net/mptcp/mptcp_balia.c
@@ -0,0 +1,267 @@
@@ -7156,8 +7163,9 @@ index 000000000000..5cc224d80b01
+ * if max_rate > 2^rate_scale_limit
+ */
+
-+static int rate_scale_limit = 30;
-+static int scale_num = 10;
++static int rate_scale_limit = 25;
++static int alpha_scale = 10;
++static int scale_num = 5;
+
+struct mptcp_balia {
+ u64 ai;
@@ -7210,7 +7218,6 @@ index 000000000000..5cc224d80b01
+ const struct tcp_sock *tp = tcp_sk(sk);
+ const struct mptcp_cb *mpcb = tp->mpcb;
+ const struct sock *sub_sk;
-+ int can_send = 0;
+ u64 max_rate = 0, rate = 0, sum_rate = 0;
+ u64 alpha = 0, ai = 0, md = 0;
+ int num_scale_down = 0;
@@ -7230,27 +7237,24 @@ index 000000000000..5cc224d80b01
+ if (!mptcp_balia_sk_can_send(sub_sk))
+ continue;
+
-+ can_send++;
-+
+ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
+ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
+ sum_rate += tmp;
+
++ if (tp == sub_tp)
++ rate = tmp;
++
+ if (tmp >= max_rate)
+ max_rate = tmp;
+ }
+
-+ /* No subflow is able to send - we don't care anymore */
-+ if (unlikely(!can_send))
++ /* At least, the current subflow should be able to send */
++ if (unlikely(!rate))
+ goto exit;
+
-+ rate = div_u64((u64)tp->mss_cache * tp->snd_cwnd *
-+ (USEC_PER_SEC << 3), tp->srtt_us);
+ alpha = div64_u64(max_rate, rate);
+
-+ /* Scale down max_rate from B/s to KB/s, MB/s, or GB/s
-+ * if max_rate is too high (i.e., >2^30)
-+ */
++ /* Scale down max_rate if it is too high (e.g., >2^25) */
+ while (max_rate > mptcp_balia_scale(1, rate_scale_limit)) {
+ max_rate >>= scale_num;
+ num_scale_down++;
@@ -7262,6 +7266,9 @@ index 000000000000..5cc224d80b01
+ struct tcp_sock *sub_tp = tcp_sk(sub_sk);
+ u64 tmp;
+
++ if (!mptcp_balia_sk_can_send(sub_sk))
++ continue;
++
+ tmp = div_u64((u64)tp->mss_cache * sub_tp->snd_cwnd
+ * (USEC_PER_SEC << 3), sub_tp->srtt_us);
+ tmp >>= (scale_num * num_scale_down);
@@ -7283,9 +7290,9 @@ index 000000000000..5cc224d80b01
+ if (unlikely(!ai))
+ ai = tp->snd_cwnd;
+
-+ md = ((tp->snd_cwnd >> 1) * min(mptcp_balia_scale(alpha, scale_num),
-+ mptcp_balia_scale(3, scale_num) >> 1))
-+ >> scale_num;
++ md = ((tp->snd_cwnd >> 1) * min(mptcp_balia_scale(alpha, alpha_scale),
++ mptcp_balia_scale(3, alpha_scale) >> 1))
++ >> alpha_scale;
+
+exit:
+ mptcp_set_ai(sk, ai);
@@ -16520,10 +16527,10 @@ index 000000000000..53f5c43bb488
+MODULE_VERSION("0.1");
diff --git a/net/mptcp/mptcp_output.c b/net/mptcp/mptcp_output.c
new file mode 100644
-index 000000000000..400ea254c078
+index 000000000000..e2a6a6d6522d
--- /dev/null
+++ b/net/mptcp/mptcp_output.c
-@@ -0,0 +1,1743 @@
+@@ -0,0 +1,1758 @@
+/*
+ * MPTCP implementation - Sending side
+ *
@@ -17181,11 +17188,9 @@ index 000000000000..400ea254c078
+ struct sock *subsk = NULL;
+ struct mptcp_cb *mpcb = meta_tp->mpcb;
+ struct sk_buff *skb;
-+ unsigned int sent_pkts;
+ int reinject = 0;
+ unsigned int sublimit;
-+
-+ sent_pkts = 0;
++ __u32 path_mask = 0;
+
+ while ((skb = mpcb->sched_ops->next_segment(meta_sk, &reinject, &subsk,
+ &sublimit))) {
@@ -17266,6 +17271,7 @@ index 000000000000..400ea254c078
+ * always push on the subflow
+ */
+ __tcp_push_pending_frames(subsk, mss_now, TCP_NAGLE_PUSH);
++ path_mask |= mptcp_pi_to_flag(subtp->mptcp->path_index);
+ TCP_SKB_CB(skb)->when = tcp_time_stamp;
+
+ if (!reinject) {
@@ -17276,7 +17282,6 @@ index 000000000000..400ea254c078
+ }
+
+ tcp_minshall_update(meta_tp, mss_now, skb);
-+ sent_pkts += tcp_skb_pcount(skb);
+
+ if (reinject > 0) {
+ __skb_unlink(skb, &mpcb->reinject_queue);
@@ -17287,6 +17292,22 @@ index 000000000000..400ea254c078
+ break;
+ }
+
++ mptcp_for_each_sk(mpcb, subsk) {
++ subtp = tcp_sk(subsk);
++
++ if (!(path_mask & mptcp_pi_to_flag(subtp->mptcp->path_index)))
++ continue;
++
++ /* We have pushed data on this subflow. We ignore the call to
++ * cwnd_validate in tcp_write_xmit as is_cwnd_limited will never
++ * be true (we never push more than what the cwnd can accept).
++ * We need to ensure that we call tcp_cwnd_validate with
++ * is_cwnd_limited set to true if we have filled the cwnd.
++ */
++ tcp_cwnd_validate(subsk, tcp_packets_in_flight(subtp) >=
++ subtp->snd_cwnd);
++ }
++
+ return !meta_tp->packets_out && tcp_send_head(meta_sk);
+}
+
@@ -17299,6 +17320,7 @@ index 000000000000..400ea254c078
+{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ struct tcp_sock *tp = tcp_sk(sk), *meta_tp = mptcp_meta_tp(tp);
++ struct sock *meta_sk = mptcp_meta_sk(sk);
+ int mss, free_space, full_space, window;
+
+ /* MSS for the peer's data. Previous versions used mss_clamp
@@ -17308,9 +17330,9 @@ index 000000000000..400ea254c078
+ * fluctuations. --SAW 1998/11/1
+ */
+ mss = icsk->icsk_ack.rcv_mss;
-+ free_space = tcp_space(sk);
++ free_space = tcp_space(meta_sk);
+ full_space = min_t(int, meta_tp->window_clamp,
-+ tcp_full_space(sk));
++ tcp_full_space(meta_sk));
+
+ if (mss > full_space)
+ mss = full_space;
@@ -18751,10 +18773,10 @@ index 000000000000..93278f684069
+MODULE_VERSION("0.89");
diff --git a/net/mptcp/mptcp_sched.c b/net/mptcp/mptcp_sched.c
new file mode 100644
-index 000000000000..6c7ff4eceac1
+index 000000000000..4a578821f50e
--- /dev/null
+++ b/net/mptcp/mptcp_sched.c
-@@ -0,0 +1,493 @@
+@@ -0,0 +1,497 @@
+/* MPTCP Scheduler module selector. Highly inspired by tcp_cong.c */
+
+#include <linux/module.h>
@@ -18979,8 +19001,12 @@ index 000000000000..6c7ff4eceac1
+ if (tp_it != tp &&
+ TCP_SKB_CB(skb_head)->path_mask & mptcp_pi_to_flag(tp_it->mptcp->path_index)) {
+ if (tp->srtt_us < tp_it->srtt_us && inet_csk((struct sock *)tp_it)->icsk_ca_state == TCP_CA_Open) {
++ u32 prior_cwnd = tp_it->snd_cwnd;
++
+ tp_it->snd_cwnd = max(tp_it->snd_cwnd >> 1U, 1U);
-+ if (tp_it->snd_ssthresh != TCP_INFINITE_SSTHRESH)
++
++ /* If in slow start, do not reduce the ssthresh */
++ if (prior_cwnd >= tp_it->snd_ssthresh)
+ tp_it->snd_ssthresh = max(tp_it->snd_ssthresh >> 1U, 2U);
+
+ dsp->last_rbuf_opti = tcp_time_stamp;
2014-09-17 22:19 [gentoo-commits] proj/linux-patches:3.16 commit in: / Anthony G. Basile
-- strict thread matches above, loose matches on Subject: below --
2014-12-16 17:29 Mike Pagano
2014-11-29 18:11 Mike Pagano
2014-11-29 18:05 Mike Pagano
2014-11-29 18:05 Mike Pagano
2014-11-29 18:05 Mike Pagano
2014-10-30 19:29 Mike Pagano
2014-10-15 12:42 Mike Pagano
2014-10-09 19:54 Mike Pagano
2014-10-07 1:34 Anthony G. Basile
2014-10-07 1:28 Anthony G. Basile
2014-10-06 11:39 Mike Pagano
2014-10-06 11:38 Mike Pagano
2014-10-06 11:16 Anthony G. Basile
2014-10-06 11:16 Anthony G. Basile
2014-09-27 13:37 Mike Pagano
2014-09-26 19:40 Mike Pagano
2014-09-22 23:37 Mike Pagano
2014-09-09 21:38 Vlastimil Babka
2014-08-26 12:16 Mike Pagano
2014-08-19 11:44 Mike Pagano
2014-08-14 11:51 ` Mike Pagano
2014-08-08 19:48 Mike Pagano
2014-08-19 11:44 ` Mike Pagano
2014-07-15 12:23 Mike Pagano
2014-07-15 12:18 Mike Pagano